The Big Data Theory
Mar 14, 2012 by Jim Harris in Big Data, Data Management, Data Quality
When the American radio astronomers Arno Penzias and Robert Wilson were setting up a new radio telescope at AT&T Bell Labs, they decided to point it towards deep space where they expected a silent signal that could be used to calibrate their equipment.
However, instead of silence, they heard a persistent noise, a seemingly meaningless background static that they initially mistook for a sign that their telescope was faulty and in need of repair.
For almost a year, they operated on this assumption. At one point, they pondered whether the cause of the static might be the excessive amount of pigeon poop accumulating on their telescope.
But even after spending a month meticulously cleaning it, when they pointed the telescope towards deep space, once again they heard the same persistent noise. (At which point, although it is not included in the official scientific record, I imagine stronger language than “poop” was uttered.)
However, after analyzing what they initially thought was the crappiest possible data produced by a broken telescope, they challenged their own assumptions. By doing so, they discovered that it was data of the highest possible quality, which revealed, in a classic example of mistaking signal for noise, one of the greatest scientific breakthroughs of twentieth-century physics.
Arno Penzias and Robert Wilson won the 1978 Nobel Prize in Physics for discovering what’s now known as cosmic microwave background radiation. In other words, in the Big Data raining down from Big Sky, they managed to hear the remnants of the Big Bang.
Penzias and Wilson helped the Big Bang Theory defeat its primary rival, the Steady State Theory, as the prevailing scientific model of the universe. The Big Data Theory is now challenging steady state theories that were the bedrock of the status quo within the data management industry for decades.
Although I don’t doubt the theoretical potential of big data, I remain cautiously optimistic about big data becoming the prevailing data model of the business universe because, when performing business analysis on data sets of any size, it isn’t always easy to tell the difference between a meaningful business insight and a data quality issue (or faulty equipment or a measurement calibration error).
As the excellent example of Arno Penzias and Robert Wilson demonstrated, it isn’t always easy to tell the difference between what we see and what we are looking for.
Big data will deliver more signals, not just more noise, but will we always be able to tell the difference?





Rich Murnane
Mar 14, 2012
Very thoughtful post Jim! I love the signal vs. noise metaphor. Best…Rich
Jim Harris
Mar 14, 2012
Thanks for your comment, Rich.
One of the reasons I like the Penzias and Wilson story is that it illustrates the opposite of the most common data quality debate regarding big data, namely that it will bring more noise than signal.
Most advocates of big data emphasize the value of outlier analysis (e.g., fraud in financial transactions) without acknowledging the possibility that outliers could be caused by data quality issues.
The Penzias and Wilson story presents the opposite challenge: what the entire data set (or the vast majority of it) represents is resisted as an insight because it contradicts the preconceptions of the people performing the analysis.
Best Regards,
Jim