There’s an old statistical phenomenon which states that in listings of numeric data, about 30% of the values should have a first digit of 1 (about 17.6% should start with 2, 12.5% with 3, and so on). The phenomenon is known as Benford’s law, named after the physicist Frank Benford, who described it in 1938. According to the all-knowing Wikipedia, however, “it had been previously stated by Simon Newcomb in 1881”. Some folks call the phenomenon the “first-digit law”, probably because they can’t remember Benford’s name (or maybe it’s because they think it should be named after Newcomb?).
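For the curious, those percentages come from a simple formula: the expected share of values with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch that prints the full table; the function name `benford_expected` is my own, just for illustration.

```python
import math

def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)

# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

The nine shares add up to 100%, since the logs telescope to log10(10).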
Testing the “Law”, because seeing is believing
I recently came across a blog post which suggested that Benford’s law might prove to be a useful tool when evaluating the quality of numeric data. For decades, financial auditors have been using Benford’s law to look for anomalies in financial data. And according to the Wolfram website, Benford’s law was used by the character Charlie Eppes as an analogy to help solve a series of high burglaries in the Season 2 “The Running Man” episode (2006) of the television crime drama NUMB3RS.
Numb3rs was one of my favorite shows (I’m a geek), and if Charlie Eppes can use Benford’s Law to solve crime, why not try to use it for evaluating the quality of data? I decided to give this a try, so I downloaded a couple of large data sets I found on an Open Data website. After performing some “substring” and “group by” analysis, I found that the distribution of the first digits of the financial numeric data lined up almost exactly with the percentages Benford’s Law predicts. Take a look at this graphic to see the results of my test.
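If you want to try the same kind of test without a database, the “substring” and “group by” steps can be approximated in a few lines of Python. This is only a sketch, not the exact analysis described above: the helper names are hypothetical and the sample amounts are made up purely to show the call.

```python
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    if value == 0:
        return None
    # The "substring" step: scan the text form for the first digit 1-9.
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None  # e.g. NaN or infinity

def first_digit_shares(values):
    """The "group by" step: share of values per leading digit (1-9)."""
    counts = Counter()
    for v in values:
        d = leading_digit(v)
        if d is not None:
            counts[d] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts[d] / total for d in range(1, 10)}

# Made-up amounts, only to illustrate usage.
sample = [120, 0.015, -19.5, 230, 0, 9]
for digit, share in first_digit_shares(sample).items():
    print(f"{digit}: {share:.0%}")
```

Zeros are skipped because they have no leading digit, and negative values are treated by their absolute value.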
But how does this relate to data quality?
One of the fundamental principles of data quality is to understand the data you have. Typically this is done by profiling your data, and using a profiling tool like SAS DataFlux makes doing this quite easy. The results of such profiling activities include things like counts of nulls, frequency distribution reports, min/max values, pattern frequency distributions, percentiles, etc. Wouldn’t it be awesome if profiling tools added distributions of the “first-digit” (along with comparisons to Benford’s Law) to their results?
What to do if your data doesn’t “line up” with Benford’s Law?
I struggled with this one a bit. Certain types of numeric data may just not comply with Benford’s Law; I guess it depends on what data you are working with. A good example might be a particular supplier’s pricing data, where most product prices could legitimately start with the same digit. I’m thinking that you can use Benford’s Law against numeric data in addition to other types of profile tests, but by no means should you declare your data to have significant quality issues just because it does not comply with this phenomenon. If your data does not “line up”, I’d suggest you investigate further, but don’t jump to conclusions without doing your homework.
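One way to decide where to start digging is to list the digits whose observed share strays furthest from the expected one. The sketch below assumes you already have observed shares as a dict of digit to fraction; the function names and the 5-point tolerance are arbitrary choices of mine, not an established standard.

```python
import math

def benford_gaps(observed):
    """Observed minus expected share for each leading digit 1-9.

    `observed` maps digit -> share (0.0 to 1.0); missing digits count as 0.
    """
    return {
        d: observed.get(d, 0.0) - math.log10(1 + 1 / d)
        for d in range(1, 10)
    }

def digits_to_review(observed, tolerance=0.05):
    """Digits whose share differs from Benford by more than `tolerance`.

    The default tolerance is an illustrative assumption, not a rule.
    """
    gaps = benford_gaps(observed)
    return [d for d, gap in gaps.items() if abs(gap) > tolerance]

# Hypothetical shares, roughly in line with Benford's Law.
observed = {1: 0.30, 2: 0.18, 3: 0.12, 4: 0.10, 5: 0.08,
            6: 0.07, 7: 0.06, 8: 0.05, 9: 0.04}
print(digits_to_review(observed))
```

A flagged digit is a prompt for a closer look at those records, not a verdict on the data.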
Some other things to consider
Have you ever heard of “psychological pricing” or “price ending”? This is the practice marketers and pricing strategists use where they make retail prices for items “a little less than a round number” (think $9.99 instead of $10). Maybe you could consider evaluating your retail price data to see if your pricing folks are complying with this practice?
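A quick way to check this is to look at the cents portion of each price. Here is a small sketch; the function names, the choice of .99 and .95 as the endings that count, and the sample prices are all my own assumptions.

```python
def cents_ending(price):
    """Cents portion of a price, e.g. 9.99 -> 99."""
    # Round to whole cents first to sidestep floating-point noise.
    return round(price * 100) % 100

def charm_price_share(prices, endings=(99, 95)):
    """Share of prices whose cents portion is one of `endings`."""
    if not prices:
        return 0.0
    hits = sum(1 for p in prices if cents_ending(p) in endings)
    return hits / len(prices)

# Made-up prices, only to illustrate usage.
prices = [9.99, 10.00, 4.95, 19.99]
print(f"{charm_price_share(prices):.0%} of prices use a 'just below' ending")
```

If your pricing folks say they follow the practice and the share comes back low, that is worth a conversation.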
What about evaluating the names of people in your mail lists? I’m sure a certain percentage of any list of American names should have a first name of John, and a certain percentage of last names should be Smith. In fact, the US Social Security Administration lists John as the fifth most popular name for men born in the US during the 1970s. Is John the fifth most popular name for men in this demographic within your data? If it isn’t (or isn’t close to fifth), why is that? Perhaps this statistical phenomenon should be called the “John Smith Law”, or maybe it already has a name.
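Checking where a name ranks in your own list takes only a frequency count. This sketch uses a hypothetical `name_rank` helper and a made-up list; it ignores case and surrounding spaces, and names with equal counts share a rank.

```python
from collections import Counter

def name_rank(names, target):
    """Popularity rank of `target` (1 = most common), or None if absent."""
    counts = Counter(n.strip().lower() for n in names if n and n.strip())
    key = target.strip().lower()
    if key not in counts:
        return None
    # Rank is one more than the number of names that occur more often.
    return 1 + sum(1 for c in counts.values() if c > counts[key])

# Made-up first names, only to illustrate usage.
first_names = ["John", "Mike", "mike", "Ann", "John ", "john", "Ann"]
print(name_rank(first_names, "John"))
```

You would then compare that rank with the published one for the same demographic.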
Psychological pricing, the “John Smith Law”, and Benford’s law are all similar for data quality purposes because analysis like this can be used to better understand your data and look for anomalies which you might not expect in “normal” data sets. Identification is only the first part. A good data quality analyst will not just raise the alarm that an anomaly exists; they will dig deeper and determine the appropriate steps to take next.
Hopefully this post has given you some insight on leveraging a little bit of statistics to further enhance your data management program. If anyone out there would like to share other use cases of how they use Benford’s Law in the real world, please post a comment. I’d love to hear about it. Until I hear otherwise, I can statistically assert that if you try to fight this law, chances are the law is going to win.
Until next time…Rich