In his blog post Big Data Quality: Think Outside the Box, Mark Troester explained that it doesn’t make sense to apply one data quality approach for all data. Instead, you should customize your data quality approach based on “where the data came from, how the data will be used, how the data will be consumed, who will use the data, and perhaps most importantly, what decisions will be made with the data.” I agree since it’s similar to what I call The First Law of Data Quality.
One of the many interesting big data points Troester made was about the quality of self-reported data, explaining how the “information that people self-report about medication taken, time spent studying, etc., is often misrepresented by the user (they intentionally fabricate the amount of meds taken, time spent studying, etc.).” This got me thinking about other examples of self-reported data, especially the volume and variety of user-generated data on the Internet that is being applied in business areas such as market segmentation, campaign effectiveness, consumer behavior, and sentiment analysis.
Consider the following examples:
- Are we really “friends” with the people we connect with or customers of the products and services we “like” on social networking websites?
- Do we really read the books we review on Amazon or the content we re-tweet on Twitter?
- When we complete an online survey (e.g., for a chance to win a new iPad), do we honestly answer questions like “Annual family income”?
- Do all of the job titles and keywords in our LinkedIn profile reflect our actual professional experience? (Just in case anyone asks, I was a Vice President at Vandelay Industries.)
- When we sign up for a free trial of a web service or download a white paper, do we provide an active email address or select the country we actually live in from the drop-down list?
Evaluating the accuracy of any type of data can be challenging, but the accuracy of self-reported data can be further complicated by the lies we tell data. Well, maybe we don’t tell lies, but at the very least we have to admit that the truthiness of our self-reported data makes it rather quality-ish.
Self-reported data can still be valuable for business applications, but, as Troester recommended, we have to customize our data quality approach for it. Traditional data quality techniques used with other data types might not work as well with self-reported data.
And you know that’s the truth because a data quality blog post never lies.