In my last entry I started to ask about the difference between the current view of “proactive data quality,” by which we really mean “early reactivity to known errors,” and truly proactive data quality, in which we anticipate, find, and eliminate errors latent within our existing contexts. I gave two examples of error detection.
The first case was an unanswered mailed invitation sent to the wrong address. The curious thing about this error is its longevity. The address on file had been wrong for five years. It was no more wrong this year than it was in 2006, but the error was irrelevant until the data was accessed and used. In other words, if we were to look at the frequency of use of the data, we might infer (although at this point without any justification) that data values untouched for a long time are more likely to be incorrect.

In the second case it was not the data at fault but the process. Moreover, I already knew the process failed on occasion, which is the reason for the intermittent manual review.
Basically, in both cases there are characteristics of the data and its associated processes that are prone to some kind of failure. For the mail scenario, there are known statistics on the rate at which people change addresses; coupled with how infrequently the data is accessed, those statistics could feed models that suggest several different probabilities of error (a small sketch after the list below makes this concrete). Likewise for the spam filter failures: we know that some good emails get flagged as spam, just as plenty of spam slips past the filter uncaught. Some potential areas for investigation include:
- The probability that at least one error exists within a data set
- How long errors persist within a data set
- The probability that any randomly selected record has an error
- The probability that a specific record has an error
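To make the address example concrete, here is a minimal sketch of the kind of model suggested above. Everything in it is illustrative: the 12% annual move rate is a stand-in for a real published statistic, the function names are mine, and the model assumes that moves and record errors occur independently.

```python
# Minimal sketch of two of the probabilities listed above. The annual
# move rate below is an illustrative stand-in, not an official figure.

def probability_address_stale(years_since_verified: float,
                              annual_move_rate: float = 0.12) -> float:
    """Chance a specific address record is stale, i.e. the addressee has
    moved at least once since verification, treating each year as an
    independent trial."""
    return 1.0 - (1.0 - annual_move_rate) ** years_since_verified


def probability_any_record_in_error(per_record_error_rate: float,
                                    record_count: int) -> float:
    """Chance that at least one record in a data set is in error,
    assuming errors occur independently across records."""
    return 1.0 - (1.0 - per_record_error_rate) ** record_count


if __name__ == "__main__":
    # The invitation example: an address untouched for five years.
    for years in (1, 3, 5):
        p = probability_address_stale(years)
        print(f"unverified for {years} year(s): ~{p:.0%} chance the address is stale")

    # The set-level view: even a tiny per-record error rate compounds.
    p_set = probability_any_record_in_error(0.0001, 10_000)
    print(f"10,000 records at 0.01% each: ~{p_set:.0%} chance the set has an error")
```

Under these assumptions, the five-year-old address is stale nearly half the time, which is a long way from a rare fluke. The same arithmetic applies to the spam filter: published false positive and false negative rates, multiplied by daily message volume, yield an expected count of misfiled emails per day.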
The question you might ask now is this: what information external to a data set influences the occurrence of errors within it?