Someone asked me the following question this week: “Where is the best place to start data profiling for the first time?”
One obvious answer is “where most data quality issues are,” but that’s not much help when you’re starting out, so the advice I gave was to “check your data plumbing.”
Several years ago we bought a “fixer-upper” in a bid to build the house of our dreams. 3 years on (and 3 years behind schedule) we eventually got around to the plumbing. In short, it’s a nightmare. Every 5 minutes the plumber comes thundering down the ladder with those chilling words “… you’ll never <expletive> believe what the previous owners have done in the loft/kitchen/lounge/dining room…”
You get the picture…
The reason I tell people to inspect their data plumbing first is that, like my ageing home plumbing, people take shortcuts over time. As the years go by, more and more temporary fixes and connections are added to the data landscape, and issues start to creep in.
In plumbing, problems occur where you join one pipe to another pipe or system. This creates a point of weakness because the interface between the two connections will eventually deteriorate.
This happens in data plumbing all the time.
ETL specifications become out of date. Data dictionaries lose touch with reality. Coding hacks and temporary fixes become permanent fixtures. Over time changes to data rules, application designs and feed formats take a toll on the quality of data flowing across these “plumbing joints.”
Given the high volumes of data we now have, it is often difficult for companies to spot issues across these connection points. A lot of companies I’ve worked with in the past have only performed rudimentary record counts and sporadic auditing, making it very easy for problems to slip through unnoticed.
Data profiling can help in these situations. By examining data across all sides of these data connection points you can:
- Determine accurate metadata and data quality rules so you can compare and update documented standards (or create them for the first time!)
- Identify trends in quality over time, detecting the cause and date of failure
- Create accurate specifications for ongoing data quality management/monitoring (you won’t necessarily use a data profiling tool for this but it’s a great starting point)
- Highlight the need for data quality management by demonstrating the impact of ignorance
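To make the idea concrete, here is a minimal sketch of what profiling both sides of a connection point can look like. The feed, column names, and data are entirely hypothetical; the point is simply that a few cheap per-column statistics (null rate, distinct count, max length) catch problems that raw record counts miss.

```python
import csv
import io

def profile(rows, columns):
    """Compute simple per-column profile stats: null rate, distinct values, max length."""
    stats = {}
    total = len(rows)
    for col in columns:
        values = [r[col] for r in rows if r.get(col) not in (None, "")]
        stats[col] = {
            "rows": total,
            "null_rate": round(1 - len(values) / total, 2) if total else 0.0,
            "distinct": len(set(values)),
            "max_length": max((len(v) for v in values), default=0),
        }
    return stats

def load(text):
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical data on either side of a "plumbing joint" (a feed into a warehouse):
source_csv = "customer_id,postcode\n101,SW1A 1AA\n102,EC1A 1BB\n103,\n"
target_csv = "customer_id,postcode\n101,SW1A 1AA\n102,EC1A 1BB\n"

src = profile(load(source_csv), ["customer_id", "postcode"])
tgt = profile(load(target_csv), ["customer_id", "postcode"])

# Compare both sides: a dropped record or a shifting null rate is an early warning
# that something in the joint is deteriorating.
for col in src:
    if src[col] != tgt[col]:
        print(f"{col}: source={src[col]} target={tgt[col]}")
```

In practice a profiling tool does this across every column and every connection point, and tracks the statistics over time so you can see when a joint started leaking, not just that it has.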
The list goes on, and maybe you can add some more of your own in the comments below, but my main point is that if you’re looking for a great starting point for your data profiling tool, then look for where the cracks are forming in your data plumbing. You can quickly pick a rich crop of low-hanging fruit and get that data quality awareness campaign off to a flying start.