I think the biggest issue with integrating external data into the organization (especially for business intelligence purposes) is data repurposing. It is one thing to share data for cross-organization business processes (such as brokering transactions between two trading partners), because those data exchanges are governed by well-defined standards. It is another thing when your organization taps into a data stream created for one purpose and uses the data for a different purpose, because there are no negotiated standards.
In the best of cases, you are working with some published metadata. In my previous post I referred to the public data at www.data.gov, and those data sets are sometimes accompanied by their data layouts or metadata. In the worst case, you are integrating a data stream with no provided metadata. In both cases, you, as the data consumer, must make some subjective judgments about how that data can be used.
For example, I was just checking out one of the public data sets, and there was a link to the “database record layout.” Following that link provided me with a two-column table with the first column labeled “Label” (i.e., column name) and the second “Character Limit.” Some of the columns had pretty generic names: “LASTNAME,” “FIRSTNAME,” “MIDNAME,” “BUSNAME,” “GENERAL,” etc. And while I might assume that BUSNAME means “business name,” that is really just an assumption – the (generally minimal) data layout descriptions provide little more information than what I just provided here.
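When a layout offers only labels and character limits, one way to learn more is to look at the values themselves. Here is a minimal sketch in Python of a basic column profile. Everything in it is an assumption made for illustration: the `profile` function, the character limits in `LAYOUT`, and the `SAMPLE` records are all invented, and none of them comes from the actual data set.

```python
from collections import Counter

# Hypothetical layout in the spirit of the one described above:
# label -> character limit. The limits are invented for illustration.
LAYOUT = {"LASTNAME": 20, "FIRSTNAME": 15, "BUSNAME": 30}


def profile(rows, layout):
    """Return simple per-column statistics for a list of dict records."""
    report = {}
    for label, limit in layout.items():
        # Treat missing keys, None, and whitespace-only strings as empty.
        values = [(row.get(label) or "").strip() for row in rows]
        filled = [v for v in values if v]
        report[label] = {
            "fill_rate": len(filled) / len(rows) if rows else 0.0,
            "distinct": len(set(filled)),
            "max_length": max((len(v) for v in filled), default=0),
            "over_limit": sum(1 for v in filled if len(v) > limit),
            "top_values": Counter(filled).most_common(3),
        }
    return report


# Made-up sample records, not taken from any real data set.
SAMPLE = [
    {"LASTNAME": "Smith", "FIRSTNAME": "Ann", "BUSNAME": ""},
    {"LASTNAME": "", "FIRSTNAME": "", "BUSNAME": "Acme Hardware"},
    {"LASTNAME": "Jones", "FIRSTNAME": "Bo", "BUSNAME": "Jones Consulting"},
]

if __name__ == "__main__":
    for label, stats in profile(SAMPLE, LAYOUT).items():
        print(label, stats)
```

A profile like this can hint at how a column such as BUSNAME is actually populated (how often it is filled, whether it is filled alongside the personal name columns), but it only supplies evidence. What the column means for my purpose is still my inference.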
Determining the usability of the data is truly subjective, though, and might have to be based on a profile of the data once we have figured out how to bring it into the organization. I might decide to use this data for a completely different purpose than the one originally intended, and then I have to infer the data set’s meaning for my own purpose. In other words, the subjective assessment of the usability (and quality) of the data must be based on my inference, not on the original intent. Yet some folks seem to get this difference mixed up when it comes to quality assurance. More on this in the post after next*…