Tag Archives: Data Enrichment
Mar 27, 2012 by David Loshin
At the end of my last blog post, I posed a question about the potential failure to observe data policies when data items are shared multiple times, or are shared outside of the administrative domain. In the interim between writing that last entry and this one, though, I shared my thoughts with a colleague, who pointed out that data supposedly collected for one purpose is very often used to infer new pieces of information and knowledge. But what happens if that new piece of knowledge is, by virtue of the inference, an exposure of protected information that was not inherent in any of the original data sets?
Sep 12, 2011 by Rich Murnane
The U.S. Bureau of Labor Statistics (BLS) is the federal agency responsible for “measuring labor market activity, working conditions, and price changes in the economy.” I stumbled upon a recently posted report titled “Job Openings and Labor Turnover Summary,” which I’m very excited to share with you.
Jul 21, 2011 by Phil Simon
In a way, it’s easy to work in sales, R&D, marketing, and other “top line” departments. After all, when times are good, people like to brag about how sales have increased, they’ve created innovative new products, and customers are buzzing about their latest commercials and campaigns. (Of course, times aren’t always good, but that’s a separate discussion.)
Jul 05, 2011 by David Loshin
The last post considered a business justification for determining when a single entity is using multiple identities, and we came to the conclusion that traditional householding was the right approach for this type of analysis. In this post, though, I want to consider the business impacts associated with MD2s, in which multiple entities share a single identity.
Jun 16, 2011 by Phil Simon
Often specialists get caught up in what they’re doing and miss the big picture. Techies sometimes focus on the technology, not the business. Ditto some data management professionals, marketers, and HR folks. It’s important to remember that it’s always about the business, not the “other thing.” We all serve the business. If something or someone is not serving it well, then changes must be made.
Nov 18, 2010 by Phil Simon
For better or (mostly) worse, in my professional career, I have consistently found myself on projects suffering from a bevy of issues, many of which were related to data. By 2008, I had reached a tipping point: I was either going to write a book about IT project failures or see a shrink. I chose the former.
In other words, it’s rare that, as a consultant, I have the ability to influence the direction of an organization’s data management. These days, however, I find myself in just such a place. The details of my project aren’t particularly interesting to the average reader. For now, suffice it to say that I am building a little ETL tool that takes a bunch of data from a bunch of places, transforms it, and spits it out to a bunch of people. I’d give this about a 4 on my 1-10 scale for complexity. (Yes, I have had to build tools that scored a 14 on that same 1-10 scale. Take me out for a beer sometime and I’ll tell you a story or two.)
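For the curious, a minimal sketch of that sort of extract-transform-load flow in Python might look like the following. The CSV sources, field names, and output file here are hypothetical stand-ins; the actual tool isn’t described in the post.

```python
import csv
from pathlib import Path

def extract(paths):
    """Pull rows out of each CSV source ("a bunch of places")."""
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

def transform(rows):
    """Apply a couple of illustrative cleanups to hypothetical fields."""
    for row in rows:
        row["name"] = row.get("name", "").strip().title()
        row["amount"] = float(row.get("amount") or 0)
        yield row

def load(rows, out_path):
    """Write the combined result to a single file ("a bunch of people")."""
    rows = list(rows)
    if not rows:
        return
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical layout: every input file lives in a data/ directory.
load(transform(extract(Path("data").glob("*.csv"))), "combined.csv")
```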
Oct 14, 2010 by Phil Simon
Times are tight all over, especially in the banking world. Let’s just say that my phone hasn’t been ringing off the hook with calls from financial institutions over the last few years. I suspect I’m not alone here. All companies are trying to save money these days. Most that are looking for consultants want to find those with the lowest possible rates.
Ho hum… What else is new?
Well, despite my rants on this site, on occasion a company does the right thing. It focuses on long-term data management, finding the right resource to help them get from point A to point B. This happened to me just a few hours after I returned from IDEAS 2010. (I’ll address some of the lessons learned from the conference in subsequent posts.)
May 04, 2010 by David Loshin
There is a general perception that by installing and populating an MDM tool, the organization immediately benefits from the consolidation of multiple representations of data into a single “golden record.” Also referred to as a “single source of truth,” this concept suggests that a byproduct of data consolidation is the materialization of one representation whose quality and correctness exceed that of any other representation for any application purpose.
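To make that perception concrete, here is a deliberately naive consolidation sketch in Python: duplicate records arrive as dictionaries, and a simple majority vote per field produces the “golden record.” The field names and tie-breaking rule are illustrative assumptions, and the post’s caution applies: nothing about a mechanical merge like this guarantees the surviving values are the best ones for every application purpose.

```python
from collections import Counter

def naive_golden_record(records):
    """Merge duplicate records by majority vote per field.

    Ties fall back to the first value seen; field names are hypothetical.
    """
    fields = sorted({key for rec in records for key in rec})
    golden = {}
    for field in fields:
        values = [rec[field] for rec in records if rec.get(field)]
        if values:
            golden[field] = Counter(values).most_common(1)[0][0]
    return golden

duplicates = [
    {"name": "Jon Smith",  "city": "Boston",    "phone": "555-0100"},
    {"name": "John Smith", "city": "Boston",    "phone": None},
    {"name": "John Smith", "city": "Cambridge", "phone": "555-0199"},
]
print(naive_golden_record(duplicates))
# {'city': 'Boston', 'name': 'John Smith', 'phone': '555-0100'}
```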
Apr 13, 2010 by David Loshin
When two records have been determined to refer to the same real-world entity, it means that for some reason a duplicate version of what should be a unique record has been introduced. Whether it is due to a merging of data sets from different sources or to the absence of controls preventing duplicates from being entered is irrelevant if the business objective is to resolve multiple records into a single representation. The challenge: when faced with two (or possibly more) versions of information supposedly representing the same entity, how do you determine which values from which records will be carried over into the unified record?
The answer to this question reflects a philosophical standpoint on “correcting bad data,” namely whether one should ever delete or overwrite values assumed to be incorrect. From one perspective, inconsistent data has impacts on downstream business users, and reducing or eliminating inconsistency leads to improved business processes. From the other perspective, any piece of information is valuable, and deleting one version of a person’s name or a product description because it doesn’t match another version means the deleted version is lost for good. Therefore, in some situations a “best” record is created and used to update all identified duplicates, while in other situations all the values are retained and the “best” version is materialized on demand.
This process of determining the “best” values is called survivorship. Of course, if all the values are the same, survivorship is not questioned. But when there are variant data values, survivorship decisions must be related to additional measures of quality. These quality measures are a function of three contextual factors: the quality of the data source from which the record comes, the measure of quality of the record (i.e., based on defined dimensions), and the quality of the specific values.
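As an illustration, a minimal sketch of how those three factors might combine into a survivorship rule follows. The multiplicative scoring, the 0-to-1 quality measures, and the field names are assumptions made for the example, not a prescribed formula; in practice the value-level measure would draw on the defined quality dimensions rather than a simple non-blank check.

```python
def survive(candidates):
    """Choose surviving values from duplicate records, field by field.

    Each candidate is (record, source_quality, record_quality), where the
    quality scores are assumed to be 0-to-1 measures assessed upstream.
    """
    fields = sorted({key for rec, _, _ in candidates for key in rec})
    best = {}
    for field in fields:
        scored = []
        for rec, source_q, record_q in candidates:
            value = rec.get(field)
            if value is None:
                continue
            # Crude value-level measure: non-blank counts as trustworthy.
            value_q = 1.0 if str(value).strip() else 0.0
            scored.append((source_q * record_q * value_q, value))
        if scored:
            best[field] = max(scored, key=lambda pair: pair[0])[1]
    return best

candidates = [
    ({"name": "J. Smith",   "dob": "1970-02-01"}, 0.6, 0.9),  # e.g., legacy CRM
    ({"name": "John Smith", "dob": None},         0.9, 0.8),  # e.g., billing system
]
print(survive(candidates))  # {'dob': '1970-02-01', 'name': 'John Smith'}
```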