Data attributes that have misleading names are like ticking time bombs awaiting the wrong scenario. While we were considering ideas for mitigating an existing issue related to data attributes whose names did not correctly describe the values they held, one of my colleagues noted that the problem was much more insidious than bad naming. Changing the attributes’ names would not work, since apparently numerous applications had been designed based on those same mistaken assumptions.
Assessing the actual impact of the problem was beyond what could be identified through data profiling. Any time the attribute’s value is accessed within an application, it might be compared with one or more values that are coded directly into the application. For example, let’s consider an attribute called “product_category,” which contains a value between 0 and 6, and a separate reference table called categories that holds 7 values labeled 0 through 6, each with a separate product type and description. The “product_category” field might be presumed to indicate the type of product the record represents (classifying the product set into the 7 categories listed in the reference table), but in fact the field captures data about the factory at which that product line is manufactured.
So here is the problem: somewhere in the code, a programmer might be testing for “product_category” = 0, thinking that the attribute refers to the reference table without realizing that the two value sets are distinct. The result is that we have a metadata dependency that is directly embedded within the application. This hard-coded metadata is impossible to track, since there are no referential integrity constraints between data tables and program code.
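To make the embedded dependency concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the table contents, the record layout, and the function name `is_widget` are assumptions, not taken from any real system.

```python
# Hypothetical data, invented for illustration only.
CATEGORIES = {0: "widgets", 1: "gadgets", 2: "fasteners"}  # reference table (subset)
FACTORIES = {0: "Plant A", 1: "Plant B", 2: "Plant C"}     # what the field really encodes


def is_widget(record):
    # Hard-coded metadata: the literal 0 silently assumes that
    # product_category indexes the categories reference table.
    return record["product_category"] == 0


# A gadget manufactured at Plant A (factory code 0).
record = {"sku": "X-100", "product_type": "gadgets", "product_category": 0}

# The check passes even though the record is not a widget,
# because the field actually holds the factory code.
print(is_widget(record), FACTORIES[record["product_category"]])
```

Nothing in the database ties the literal `0` in `is_widget` to either value set, so neither a schema change nor a data profile would reveal the mismatch.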
The upshot is that any time you want to directly harmonize the name of an attribute with its use, you need a blanket assessment not just of the locations where the attribute is used, but also of how the attribute is presumed to be used. We are back at the same problem we started with: the identification of a potential problem that is masked by nomenclature.
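The first half of that assessment, locating where the attribute is compared against hard-coded values, can be partly automated. The sketch below assumes the applications are written in Python and uses the standard `ast` module (Python 3.9 or later for `ast.unparse`). The function names and the sample source are hypothetical. It only finds candidate locations; deciding what each programmer presumed the value to mean still takes human review.

```python
import ast


def _mentions(node, attribute):
    """True if the expression refers to `attribute` by name, field, or key."""
    for sub in ast.walk(node):
        if isinstance(sub, ast.Name) and sub.id == attribute:
            return True
        if isinstance(sub, ast.Attribute) and sub.attr == attribute:
            return True
        if isinstance(sub, ast.Constant) and sub.value == attribute:
            return True
    return False


def _is_literal(node):
    """True for a constant, or a tuple/list/set made only of constants."""
    if isinstance(node, ast.Constant):
        return True
    if isinstance(node, (ast.Tuple, ast.List, ast.Set)):
        return all(_is_literal(element) for element in node.elts)
    return False


def find_literal_comparisons(source, attribute):
    """Return sorted (line, code) pairs comparing `attribute` to a literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        uses_attribute = any(_mentions(op, attribute) for op in operands)
        # Require a literal operand that is not itself the attribute name.
        has_literal = any(
            _is_literal(op) and not _mentions(op, attribute) for op in operands
        )
        if uses_attribute and has_literal:
            hits.append((node.lineno, ast.unparse(node)))
    return sorted(hits)


# Hypothetical application code to scan.
SAMPLE = '''
def route(record):
    if record["product_category"] == 0:
        return "widgets"
    if record.product_category in (1, 2):
        return "other"
    return record["product_category"]
'''

for line, code in find_literal_comparisons(SAMPLE, "product_category"):
    print(line, code)
```

A scan like this misses values passed through variables, constants defined elsewhere, and comparisons made in SQL or configuration files, so it gives a starting list for review rather than a complete inventory.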