Data migration projects are an excellent vehicle for helping companies shine a megawatt-sized spotlight on the places where their data quality has historically been ignored.
Often the target system has been acquired, and it is only during the migration project itself that all those wonders of legacy computing are unravelled. This is why I always plead with companies to get started with data profiling and discovery even before the project is formally signed off, because you never have enough time to do the job completely. If you want to create a Zero Defect Data Migration, however, you’ll be glad of that extra analysis.
So once you’ve got started with that initial data profiling and discovery, issues will begin to surface. They will span the full spectrum of data quality dimensions and force you to make many decisions, but the biggest decision (if you need to retain the data) will invariably be: “When and where do we cleanse?”
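Even a rough profiling pass surfaces these issues quickly. Here is a minimal sketch of what that first pass might look like, assuming a flat CSV extract of a legacy table (the file and column names are purely illustrative): it measures how complete and how varied each field really is.

```python
import csv
from collections import Counter, defaultdict

def profile(path):
    """Report completeness and distinct-value counts for each column of a CSV extract."""
    filled = Counter()           # count of non-empty values per column
    distinct = defaultdict(set)  # distinct values per column (fine for modest extracts)
    total = 0

    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        for row in reader:
            total += 1
            for col in columns:
                value = (row.get(col) or "").strip()
                if value:
                    filled[col] += 1
                    distinct[col].add(value)

    for col in columns:
        pct = 100.0 * filled[col] / total if total else 0.0
        print(f"{col}: {pct:.1f}% populated, {len(distinct[col])} distinct values")

# Hypothetical legacy extract; point this at whatever your source system can export
profile("legacy_equipment_extract.csv")
```

Even crude numbers like these give you concrete evidence to take into the cleansing debate.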
In this situation you often find some heavy-handed policy-making at play. I’ve seen large, complex and mission-critical projects kick off with the mantra:
“There is no time for cleansing, we’ll take the data as-is.”
This “let’s fix it in the target” approach is flawed because, in effect, the project leader was saying:
“We’ll force the business users to cleanse the data after we’ve gone live, when they’ve already got their hands full learning the new system and operating procedures.”
So one of the first things I would say is that you want to be improving your data in the most practical environment possible. Is that the target immediately after go-live? You decide.
Next comes a policy of “fixing all the data in the staging area.” The staging area is a kind of no-man’s land where the data sits prior to migration. It lets you transform the legacy data into something closer to the target structures so that the final load can be validated and carried out more easily.
The problem with cleansing data in the staging area is managing consistency and audit trails. If the migration load fails and a re-export of legacy data is required, you need to know exactly which edits were made so they can be re-applied to the fresh data. Obviously there are advantages to cleansing in the staging area: because staging areas are offline storage and processing environments, you’re not impacting either the legacy or the target system.
The other big problem with staging areas, however, is that for many companies they’re impractical from a timing perspective. They simply can’t afford the delay of exporting data to a temporary system, processing it, validating it and then finally loading it. During that window the legacy data could have changed (think 24/7 operations), forcing a complicated refresh in the target to keep everything in sync.
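If you do cleanse in staging, one way to keep the audit-trail problem manageable is to record every edit as a replayable entry rather than patching rows by hand. A minimal sketch of the idea, assuming simple key/value staging rows (the row structure, rule names and file name are illustrative, not tied to any particular tool):

```python
import json
from datetime import datetime, timezone

audit_log = []  # every edit made in staging, recorded so it can be replayed after a re-export

def apply_edit(rows, key_field, key, field, new_value, rule):
    """Apply one cleansing edit to the staged rows and record it in the audit log."""
    for row in rows:
        if row.get(key_field) == key:
            audit_log.append({
                "applied_at": datetime.now(timezone.utc).isoformat(),
                "key_field": key_field,
                "key": key,
                "field": field,
                "old": row.get(field),
                "new": new_value,
                "rule": rule,
            })
            row[field] = new_value

# Hypothetical staged customer rows exported from the legacy system
staged = [{"cust_id": "C001", "country": "GB"}, {"cust_id": "C002", "country": ""}]
apply_edit(staged, "cust_id", "C002", "country", "GB", "default missing country to GB")

# Persist the log; replaying it against a fresh legacy export re-applies the same fixes
with open("staging_edit_log.json", "w") as f:
    json.dump(audit_log, f, indent=2)
```

Because the log captures the key, the field, the old and new values and the rule behind each change, the same fixes can be re-applied automatically if a fresh export becomes necessary.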
An alternative to offline staging areas is to use your data migration software to manage the cleansing activities “in-flight”, as we say, or online. Using this approach you create a series of lookups or cleansing routines that “fix” the data during the actual migration execution.
This is a great approach for speed and audit trailing because you have an accurate view of exactly which logic has been applied. The problem is that, because the source data is constantly changing, you can never be entirely sure what results you’ll get. Still, it is a viable approach, particularly as there is a huge amount of data processing taking place during the migration anyway and most data quality functions carried out by tools are transformations of some sort.
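As a minimal sketch of what an in-flight rule might look like (the lookup values and field names are hypothetical, not taken from any specific migration tool), each record is corrected via a lookup as it flows through, and every applied rule is captured for the audit trail:

```python
# Hypothetical lookup mapping legacy status codes to the values the target expects
STATUS_LOOKUP = {"A": "ACTIVE", "I": "INACTIVE", "": "UNKNOWN"}

applied_rules = []  # audit trail of exactly which logic touched which record

def transform(record):
    """Cleanse a legacy record 'in-flight' as it is mapped into the target structure."""
    target = dict(record)
    raw_status = (record.get("status") or "").strip().upper()
    target["status"] = STATUS_LOOKUP.get(raw_status, "UNKNOWN")
    if target["status"] != record.get("status"):
        applied_rules.append((record["cust_id"], "status_lookup",
                              record.get("status"), target["status"]))
    return target

# Hypothetical legacy rows flowing through the migration run
legacy_rows = [{"cust_id": "C001", "status": "a"}, {"cust_id": "C002", "status": None}]
migrated = [transform(row) for row in legacy_rows]
print(migrated)        # records in target form
print(applied_rules)   # accurate view of the cleansing logic applied during this run
```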
Finally, we have the “cleanse in legacy” option. Here you resolve issues in the original source systems. Perhaps your old system didn’t need to retain the warranty ID of each piece of equipment, so the field was only 80% complete. In the new system a warranty ID is mandatory, so a value needs to be present. Back you go over the old records, updating the missing data.
Fixing data in legacy often makes a great deal of sense because you’re giving value back to the current users of the system. Some migrations can take months, even years, so by improving quality in the legacy systems you’re driving tangible value back into the business over a long period.
The problem with legacy data cleansing is that you may inadvertently break the old processes by modifying the data. For example, the new target system may not permit non-numeric characters in the telephone field, but the legacy system used a “*” prefix to denote a mobile number. Remove the prefix and the old customer text-alert system will fail.
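In a case like that it is often safer to preserve the meaning in the migration mapping than to edit the legacy field itself. A minimal sketch, assuming the target has a separate mobile-indicator field (the field names are assumptions, not from any particular system):

```python
def map_telephone(legacy_phone):
    """Map a legacy '*'-prefixed phone value into the target's fields.

    The '*' prefix is stripped for the target telephone field, but its meaning is
    preserved in a separate mobile-indicator field. The legacy record itself is
    never edited, so the old text-alert process keeps working until cutover.
    """
    phone = (legacy_phone or "").strip()
    is_mobile = phone.startswith("*")
    return {"telephone": phone.lstrip("*").strip(), "is_mobile": is_mobile}

print(map_telephone("*07700 900123"))  # {'telephone': '07700 900123', 'is_mobile': True}
print(map_telephone("01632 960000"))   # {'telephone': '01632 960000', 'is_mobile': False}
```

The design choice here is to move the information rather than delete it, which sidesteps the risk of breaking legacy processes while still satisfying the target’s rules.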
When and where to cleanse your data in a data migration is not straightforward, so don’t publish carte blanche policies. Profile your data early and examine the different types of issue. What you’ll find is that multiple approaches will be required. Sometimes, for instance, it really does make sense to cleanse the data in the target, and sometimes you absolutely cannot fix the data in the legacy system.
Above all, however, don’t ignore the need for upfront data quality analysis to help you determine the cleansing strategy for each issue that you find. Never let anyone tell you that this process is optional, too costly or too time-consuming. Cutting time and cost on activities like this is a false economy and the fast track to migration failure.