When and Where to Cleanse in Data Migration

When and Where to Cleanse in Data Migration

Jul 06, 2012 by in Data Migration, Data Quality

Data migration projects are an excellent vehicle for helping companies shine a huge, megawatt-sized lantern on where their data quality has historically been ignored.

Often the target system has been acquired and only during the migration project itself are all those wonders of legacy computing unravelled. This is why I always plead with companies to get started with data profiling and discovery even before the project is formally signed off because you never have enough time to do the job completely. If you want to create a Zero Defect Data Migration however, you’ll be glad of that extra analysis.

So once you’ve got started with the initial data profiling and discovery you’ll start to discover some issues. They will span the full spectrum of data quality dimensions and force you to make many decisions but the biggest decision (if you need to retain the data) will invariably be: “When and where do we cleanse?”

In this situation you often find some heavy handed policy making at play. I’ve seen large, complex and mission critical projects kick off with the mantra:

“There is no time for cleansing, we’ll take the data as-is.”

This “let’s fix it in the target” is a flawed approach because essentially the project leader in this example was saying:

“We’ll force the business users to cleanse the data after we’ve gone live and they’ve already got their hands full with learning the new system and operating procedures.”

So one of the first things I would say is that you want to be improving your data in the most practical environment possible. Is that the target immediately after go-live? You decide.

Next comes a policy for “fixing all the data in the staging area.” The staging area is a kind of no-man’s land where the data sits prior to migration. It allows you to transform the legacy data more closely into the target structures so that the final load can be validated and uploaded more easily.

The problem with cleansing data in the staging area is managing consistency and audit trails. If the migration load fails and a re-export of legacy data is required then you need to know exactly which edits were made so they can be re-applied to the fresh data. Obviously there are advantages of cleansing in the staging area because you’re not impacting either legacy or target system, staging areas are offline storage and processing areas.

The other big problem with staging areas however is that for many companies they’re impractical from a timing perspective. They simply can’t afford the time delay of exporting data to a temporary system, processing it, validating it and then finally loading it. During this time the legacy data could have changed (think 24/7) making a complicated refresh in the target required to keep everything in-sync.

Another alternative to offline staging areas is to use your data migration software to manage the cleansing activities “in-flight” as we say, or online. Using this approach we create a series of lookups or cleansing code which “fix” the data during the actual migration execution.

This is a great approach for speed and audit trailing because you have an accurate view of which logic has been applied. The problem is that because the data is constantly changing you can never be entirely sure of what results you’ll get. Still, it is a viable approach, particularly as there is a huge amount of data processing taking place during the migration anyway and most data quality functions carried out by tools are transformations of some sort.

Finally, we have the “cleanse in legacy” option. Here you’re resolving issues in the original source systems. Perhaps your old system didn’t need to retain the warranty id of equipment so it was only 80% complete. In the new system a warranty id is mandatory so a value needs to be present. Back you go over the old records updating the missing data.

Fixing data in legacy always makes more sense because you’re giving value back to the current users of the system.  Some migrations can take months, even years, so by improving quality in legacy systems you’re driving tangible value back into the business for long periods of time.

The problem with legacy data cleansing is that you may inadvertently break the old processes by modifying the data. For example, the new target system may not permit non-numerical characters in the telephone field but in the legacy system they used a “*” prefix to denote a mobile number. By removing the prefix the old customer text alert system will fail.

In Summary

When and where to cleanse your data in a data migration is not straightforward so don’t publish carte blanche policies. Profile your data early and examine the different types of issue. What you’ll find is that multiple approaches will be required. Sometimes for instance it really does make sense to cleanse your data in the target and sometimes you absolutely cannot fix the data in the legacy system.

Above all however, don’t ignore the need for upfront data quality analysis to help you determine the cleansing strategy for each issue that you find. Never let anyone tell you that this process is optional or too costly or too time consuming. Cutting time and cost on activities like this is a false economy and the fast-track to migration failure.

Read more posts by Dylan Jones on
the Data Roundtable.

4 Responses to “When and Where to Cleanse in Data Migration”

  1. Christophe

    Jul 07, 2012

    Cleanse the data is one thing, but at the same time you need to understand the reasons of data issues, and find solutions to avoid this situation in the future.

    Otherwise you will have to cleanse the data again and again in the target system.

    Reply to this comment
  2. Dylan Jones

    Jul 11, 2012

    Hi Christophe

    Thanks for dropping by.

    Yes, I totally agree. My preference is always to get a leap on the Data Migration project by resolving issues at source some months before so that the data naturally increases in quality leading up to the migration.

    The big problem with many migrations of course is that unlike traditional data quality programs you don’t have the luxury of time, if the business requires a specific dataset and is incomplete for example, you still need to go out there and find the data required or cleanse/transform existing data that has been incorrectly keyed for example.

    So yes, I agree, root cause resolution is my mantra too but data migration is a very special case that requires special consideration to manage the cleansing activity.

    Thanks for your comment, appreciated.

    Reply to this comment
  3. David Walker

    Jul 15, 2012

    Reply to this comment
  4. Lars Nielsen

    Jul 20, 2012

    Christophe, that is an excellent point. I’ve found time and time again, that it is better to start the cleanup before the actual migration. Unfortunately or fortunately (depending on who you’re talking to) from a business standpoint, we’re often forced to move at the speed of thought.

    So, for this reason Chrisophe is right, before migration is the preference but as Dylan is saying migration is a special case that typically happens less frequently (hopefully) within an organization. Time and time again, we’ve been brought in after the migration to cleanse data. Often times multiple times based on the setup and configuration of the systems in place. Going back to the original comment, root cause analysis is important but I use that analysis for different things depending on where we are at in the process (pre vs post migration).

    Reply to this comment

Leave a Reply