Data Migration and Customer Data Integration with Talend - Part 2

Mastering Data Migration: Ensuring Data Quality and Consistency with Talend Toolset

One major complexity of data migration projects is ensuring data quality and consistency, especially when multiple legacy environments are involved. To address this, a structured methodology and approach to data object definition, mapping, and cleansing rules is essential. In this second part of our data migration series, we will delve deeper into the possibilities offered by the Talend data quality toolset for data migration projects.

Often, there is a significant mismatch between the estimates and the actual impact of data migrations within larger transformation projects. After an organization has evaluated new systems and invested a substantial amount of money in them, preparing data for migration can seem less exciting. Yet, in our experience, it is remarkable how often data migration budgets are created under the assumption of near-perfect data. Even though the cost of data migration is typically only a fraction of the total expenditure on purchasing and implementing a new system, it remains a crucial success factor for the overall project and its adoption. Failing at data migration and data quality can exponentially increase costs and diminish the value of any project for your organization. Therefore, thorough analysis, appropriate planning, early involvement of stakeholders, and ownership across different business areas are essential for success in data migration.

Determining Data Quality for Successful Migration

Migrating data from one or more applications to others involves examining objects, data models, data structures, and definitions. These tasks can become even more complex due to the data quality issues that often exist in legacy systems. Consequently, most data migration projects evolve into data quality projects, providing an opportunity to establish a future-proof data governance strategy alongside the migration efforts.

Data Quality Audits – Establishing Your Starting Point

In addition to creating an inventory of data sources and objects to be migrated, along with their dictionary and data model, conducting an early data quality check can significantly help generate a realistic estimate of your migration efforts. At the initial stage of your project, it’s important to start profiling the data in your legacy systems. Even if you don’t yet have a mature design for the target system or specific data requirements, you can still identify key data objects, such as customer information and product attributes. Moreover, an early quality audit of the legacy system can be further refined and enhanced in future project iterations, ensuring a smoother migration process.
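At this stage, even a lightweight, code-level profile of a legacy extract can reveal where the biggest quality gaps are. The snippet below is a minimal sketch in Python/pandas, not the Talend profiling itself; the file and column names are hypothetical placeholders.

```python
import pandas as pd

# Illustrative only: profile a legacy extract before the target design is final.
# File name and columns are hypothetical placeholders.
df = pd.read_csv("legacy_customers.csv", dtype=str)

profile = pd.DataFrame({
    "null_pct": df.isna().mean().round(3) * 100,
    "distinct_values": df.nunique(),
    "distinct_pct": (df.nunique() / len(df)).round(3) * 100,
})
print(profile.sort_values("null_pct", ascending=False))

# A quick look at value patterns for a key attribute (e.g. postal codes)
# often surfaces formatting issues early.
if "postal_code" in df.columns:
    patterns = (
        df["postal_code"].dropna()
        .str.replace(r"[A-Za-z]", "A", regex=True)
        .str.replace(r"\d", "9", regex=True)
    )
    print(patterns.value_counts().head(10))
```

Even this rough null-rate, cardinality, and pattern overview gives an early indication of the cleansing effort per data object, which feeds directly into a more realistic migration estimate.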

Talend Data Quality: Enhancing Migration Projects

With a comprehensive toolset of data quality components, Talend can support migration projects from the very beginning.

  • Data Profiling and Quality Reports

Data profiling enables the creation of data quality reports, allowing you to analyze legacy data and gain initial insights through full-volume analysis.

  • Data Cleansing, Matching, and Consolidation

Defining data cleansing, matching, and consolidation rules prevents the transfer of dirty data into new systems. Automatic matching algorithms and machine learning can identify duplicates and golden records and apply survivorship rules.

  • Business Ownership and Governance

Business ownership and governance of data quality are supported by applications like the Data Stewardship Console and Data Prep Tooling. These tools allow users to work in an Excel-like environment to check and cleanse suspect records, combined with automated cleansing rules.

  • Audit Trail and Compliance

Logging cleansing and transformation actions generates an overall audit trail for your migration project, helping to track progress and fulfill compliance requirements.
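To make the audit-trail idea concrete, the following sketch logs one row per cleansing or transformation action at record level. It is a plain-Python illustration with hypothetical file, field, and rule names, standing in for the logging built into the Talend jobs.

```python
import csv
from datetime import datetime, timezone

# Minimal illustrative audit trail: one row per cleansing/transformation action.
# File name, columns, and the example cleansing rule are assumptions.
def open_audit_log(path: str):
    f = open(path, "a", newline="", encoding="utf-8")
    return f, csv.writer(f)

def log_action(writer, record_id, step, field, old_value, new_value):
    writer.writerow([
        record_id, step, field, old_value, new_value,
        datetime.now(timezone.utc).isoformat(),
    ])

f, audit = open_audit_log("migration_audit_log.csv")

# Example: trimming whitespace from a customer name during cleansing.
record_id, old_name = 1042, "  ACME Corp "
new_name = old_name.strip()
if new_name != old_name:
    log_action(audit, record_id, "cleanse", "customer_name", old_name, new_name)

f.close()
```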

Examples from Our Daily Practice

One recent use case involved consolidating a product hierarchy across various legacy applications. The challenge was to map local variations to a global hierarchy model with cleansed product master data for consolidated reporting. Each product (item) needed to be mapped to the master hierarchy. To keep things simple, we started with a two-level hierarchy and mapped each category present in the data.

Example: Product Master Data Consolidation

Master Hierarchy Structure

Let’s examine a basic master hierarchy structure for product categories.

Data Hierarchy

Compared with a product hierarchy from one of the legacy applications, the differences are easy to spot.

Mapped Hierarchy

The hierarchy mapping is then carried out, marking the missing or unmapped entries relative to the master hierarchy:

Handling Unmapped Entries

The “UNMAPPED” label marks entries (categories or sub-categories) that are not present in the master hierarchy structure. These unmapped entries need to be validated by functional data owners (SMEs / data stewards).
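As a minimal sketch of how the unmapped entries can be derived, the snippet below left-joins the distinct legacy category/sub-category combinations against the master hierarchy and labels everything without a match as “UNMAPPED”. It is written in Python with pandas purely for illustration; the actual mapping was done with Talend components, and the file and column names here are assumptions.

```python
import pandas as pd

# Hypothetical inputs: a legacy product extract and the master hierarchy definition.
legacy = pd.read_csv("legacy_products.csv", dtype=str)
master = pd.read_csv("master_hierarchy.csv", dtype=str)  # columns: category, sub_category

# Distinct category/sub-category combinations present in the legacy data.
combos = legacy[["category", "sub_category"]].drop_duplicates()

# Left-join against the master hierarchy; anything without a match is UNMAPPED.
mapped = combos.merge(master, on=["category", "sub_category"], how="left", indicator=True)
mapped["mapping_status"] = mapped["_merge"].map({"both": "MAPPED", "left_only": "UNMAPPED"})
unmapped = mapped.loc[mapped["mapping_status"] == "UNMAPPED"].drop(columns="_merge")
```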

Preparation Steps Involved

  1. Identify distinct category/sub-category combinations from the data sources.
  2. Manually map the product hierarchy from the data source to the master structure.
  3. Use Talend Data Integration (DI) to expose the “UNMAPPED” data hierarchy to the Talend Data Stewardship (TDS) Console.

Validation and Resolution

SMEs or Data Stewards ensure that valid categories (and sub-categories) replace the “UNMAPPED” ones. They validate the hierarchy and resolve issues by tagging valid categories/sub-categories. Once the hierarchy mapping is completed in the TDS console, the next step is to update the target data to reflect the correct/latest hierarchy using Talend DI again.
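Once the stewards have replaced the “UNMAPPED” entries, applying the validated mapping back to the target data is essentially an update join. Below is a minimal pandas sketch of that step; in the project the update was performed with Talend DI, and the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical inputs: the resolved mapping exported from the stewardship step,
# and the product data that still carries the legacy hierarchy values.
resolved = pd.read_csv("resolved_hierarchy_mapping.csv", dtype=str)
# columns: legacy_category, legacy_sub_category, master_category, master_sub_category
products = pd.read_csv("legacy_products.csv", dtype=str)

updated = products.merge(
    resolved,
    how="left",
    left_on=["category", "sub_category"],
    right_on=["legacy_category", "legacy_sub_category"],
)

# Rows that still have no master mapping are routed back for another review cycle.
still_unmapped = updated[updated["master_category"].isna()]
updated.drop(columns=["legacy_category", "legacy_sub_category"]).to_csv(
    "products_with_master_hierarchy.csv", index=False
)
```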

Interface for Hierarchy Management

An Excel-like front end for data stewards to analyze and maintain hierarchy records.

Talend Data Stewardship Console

Automatic validation of the columns helps identify suspect or incorrect records.

Product Data Analysis – Identifying Discrepancies and Golden Records

In addition to cleaning up the product hierarchy, product master data attributes in the data source must be validated for data quality. The data should be classified into the following discrepancy categories:

  • NULL fields
  • Duplicates (Numeric/Reference attribute)
  • Invalids (specific to attributes such as barcodes/GTIN)
  • Duplicates based on product description (the description might differ, but the product may be the same)

The starting point is creating a unique identifier for each record to establish an audit trail for any modifications, consolidations, and transformations. A sequential ID for each record is sufficient for easy identification later. Using Talend DI and DQ components, the source data is replicated into multiple streams to perform similar operations across multiple attributes (a simplified sketch follows the list below):

  • Identify NULL values
  • Identify attribute-level duplicates by grouping the data by attribute value and filtering for values that occur more than once
  • Validate standardized identifiers like barcodes:
    • NULLs
    • Duplicates
    • Invalids (e.g., based on length, non-numeric characters, incorrect check digits)
    • Patterns (from a DQ profiling perspective)
  • Consolidate discrepancies for all attributes
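A simplified sketch of these checks is shown below, using the barcode attribute as an example. It is plain Python/pandas rather than the Talend DI/DQ components used in the project; the file and column names are assumptions, and the check-digit routine follows the standard GTIN modulo-10 rule.

```python
import pandas as pd

def gtin_check_digit_ok(code: str) -> bool:
    """Validate a GTIN-8/12/13/14 check digit (standard modulo-10 rule)."""
    if not code or not code.isdigit() or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Weights 3,1,3,1,... applied from the rightmost body digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

products = pd.read_csv("legacy_products.csv", dtype=str)
products.insert(0, "record_id", range(1, len(products) + 1))  # sequential audit ID

# NULL check on the barcode attribute.
products["barcode_is_null"] = products["barcode"].isna()

# Attribute-level duplicates: group by value and flag values that occur more than once.
products["barcode_is_duplicate"] = (
    products.groupby("barcode")["barcode"].transform("size") > 1
)

# Invalids: wrong length, non-numeric characters, or a failing check digit.
valid = products["barcode"].map(
    lambda b: gtin_check_digit_ok(b) if isinstance(b, str) else False
)
products["barcode_is_invalid"] = products["barcode"].notna() & ~valid
```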

Additional columns/attributes are created to identify:

  • Discrepancy Type (NULLs, Invalid, Duplicate)
  • Discrepancy Attribute/Field (attribute in scope for that specific data row)

Finally, the product hierarchy master is mapped to the discrepancy table for reporting and statistics.
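Consolidating the per-attribute flags into a single discrepancy table is then a matter of unpivoting them into rows carrying a discrepancy type and attribute. A minimal sketch, continuing from the hypothetical flag columns of the earlier barcode example (file and column names remain assumptions):

```python
import pandas as pd

# Hypothetical input: the product records with per-attribute discrepancy flags.
products = pd.read_csv("products_with_flags.csv")

flag_columns = {
    ("NULL", "barcode"): "barcode_is_null",
    ("Duplicate", "barcode"): "barcode_is_duplicate",
    ("Invalid", "barcode"): "barcode_is_invalid",
}

rows = []
for (disc_type, attribute), flag in flag_columns.items():
    hits = products.loc[products[flag], ["record_id"]].copy()
    hits["discrepancy_type"] = disc_type
    hits["discrepancy_attribute"] = attribute
    rows.append(hits)
discrepancies = pd.concat(rows, ignore_index=True)

# Map the master product hierarchy onto the discrepancy table for reporting.
hierarchy = pd.read_csv("master_hierarchy_mapping.csv")  # record_id, category, sub_category
discrepancies = discrepancies.merge(hierarchy, on="record_id", how="left")
```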

Golden Record Identification

Compare the input sources against the identified discrepancies to isolate the remaining records; these then require further analysis:

  • Identify duplicates based on product description using the fuzzy-matching logic and algorithms integrated in Talend DQ (a simplified sketch follows this list).
  • Segregate records into unique, duplicate, and suspect categories.
  • Merge duplicates.
  • Suspect records are provided to data stewards for further analysis. Feedback is used to modify and fine-tune the deduplication algorithms.
  • Unique records that do not fall under any discrepancies are classified as “Golden Records”.
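The sketch below shows the idea behind the description-based matching in a much-reduced form: a naive pairwise comparison with a string-similarity ratio and two thresholds that split the results into duplicates and suspects. The project itself used Talend DQ's matching components; this is only an illustration, and the thresholds, file, and column names are assumptions.

```python
import pandas as pd
from difflib import SequenceMatcher

# Assumed thresholds for illustration only; real values need tuning with steward feedback.
DUPLICATE_THRESHOLD = 0.92
SUSPECT_THRESHOLD = 0.80

def similarity(a: str, b: str) -> float:
    """Simple string-similarity ratio standing in for a fuzzy-matching algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical input: the records left over after the discrepancy checks.
remaining = pd.read_csv("remaining_records.csv", dtype=str)  # record_id, description

matches = []
records = remaining.to_dict("records")
for i, left in enumerate(records):        # naive pairwise comparison (O(n^2)):
    for right in records[i + 1:]:         # fine for an illustration, not for volume
        a, b = left.get("description"), right.get("description")
        if not isinstance(a, str) or not isinstance(b, str):
            continue
        score = similarity(a, b)
        if score >= SUSPECT_THRESHOLD:
            matches.append({
                "record_id_a": left["record_id"],
                "record_id_b": right["record_id"],
                "score": round(score, 3),
                "status": "duplicate" if score >= DUPLICATE_THRESHOLD else "suspect",
            })

match_report = pd.DataFrame(matches)
# Duplicates can be merged automatically; suspects go to the data stewards for review.
matched_ids = (
    set(match_report["record_id_a"]).union(match_report["record_id_b"])
    if not match_report.empty else set()
)
golden_candidates = remaining[~remaining["record_id"].isin(matched_ids)]
```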

Reporting

During all processing steps, operations are logged at the record level. As a result, the log tables can be easily queried to generate multiple reporting formats. Furthermore, since all output data is mapped against the master product hierarchy, statistics can be broken down by country, legacy system, or category/sub-category, allowing data quality to be assessed at a granular level.
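For example, a breakdown per legacy system, category, and discrepancy type is a straightforward aggregation over the log table. A minimal sketch with hypothetical file and column names:

```python
import pandas as pd

# Illustrative reporting query on the record-level discrepancy log.
log = pd.read_csv("discrepancy_log.csv")

report = (
    log.groupby(["legacy_system", "master_category", "discrepancy_type"])
       .size()
       .reset_index(name="record_count")
       .sort_values("record_count", ascending=False)
)
print(report.head(20))
```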

This simplified example from our daily practice illustrates our primary goal: to create a migration framework that is used not just once, but can be transferred into a governed new environment, ultimately ensuring that data remains clean and in sync over time.

Data Governance: Maintaining Data Quality Post-Migration

Data quality is an ongoing effort. After making substantial investments to ensure that high-quality data enters your new systems, there is an opportunity to implement data quality processes that further leverage these migration efforts. By governing data maintenance and monitoring data movement between systems, you stay in control. If you plan to invest in data quality tools to complete your migration, be sure to include monitoring and reporting capabilities so that data quality is sustained after the migration. Ultimately, these steps will help ensure that your data remains reliable and accurate over time.

Take action now

Ready to ensure your data migration project is a success? Start by leveraging Talend’s comprehensive data quality toolset to identify and resolve discrepancies, establish golden records, and maintain data quality post-migration. Contact us today to learn how we can help you achieve a seamless and effective data migration.

Contact

Robin du Maine
Phone: +31 73 3030458
E-Mail: robin.dumaine@cimt.nl
