Data Migration and Customer Data Integration with Talend - Part 2

Mastering Data Migration: Ensuring Data Quality and Consistency with Talend Toolset

One major complexity of data migration projects is ensuring data quality and consistency, especially when multiple legacy environments are involved. To address this, a structured methodology and approach to data object definition, mapping, and cleansing rules is essential. In this second part of our data migration series, we will delve deeper into the possibilities offered by the Talend data quality toolset for data migration projects.

Often, there is a significant mismatch between the estimates and the actual impact of data migrations within larger transformation projects. After an organization has evaluated new systems and invested a substantial amount of money in them, preparing data for migration can seem less exciting. Yet, in our experience, it is remarkable how often data migration budgets are created under the assumption of near-perfect data. Even though the cost of data migration is typically only a fraction of the total expenditure on purchasing and implementing a new system, it remains a crucial success factor for the overall project and its adoption. Failing at data migration and data quality can exponentially increase costs and diminish the value of any project for your organization. Therefore, thorough analysis, appropriate planning, early involvement of stakeholders, and ownership across different business areas are essential for success in data migration.

Determining Data Quality for Successful Migration

Migrating data from one or more applications to others involves examining objects, data models, data structures, and definitions. These tasks can become even more complex due to the data quality issues that often exist in legacy systems. Consequently, most data migration projects evolve into data quality projects, providing an opportunity to establish a future-proof data governance strategy alongside the migration efforts.

Data Quality Audits – Establishing Your Starting Point

In addition to creating an inventory of data sources and objects to be migrated, along with their dictionary and data model, conducting an early data quality check can significantly help generate a realistic estimate of your migration efforts. At the initial stage of your project, it’s important to start profiling the data in your legacy systems. Even if you don’t yet have a mature design for the target system or specific data requirements, you can still identify key data objects, such as customer information and product attributes. Moreover, an early quality audit of the legacy system can be further refined and enhanced in future project iterations, ensuring a smoother migration process.
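At this stage, even a lightweight, code-level profile of a legacy extract can reveal where the biggest quality gaps are. The snippet below is a minimal sketch in Python/pandas, not the Talend profiling itself; the file and column names are hypothetical placeholders.

```python
import pandas as pd

# Illustrative only: profile a legacy extract before the target design is final.
# File name and columns are hypothetical placeholders.
df = pd.read_csv("legacy_customers.csv", dtype=str)

profile = pd.DataFrame({
    "null_pct": df.isna().mean().round(3) * 100,
    "distinct_values": df.nunique(),
    "distinct_pct": (df.nunique() / len(df)).round(3) * 100,
})
print(profile.sort_values("null_pct", ascending=False))

# A quick look at value patterns for a key attribute (e.g. postal codes)
# often surfaces formatting issues early.
if "postal_code" in df.columns:
    patterns = (
        df["postal_code"].dropna()
        .str.replace(r"[A-Za-z]", "A", regex=True)
        .str.replace(r"\d", "9", regex=True)
    )
    print(patterns.value_counts().head(10))
```

Even this rough null-rate, cardinality, and pattern overview gives an early indication of the cleansing effort per data object, which feeds directly into a more realistic migration estimate.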

Talend Data Quality: Enhancing Migration Projects

With a comprehensive toolset of data quality components, Talend can support migration projects from the very beginning.

  • Data Profiling and Quality Reports

Data profiling enables the creation of data quality reports, allowing you to analyze legacy data and gain initial insights through full-volume analysis.

  • Data Cleansing, Matching, and Consolidation

Defining data cleansing, matching, and consolidation rules prevents the transfer of dirty data into new systems. Automatic matching algorithms and machine learning can identify duplicates and golden records and apply survivorship rules.

  • Business Ownership and Governance

Business ownership and governance of data quality are supported by applications like the Data Stewardship Console and Data Prep Tooling. These tools allow users to work in an Excel-like environment to check and cleanse suspect records, combined with automated cleansing rules.

  • Audit Trail and Compliance

Logging cleansing and transformation actions generates an overall audit trail for your migration project, helping to track progress and fulfill compliance requirements.
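To make the audit-trail idea concrete, the following sketch logs one row per cleansing or transformation action at record level. It is a plain-Python illustration with hypothetical file, field, and rule names, standing in for the logging built into the Talend jobs.

```python
import csv
from datetime import datetime, timezone

# Minimal illustrative audit trail: one row per cleansing/transformation action.
# File name, columns, and the example cleansing rule are assumptions.
def open_audit_log(path: str):
    f = open(path, "a", newline="", encoding="utf-8")
    return f, csv.writer(f)

def log_action(writer, record_id, step, field, old_value, new_value):
    writer.writerow([
        record_id, step, field, old_value, new_value,
        datetime.now(timezone.utc).isoformat(),
    ])

f, audit = open_audit_log("migration_audit_log.csv")

# Example: trimming whitespace from a customer name during cleansing.
record_id, old_name = 1042, "  ACME Corp "
new_name = old_name.strip()
if new_name != old_name:
    log_action(audit, record_id, "cleanse", "customer_name", old_name, new_name)

f.close()
```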

Examples from Our Daily Practice

One recent use case involved consolidating a product hierarchy across various legacy applications. The challenge was to map local variations to a global hierarchy model with cleansed product master data for consolidated reporting. Each product (item) needed to be mapped to the master hierarchy. To keep things simple, we started with a two-level hierarchy and mapped each category present in the data.

Example: Product Master Data Consolidation

Master Hierarchy Structure

Let’s examine a basic master hierarchy structure for product categories.

Data Hierarchy

Compared with a product hierarchy from one of the legacy applications, the differences are easy to spot.

Mapped Hierarchy

The hierarchy mapping is then carried out, marking the missing or unmapped entries relative to the master hierarchy:

Handling Unmapped Entries

The “UNMAPPED” label marks entries (categories or sub-categories) that are not present in the master hierarchy structure. These unmapped entries need to be validated by functional data owners (SMEs / data stewards).
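As a minimal sketch of how the unmapped entries can be derived, the snippet below left-joins the distinct legacy category/sub-category combinations against the master hierarchy and labels everything without a match as “UNMAPPED”. It is written in Python with pandas purely for illustration; the actual mapping was done with Talend components, and the file and column names here are assumptions.

```python
import pandas as pd

# Hypothetical inputs: a legacy product extract and the master hierarchy definition.
legacy = pd.read_csv("legacy_products.csv", dtype=str)
master = pd.read_csv("master_hierarchy.csv", dtype=str)  # columns: category, sub_category

# Distinct category/sub-category combinations present in the legacy data.
combos = legacy[["category", "sub_category"]].drop_duplicates()

# Left-join against the master hierarchy; anything without a match is UNMAPPED.
mapped = combos.merge(master, on=["category", "sub_category"], how="left", indicator=True)
mapped["mapping_status"] = mapped["_merge"].map({"both": "MAPPED", "left_only": "UNMAPPED"})
unmapped = mapped.loc[mapped["mapping_status"] == "UNMAPPED"].drop(columns="_merge")
```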

Preparation Steps Involved

  1. Identify distinct category/sub-category combinations from the data sources.
  2. Manually map the product hierarchy from the data source to the master structure.
  3. Use Talend Data Integration (DI) to expose the “UNMAPPED” data hierarchy to the Talend Data Stewardship (TDS) Console.

Validation and Resolution

SMEs or Data Stewards ensure that valid categories (and sub-categories) replace the “UNMAPPED” ones. They validate the hierarchy and resolve issues by tagging valid categories/sub-categories. Once the hierarchy mapping is completed in the TDS console, the next step is to update the target data to reflect the correct/latest hierarchy using Talend DI again.
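Once the stewards have replaced the “UNMAPPED” entries, applying the validated mapping back to the target data is essentially an update join. Below is a minimal pandas sketch of that step; in the project the update was performed with Talend DI, and the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical inputs: the resolved mapping exported from the stewardship step,
# and the product data that still carries the legacy hierarchy values.
resolved = pd.read_csv("resolved_hierarchy_mapping.csv", dtype=str)
# columns: legacy_category, legacy_sub_category, master_category, master_sub_category
products = pd.read_csv("legacy_products.csv", dtype=str)

updated = products.merge(
    resolved,
    how="left",
    left_on=["category", "sub_category"],
    right_on=["legacy_category", "legacy_sub_category"],
)

# Rows that still have no master mapping are routed back for another review cycle.
still_unmapped = updated[updated["master_category"].isna()]
updated.drop(columns=["legacy_category", "legacy_sub_category"]).to_csv(
    "products_with_master_hierarchy.csv", index=False
)
```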

Interface for Hierarchy Management

An Excel-like front end for data stewards to analyze and maintain hierarchy records.

Talend Data Stewardship Console

Automatic validation of the columns helps identify suspect or incorrect records.

Product Data Analysis – Identifying Discrepancies and Golden Records

In addition to cleaning up the product hierarchy, product master data attributes in the data source must be validated for data quality. The data should be classified into the following discrepancy categories:

  • NULL fields
  • Duplicates (Numeric/Reference attribute)
  • Invalids (specific to attributes such as barcodes/GTIN)
  • Duplicates based on product description (the description might differ, but the product may be the same)

The starting point is creating a unique identifier for each record to establish an audit trail for any modifications, consolidations, and transformations. A sequential ID for each record is sufficient for easy identification later. Using Talend DI and DQ components, the source data is replicated into multiple streams to perform similar operations across multiple attributes (a simplified sketch follows the list below):

  • Identify NULL values
  • Identify attribute-level duplicates by grouping the data by attribute value and filtering for values that occur more than once
  • Validate standardized identifiers like barcodes:
    • NULLs
    • Duplicates
    • Invalids (e.g., based on length, non-numeric characters, incorrect check digits)
    • Patterns (from a DQ profiling perspective)
  • Consolidate discrepancies for all attributes
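A simplified sketch of these checks is shown below, using the barcode attribute as an example. It is plain Python/pandas rather than the Talend DI/DQ components used in the project; the file and column names are assumptions, and the check-digit routine follows the standard GTIN modulo-10 rule.

```python
import pandas as pd

def gtin_check_digit_ok(code: str) -> bool:
    """Validate a GTIN-8/12/13/14 check digit (standard modulo-10 rule)."""
    if not code or not code.isdigit() or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Weights 3,1,3,1,... applied from the rightmost body digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

products = pd.read_csv("legacy_products.csv", dtype=str)
products.insert(0, "record_id", range(1, len(products) + 1))  # sequential audit ID

# NULL check on the barcode attribute.
products["barcode_is_null"] = products["barcode"].isna()

# Attribute-level duplicates: group by value and flag values that occur more than once.
products["barcode_is_duplicate"] = (
    products.groupby("barcode")["barcode"].transform("size") > 1
)

# Invalids: wrong length, non-numeric characters, or a failing check digit.
valid = products["barcode"].map(
    lambda b: gtin_check_digit_ok(b) if isinstance(b, str) else False
)
products["barcode_is_invalid"] = products["barcode"].notna() & ~valid
```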

Additional columns/attributes are created to identify:

  • Discrepancy Type (NULLs, Invalid, Duplicate)
  • Discrepancy Attribute/Field (attribute in scope for that specific data row)

Finally, the product hierarchy master is mapped to the discrepancy table for reporting and statistics.
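Consolidating the per-attribute flags into a single discrepancy table is then a matter of unpivoting them into rows carrying a discrepancy type and attribute. A minimal sketch, continuing from the hypothetical flag columns of the earlier barcode example (file and column names remain assumptions):

```python
import pandas as pd

# Hypothetical input: the product records with per-attribute discrepancy flags.
products = pd.read_csv("products_with_flags.csv")

flag_columns = {
    ("NULL", "barcode"): "barcode_is_null",
    ("Duplicate", "barcode"): "barcode_is_duplicate",
    ("Invalid", "barcode"): "barcode_is_invalid",
}

rows = []
for (disc_type, attribute), flag in flag_columns.items():
    hits = products.loc[products[flag], ["record_id"]].copy()
    hits["discrepancy_type"] = disc_type
    hits["discrepancy_attribute"] = attribute
    rows.append(hits)
discrepancies = pd.concat(rows, ignore_index=True)

# Map the master product hierarchy onto the discrepancy table for reporting.
hierarchy = pd.read_csv("master_hierarchy_mapping.csv")  # record_id, category, sub_category
discrepancies = discrepancies.merge(hierarchy, on="record_id", how="left")
```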

Golden Record Identification

Compare the input sources against the identified discrepancies to isolate the remaining records; these then require further analysis:

  • Identify duplicates based on product description using the fuzzy-matching logic and algorithms integrated in Talend DQ (a simplified sketch follows this list).
  • Segregate records into unique, duplicate, and suspect categories.
  • Merge duplicates.
  • Suspect records are provided to data stewards for further analysis. Feedback is used to modify and fine-tune the deduplication algorithms.
  • Unique records that do not fall under any discrepancies are classified as “Golden Records”.
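The sketch below shows the idea behind the description-based matching in a much-reduced form: a naive pairwise comparison with a string-similarity ratio and two thresholds that split the results into duplicates and suspects. The project itself used Talend DQ's matching components; this is only an illustration, and the thresholds, file, and column names are assumptions.

```python
import pandas as pd
from difflib import SequenceMatcher

# Assumed thresholds for illustration only; real values need tuning with steward feedback.
DUPLICATE_THRESHOLD = 0.92
SUSPECT_THRESHOLD = 0.80

def similarity(a: str, b: str) -> float:
    """Simple string-similarity ratio standing in for a fuzzy-matching algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical input: the records left over after the discrepancy checks.
remaining = pd.read_csv("remaining_records.csv", dtype=str)  # record_id, description

matches = []
records = remaining.to_dict("records")
for i, left in enumerate(records):        # naive pairwise comparison (O(n^2)):
    for right in records[i + 1:]:         # fine for an illustration, not for volume
        a, b = left.get("description"), right.get("description")
        if not isinstance(a, str) or not isinstance(b, str):
            continue
        score = similarity(a, b)
        if score >= SUSPECT_THRESHOLD:
            matches.append({
                "record_id_a": left["record_id"],
                "record_id_b": right["record_id"],
                "score": round(score, 3),
                "status": "duplicate" if score >= DUPLICATE_THRESHOLD else "suspect",
            })

match_report = pd.DataFrame(matches)
# Duplicates can be merged automatically; suspects go to the data stewards for review.
matched_ids = (
    set(match_report["record_id_a"]).union(match_report["record_id_b"])
    if not match_report.empty else set()
)
golden_candidates = remaining[~remaining["record_id"].isin(matched_ids)]
```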

Reporting

During all processing steps, operations are logged at the record level. As a result, the log tables can be easily queried to generate multiple reporting formats. Furthermore, since all output data is mapped against the master product hierarchy, statistics can be broken down by country, legacy system, or category/sub-category, allowing data quality to be assessed at a granular level.
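For example, a breakdown per legacy system, category, and discrepancy type is a straightforward aggregation over the log table. A minimal sketch with hypothetical file and column names:

```python
import pandas as pd

# Illustrative reporting query on the record-level discrepancy log.
log = pd.read_csv("discrepancy_log.csv")

report = (
    log.groupby(["legacy_system", "master_category", "discrepancy_type"])
       .size()
       .reset_index(name="record_count")
       .sort_values("record_count", ascending=False)
)
print(report.head(20))
```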

This simplified example from our daily practice illustrates our primary goal: to create a migration framework that is used not just once, but can be transferred into a governed new environment, ultimately ensuring that data remains clean and in sync over time.

Data Governance: Maintaining Data Quality Post-Migration

Data quality is an ongoing effort. After making substantial investments to ensure that high-quality data enters your new systems, there is an opportunity to implement data quality processes that further leverage these migration efforts. By governing data maintenance and monitoring data movement between systems, you stay in control. If you plan to invest in data quality tools to complete your migration, be sure to include monitoring and reporting capabilities so that data quality is sustained after the migration. Ultimately, these steps will help ensure that your data remains reliable and accurate over time.

Take action now

Ready to ensure your data migration project is a success? Start by leveraging Talend’s comprehensive data quality toolset to identify and resolve discrepancies, establish golden records, and maintain data quality post-migration. Contact us today to learn how we can help you achieve a seamless and effective data migration.

Contact

Robin du Maine
Phone: +31 73 3030458
E-Mail: robin.dumaine@cimt.nl
