The Quest for Golden Data

It seems these days that everyone is looking for a “360-degree view” of customer, product, vendor and just about every other category of information you can think of. Getting there is rarely easy, however, and it gets harder as organizations grow.

Today, very few organizations have only one operational system, and many have several. It is not uncommon for us to encounter clients with three, four or more ERPs, not to mention other legacy applications and data sources. As these systems grew over time they became more and more siloed, with each system focusing on a different view of the data, all of which must be reconciled and combined. How do we integrate, say, customer data from four or five sources when there are no common identifiers and even things like names and addresses are inconsistent in format or spelling?

The answer is we need to “harmonize” the data. Harmonization is an umbrella term that includes the functions of extraction from the source, transformation as required, mapping, matching, survivorship and merging.

Magnitude MDM and Data Harmonization

A brand-new harmonization module was added to MDM in Version 10 last year. Through harmonization, we can use a variety of fuzzy matching algorithms and rules to match together disparate data, deduplicating both within and across sources. Incoming data is then matched against existing records to align similar data. Once deduplicated and aligned, survivorship rules select the “best” source for each attribute in the data, producing the best possible “golden copy” record.
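To make the survivorship step concrete, here is a minimal Python sketch of one possible rule type: a per-attribute source priority list, where the first source with a non-empty value wins. It is an illustration only, not the product's implementation. The source names (erp_a, crm), the PRIORITY table and the golden_record function are all hypothetical.

```python
# Hypothetical survivorship rule: for each attribute, list sources from
# most to least trusted. None of these names come from the product.
PRIORITY = {
    "name": ["erp_a", "crm"],
    "phone": ["crm", "erp_a"],
}


def golden_record(matched_records, priority=PRIORITY):
    """Build one golden record from records already matched together."""
    by_source = {rec["source"]: rec for rec in matched_records}
    golden = {}
    for attr, sources in priority.items():
        for src in sources:
            value = by_source.get(src, {}).get(attr)
            if value:  # skip missing or empty values
                golden[attr] = value
                break
    return golden


records = [
    {"source": "erp_a", "name": "Acme Corp", "phone": ""},
    {"source": "crm", "name": "ACME", "phone": "555-0100"},
]
print(golden_record(records))
```

In this sketch the name survives from erp_a because that source is ranked first for names, while the phone number falls through to crm because the erp_a value is empty.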

Matching produces a “match score” for each pair of records based upon the degree of similarity and the precise rules used. The ability to set a weight (how much a particular attribute match contributes to the score), penalty (how much a no-match on an attribute penalizes the score) and strictness (how fuzzy the match should be) allows some very fine adjustments to be made. A match preview is provided to allow the designer to fine-tune the rules for each particular source.
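As a rough illustration of how weight, penalty and strictness could combine into a single score, consider the Python sketch below. It is built on assumptions: the RULES table, the use of difflib for fuzzy comparison and the clamping of the score to the range 0 to 1 are all choices made for this example, and the actual scoring used by the product may differ.

```python
from difflib import SequenceMatcher

# Hypothetical rule table: attribute -> (weight, penalty, strictness).
# The values are invented for illustration.
RULES = {
    "name": (0.6, 0.3, 0.85),
    "address": (0.4, 0.2, 0.90),
}


def similarity(a, b):
    """Fuzzy similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a, rec_b, rules=RULES):
    """Add the weight for each attribute that matches, subtract the penalty otherwise."""
    score = 0.0
    for attr, (weight, penalty, strictness) in rules.items():
        if similarity(rec_a[attr], rec_b[attr]) >= strictness:
            score += weight   # attribute is similar enough to count as a match
        else:
            score -= penalty  # a no-match pulls the score down
    return max(0.0, min(1.0, score))  # assumed: clamp to [0, 1]
```

Raising the strictness for an attribute makes the comparison less fuzzy, so fewer pairs earn the weight and more incur the penalty. That is the kind of adjustment a match preview helps to tune.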

But matching does not always produce a clean split into matches and no-matches. Because the matching is imprecise (i.e. fuzzy) from the start, there will be a range of match scores. Some pairs will have very high scores and will clearly be the same “thing,” while others will have very low scores and will clearly be different things. But there will also be a range of scores where it won’t be obvious whether two records describe the same thing – where the records have some things in common, but others that are different.

To deal with this “grey area” we have both an “auto accept threshold” and a “manual review threshold.” These two settings define a match score below which a source record is automatically promoted to a new master record (the manual review threshold) and a score above which the records are automatically merged according to the survivorship rules (the auto accept threshold). The matches that fall in between, in the “grey area,” are routed to data stewards for review and confirmation. Data stewards can choose to accept a match (which merges the records) or to reject it (which promotes the source record), either one match at a time or in bulk.
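The routing logic described above can be sketched in a few lines of Python. The threshold values, the route_match function and the outcome labels are hypothetical, and the handling of scores that fall exactly on a threshold is an assumption made for this example.

```python
def route_match(score, auto_accept=0.90, manual_review=0.60):
    """Decide what happens to a candidate match based on its score.

    Both threshold values are invented defaults for this sketch.
    """
    if score >= auto_accept:
        return "merge"    # merged automatically using the survivorship rules
    if score < manual_review:
        return "promote"  # source record becomes a new master record
    return "review"       # grey area: routed to a data steward


for s in (0.95, 0.75, 0.30):
    print(s, route_match(s))
```

Moving the two thresholds closer together shrinks the grey area and reduces the stewards' workload, at the cost of more automatic decisions on borderline pairs.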

This manual review option is critical to a good result since sometimes the decision to accept or reject a match is based not solely on the data content but rather requires some knowledge of the business.

Consider the example shown – the address is identical, but the name is slightly different. Should these records be merged or not? It depends on the business purpose of the harmonization. If we are trying to consolidate all business entities down to a single mailing address, then we may want to merge them. However, if the goal is to identify each legal entity we deal with then perhaps the records should remain distinct.

This highlights the most important aspect of harmonization: it is not strictly a technical process. Deduplication can be automated only up to a point; there will always be records that a purely technical approach cannot address. Like so many other areas of master data management, it requires knowledge of the business goals of the project and the business processes that will consume the data.

Harmonization will continue to evolve in functionality over the next several releases, including more refined survivorship options, more matching functions (including the introduction of machine learning) and an improved user interface. User response to harmonization has been very positive, with almost all new customers licensing the module and many existing clients upgrading to include it.

One of the tasks any MDM solution must perform is the integration of information. Harmonization makes integrating related data from different sources easier and faster, and introduces higher levels of automation.