(Updated October 2018) If you believe that better data quality has huge business value, and you believe the old axiom that you cannot improve something if you cannot measure it, then it follows that measuring data quality is very, very important. And it’s not a one-time exercise. Data quality should be measured continuously to establish a baseline and trend; otherwise continuous improvement wouldn’t be possible.
Measuring data quality is not simple. We have all been exposed to metrics like accuracy, completeness, timeliness, integrity, consistency, appropriateness, etc. Wikipedia’s entry for Data Quality says that there are over 200 such metrics. Some metrics, like completeness and integrity, are relatively easy to measure; most data quality tools and ETL tools can express them as executable rules. But others are a lot harder to measure.
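To make that concrete, here is a minimal sketch of how completeness and integrity could be expressed as executable rules. The tables, columns, and sample records below are illustrative assumptions, not output from any particular tool.

```python
# Illustrative sample data: the tables, columns, and values are assumptions.
customers = [
    {"id": 1, "name": "Acme Corp", "country": "US"},
    {"id": 2, "name": "Globex", "country": None},  # missing country
]
orders = [
    {"id": 100, "customer_id": 1},
    {"id": 101, "customer_id": 99},  # points at a customer that does not exist
]

def completeness(records, column):
    """Share of records where a required column is actually populated."""
    filled = sum(1 for r in records if r.get(column) not in (None, ""))
    return filled / len(records)

def referential_integrity(child, fk, parent, pk="id"):
    """Share of child records whose foreign key resolves to a parent record."""
    parent_keys = {p[pk] for p in parent}
    return sum(1 for c in child if c[fk] in parent_keys) / len(child)

print(f"country completeness: {completeness(customers, 'country'):.0%}")        # 50%
print(f"order -> customer integrity: "
      f"{referential_integrity(orders, 'customer_id', customers):.0%}")          # 50%
```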
Accuracy measurement is tricky since it is not always a yes or no issue. For example, when geo-coding locations, accuracy is a matter of degree (or minutes and seconds). Recording names accurately can also be challenging: “correct” spelling, for example, is what the individual says is correct, without regard for standards of any kind.
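As an illustration of accuracy being a matter of degree, the sketch below scores a geocoded point by its distance from a trusted reference coordinate rather than as pass/fail. The coordinates are made up for the example.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Geocoded value vs. a trusted reference point (both coordinates are illustrative).
error_km = haversine_km(52.5205, 13.4095, 52.5200, 13.4050)
print(f"geocoding error: {error_km:.3f} km")  # accuracy is a distance, not a yes/no
```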
This brings up one of the most basic truths about master data: you cannot manage master data without understanding the context in which it will be used. Data quality tools can do many things, like format correction, whitespace removal, case normalization, address validation, etc. But even the most accurate address is useless if it is associated with the wrong customer, vendor, or property. This kind of data accuracy has no rules or formulas that can be applied; it requires a person, known as a Data Steward, to review the information in context.
However, you can’t expect a single person to know every detail of every record. So how do you get the data validated? You could compare the data with authoritative records, but if you had such authoritative records, this wouldn’t be a problem in the first place! You could also measure statistical distributions and detect anomalies, but that is not very effective against incomplete or broadly inaccurate data. Machine learning could potentially play a role here too, but it is very hard to tell whether the data is wrong or the real world has simply changed. In many cases you end up resorting to manual auditing again.
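As a sketch of the statistical approach, the snippet below flags values that sit far from an attribute’s historical distribution. The attribute, data, and threshold are assumptions, and the check by itself cannot say whether an outlier is bad data or a genuine change in the real world.

```python
from statistics import mean, stdev

def flag_outliers(values, z_threshold=2.0):
    """Return values more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Historical monthly order counts for one customer (illustrative data).
order_counts = [102, 98, 110, 95, 105, 101, 99, 640]
print(flag_outliers(order_counts))  # [640] -- a data error, or a real spike in demand?
```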
But there is a better way: allow anyone in the organization to identify data inaccuracies and raise issues. The issues can then be routed to the data steward(s) for correction and rolled up to compile metrics. This approach is akin to crowd-sourcing.
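A minimal sketch of what such a crowd-sourced flow might look like: anyone records an issue, the issue is routed to a steward by data domain, and open issues roll up into a simple metric. The domains, steward assignments, and fields are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date

# Illustrative steward assignment by data domain (an assumption, not a standard).
STEWARDS = {"customer": "alice", "vendor": "bob", "location": "carol"}

@dataclass
class DataQualityIssue:
    record_id: str
    domain: str                 # e.g. "customer", "vendor", "location"
    description: str
    raised_by: str
    raised_on: date = field(default_factory=date.today)
    status: str = "open"

    @property
    def assigned_steward(self):
        return STEWARDS.get(self.domain, "unassigned")

issues = [
    DataQualityIssue("CUST-0042", "customer", "Billing address belongs to a different company", "jsmith"),
    DataQualityIssue("VEND-0007", "vendor", "Duplicate of another vendor record", "mlee"),
    DataQualityIssue("CUST-0099", "customer", "Contact left the company last year", "jsmith"),
]

# Roll the raised issues up into a simple per-domain metric.
open_by_domain = Counter(i.domain for i in issues if i.status == "open")
print(dict(open_by_domain))        # {'customer': 2, 'vendor': 1}
print(issues[0].assigned_steward)  # 'alice'
```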
The trick is to provide end users with a dead easy way to raise an issue the moment an inaccuracy is discovered. Magnitude MDM provides browser interfaces for raising issues. We also have an open API for issues to be reported, tracked, and acted upon. We can even export data to a specially formatted Excel file, allowing users to review and correct data offline and then return the changes to MDM, with all the validation checks and data quality rules intact.
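For illustration only, here is what reporting an issue through such an API might look like from a client’s point of view. The endpoint, payload fields, and authentication shown are assumptions, not the actual Magnitude MDM API.

```python
import requests

# Hypothetical endpoint and payload, shown only to illustrate the pattern.
payload = {
    "recordId": "CUST-0042",
    "attribute": "billingAddress",
    "description": "Address belongs to a different customer",
    "raisedBy": "jsmith",
}
response = requests.post(
    "https://mdm.example.com/api/issues",          # hypothetical URL
    json=payload,
    headers={"Authorization": "Bearer <token>"},   # hypothetical auth scheme
    timeout=10,
)
response.raise_for_status()
print(response.json())  # the issue record as the hypothetical service returns it
```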
This context-oriented view is integral to the way Magnitude approaches master data management and data quality. It is why we start with a business information model rather than a data model: a model that reflects the way information is used in the business, not the way data is stored. We believe this distinction is critical to the success of any master data management initiative. Contact us to find out how we can help you manage your master and reference data more effectively!