If you believe that better data quality has huge business value, and you believe the old axiom that you cannot improve something if you cannot measure it, then it follows that measuring data quality is very, very important. And it’s not a one-time exercise. Data quality should be measured regularly to establish a baseline and trend; otherwise continuous improvement wouldn’t be possible.
Measuring data quality is not simple. We have all been exposed to metrics like accuracy, completeness, timeliness, integrity, consistency, appropriateness, and so on. Wikipedia’s entry for Data Quality says there are over 200 such metrics. Some metrics, like completeness and integrity, are relatively easy to measure: most data quality tools and ETL tools can express them as executable rules. But others are a lot harder to measure.
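To make "executable rules" concrete, here is a minimal sketch in Python of how completeness and referential integrity checks might look. The record layout and field names (`crime_type`, `officer_id`) are hypothetical, not taken from any particular tool:

```python
# Hypothetical crime-report records, modeled as plain dicts.
reports = [
    {"id": 1, "crime_type": "Pickpocketing", "officer_id": "A17"},
    {"id": 2, "crime_type": "", "officer_id": "B02"},
    {"id": 3, "crime_type": "Burglary", "officer_id": "Z99"},
]
officers = {"A17", "B02"}  # known officer IDs

def completeness(records, field):
    """Fraction of records where `field` is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def referential_integrity(records, field, valid_keys):
    """Fraction of records whose `field` references a known key."""
    ok = sum(1 for r in records if r.get(field) in valid_keys)
    return ok / len(records)

print(completeness(reports, "crime_type"))                  # 2 of 3 filled
print(referential_integrity(reports, "officer_id", officers))  # Z99 is unknown
```

Rules like these run mechanically over every record, which is exactly why completeness and integrity are the easy metrics.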
Accuracy is notorious. Let me give you an example. A Canadian law enforcement agency noticed that pickpocketing was suspiciously high in its crime statistics. Further investigation revealed that in the application for entering crime reports, “Pickpocketing” was the first item in the dropdown list for crime type, so it was recorded whenever the default was left unchanged. So how would one go about measuring the accuracy of this field? I can only think of two good ways.
First is to manually audit a sample. Take a small percentage of new crime reports and have data analysts go through them to determine if, given other pieces of descriptive information, the crime type field is accurate.
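The sampling step can be sketched in a few lines of Python. The 2% fraction and the record layout are assumptions for illustration; the analyst judgments are supplied by hand after review:

```python
import random

def audit_sample(records, fraction=0.02, seed=1):
    """Draw a reproducible random sample of records for manual review."""
    rng = random.Random(seed)  # fixed seed so the audit is repeatable
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)

def estimated_accuracy(audited):
    """audited: list of (record, is_accurate) pairs from analyst review."""
    correct = sum(1 for _, ok in audited if ok)
    return correct / len(audited)

reports = [{"id": i} for i in range(500)]  # hypothetical report IDs
sample = audit_sample(reports)             # 2% of 500 -> 10 reports to review

# Suppose analysts judge 9 of the 10 sampled crime types accurate:
reviewed = [(r, True) for r in sample[:9]] + [(sample[9], False)]
print(estimated_accuracy(reviewed))        # 0.9
```

The estimate is only as good as the sample size, so in practice you would pick the fraction to get a tolerable margin of error.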
Second is to allow anyone in the organization to identify data inaccuracies and raise issues. The issues can then be routed to the right person for correction. And the issues can be rolled up to compile metrics. This approach is akin to crowd-sourcing.
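The crowd-sourcing workflow described above — raise, route, roll up — can be sketched with a simple issue record. The field names, steward mapping, and issue texts here are all hypothetical:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class DataIssue:
    record_id: str
    field_name: str
    description: str
    reported_by: str
    status: str = "open"

def route(issue, stewards):
    """Route an issue to the steward responsible for its field."""
    return stewards.get(issue.field_name, "data-governance-team")

def rollup(issues):
    """Compile a simple metric: open issues per field."""
    return Counter(i.field_name for i in issues if i.status == "open")

issues = [
    DataIssue("CR-1042", "crime_type", "Marked pickpocketing; text describes burglary", "jsmith"),
    DataIssue("CR-1077", "crime_type", "Wrong category", "mlee"),
    DataIssue("CR-1101", "officer_id", "Unknown badge number", "jsmith"),
]
stewards = {"crime_type": "records-unit"}

print(route(issues[0], stewards))  # records-unit
print(rollup(issues))              # Counter({'crime_type': 2, 'officer_id': 1})
```

The rollup is what turns scattered complaints into a trend line: fields that accumulate issues month after month are your accuracy hot spots.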
I’ve seen other ways, but I don’t think they’re very effective. You could compare the data with authoritative records. But if you had authoritative records, this wouldn’t be a problem in the first place! You could also measure the statistical distribution and detect anomalies. For example, pickpocketing typically represents 10% of all crimes; if it goes up to 15%, there may be a problem. But it’s very hard to tell whether the data is wrong or there has been an actual change in the real world. You end up resorting to manual auditing again.
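The distribution check is easy to automate even if, as noted, a flag only tells you to investigate. A minimal sketch, with hypothetical counts and a made-up 3-point tolerance:

```python
def distribution_alert(counts, category, expected_share, tolerance=0.03):
    """Flag when a category's share drifts beyond tolerance from its baseline.

    A flag only means "investigate": the data may be wrong, or the real
    world may have changed -- this check cannot tell which.
    """
    share = counts[category] / sum(counts.values())
    return share, abs(share - expected_share) > tolerance

# Hypothetical monthly crime counts:
monthly = {"Pickpocketing": 150, "Burglary": 400, "Assault": 450}
share, flagged = distribution_alert(monthly, "Pickpocketing", expected_share=0.10)
print(share, flagged)  # 0.15 True -> send for manual review
```

In other words, anomaly detection is a triage tool that decides *where* to spend scarce auditing effort, not a measurement of accuracy itself.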
Of these techniques, I think crowd-sourcing is the best. The trick is to provide end users with a dead easy way to raise an issue the moment an inaccuracy is discovered. Both Magnitude MDM and Data Governance Director provide browser interfaces for raising issues. We also have an open API for issues to be reported, tracked, and acted upon.
Ideally, in every screen that presents data to end users, whether it’s a business application, dashboard, or report, there is a button for raising data issues. So SAP and Oracle, what are you waiting for?