How to Measure Data Accuracy?

(Updated October 2018) If you believe that better data quality has real business value, and you believe the old axiom that you cannot improve what you cannot measure, then it follows that measuring data quality is essential. And it’s not a one-time exercise: data quality should be measured continuously to establish a baseline and a trend; otherwise continuous improvement isn’t possible.

Measuring data quality is not simple. We have all been exposed to metrics like accuracy, completeness, timeliness, integrity, consistency, appropriateness, etc. Wikipedia’s entry for Data Quality says that there are over 200 such metrics. Some, like completeness and integrity, are relatively easy to measure: most data quality tools and ETL tools can express them as executable rules. Others are a lot harder to measure.
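To make “executable rules” concrete, here is a minimal sketch, in Python, of how completeness and referential integrity might be computed over a batch of records. The field names and thresholds are purely illustrative, not any particular tool’s API:

```python
# Illustrative sketch only: completeness and referential integrity
# expressed as executable rules. Field names are hypothetical.

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def referential_integrity(records, field, valid_keys):
    """Fraction of records whose `field` refers to a known key."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if r.get(field) in valid_keys)
    return ok / len(records)

customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "DE"},
    {"id": 3, "email": "c@example.com", "country": "XX"},  # unknown country code
]
print(completeness(customers, "email"))                           # 2 of 3 filled
print(referential_integrity(customers, "country", {"US", "DE"}))  # 2 of 3 valid
```

Rules like these are easy to automate precisely because they need no human judgment, which is the contrast the rest of the post draws with accuracy.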

Accuracy measurement is tricky because it is not always a yes-or-no question. When geo-coding locations, for example, accuracy is a matter of degree (or of minutes and seconds). Recording names accurately can also be challenging: the “correct” spelling is whatever the individual says is correct, without regard for standards of any kind.
This brings up one of the most basic truths about master data: you cannot manage master data without understanding the context in which it will be used. Data quality tools can do many things: format correction, whitespace removal, case normalization, address validation, and so on. But even the most accurately formatted address is useless if it is associated with the wrong customer, vendor, or property. For this kind of accuracy there are no rules or formulas that can be applied; it requires a person, known as a Data Steward, to review the information in context.
However, you can’t expect a single person to know every detail of every record. So how do you get the data validated? You could compare the data with authoritative records, but if you had such authoritative records, this wouldn’t be a problem in the first place. You could measure statistical distributions and detect anomalies, but that is not very effective against incomplete or broadly inaccurate data. Machine learning could potentially play a role here too, but it is very hard to tell whether the data is wrong or the real world has simply changed. In many cases you end up resorting to manual auditing again.
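As an illustration of the statistical approach and its limits, here is a small sketch using a median-based (MAD) outlier test. It catches a glaring outlier, but, as noted above, it says nothing about values that are missing or consistently wrong; the data and threshold are illustrative:

```python
# Illustrative sketch only: flagging anomalies with a modified z-score
# (median and median absolute deviation, robust to the outliers we seek).
import statistics

def anomalies(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:  # all values (nearly) identical; nothing to flag
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

order_totals = [102.5, 98.0, 101.3, 99.9, 100.7, 5000.0]
print(anomalies(order_totals))  # [5000.0]
```

Note that if every order total were wrong in the same way, this test would flag nothing, which is exactly the weakness the post describes.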

But there is a better way. The best solution is to allow anyone in the organization to identify data inaccuracies and raise issues. The issues can then be routed to the data steward(s) for correction. And the issues can be rolled up to compile metrics. This approach is akin to crowd-sourcing.
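A minimal sketch of such a workflow might look like the following. The class names, fields, and metric are hypothetical, not any product’s API:

```python
# Illustrative sketch only: a crowd-sourced issue workflow. Any user can
# raise an issue against a record; issues are routed to a steward, and
# open issues roll up into a simple, trendable metric.
from dataclasses import dataclass, field

@dataclass
class Issue:
    record_id: str
    reported_by: str
    description: str
    status: str = "open"

@dataclass
class IssueQueue:
    steward: str  # who the issues are routed to
    issues: list = field(default_factory=list)

    def raise_issue(self, record_id, reported_by, description):
        issue = Issue(record_id, reported_by, description)
        self.issues.append(issue)
        return issue

    def resolve(self, issue):
        issue.status = "resolved"

    def open_issue_rate(self, total_records):
        """Open issues per record: one way to roll issues up into a metric."""
        open_count = sum(1 for i in self.issues if i.status == "open")
        return open_count / total_records

queue = IssueQueue(steward="jane.doe")
issue = queue.raise_issue("CUST-0042", "sales.rep",
                          "Address linked to wrong customer")
print(queue.open_issue_rate(total_records=1000))  # 0.001
queue.resolve(issue)
print(queue.open_issue_rate(total_records=1000))  # 0.0
```

The point of the sketch is the shape of the loop, raise, route, resolve, roll up, rather than any particular implementation.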

The trick is to provide end users with a dead easy way to raise an issue the moment an inaccuracy is discovered. Magnitude MDM provides browser interfaces for raising issues. We also have an open API for issues to be reported, tracked, and acted upon. We can even export data to a specially formatted Excel file, allowing users to review and correct data offline and then return the changes to MDM, with all the validation checks and data quality rules intact.

This context-oriented view is integral to the way Magnitude approaches master data management and data quality. It is why we start with a business information model, not a data model: a model that reflects the way information is used in the business, not the way data is stored. We believe this distinction is critical to the success of any master data management initiative. Contact us to find out how we can help you manage your master and reference data more effectively!
10 replies
  1. Dylan Jones says:

    One approach I’ve seen to reducing these kinds of data-entry-related inaccuracies is to design contextual, dynamic forms.

    For example, if you are entering details of a pickpocket, you may wish to enter details of the victim, time of day, street, pickpocket approach – was it violent/in busy crowd/at a concert etc.

    If the crime was burglary then there would be a completely different set of fields.

    The point being that sometimes the form design itself creates inaccuracies; by making it easier for staff to enter the correct information, I’ve seen far better accuracy.

    I agree with your point completely, though, that it is far too difficult for downstream data users to flag issues with the data; fixing this is just a matter of common sense and basic process improvement.

    • Winston Chen says:

      Dylan, yes, form design can absolutely improve accuracy. And this is something application vendors should pay more attention to. Also, as you said, process improvement is ultimately the most effective cure for data quality problems. Thanks for your comment.

      • Julian Schwarzenbach says:

        Winston, Dylan,

        Another way to counteract the problem of the default option being left unchanged is to set the default value as “Please select”. This makes it even easier to spot those that have not bothered to enter a suitable value!


  2. Julian Schwarzenbach says:


    I fully agree that measuring accuracy is both a vital activity and also one that is difficult to undertake. Your ‘pickpocket’ example is a good one, as it will be difficult to go back to those involved in a crime to confirm the details of the events.

    In the physical asset management world accuracy checking is made difficult for a number of reasons:
    1. Assets are frequently widely dispersed, so accuracy checking may involve significant amounts of travel
    2. Assets may be in hazardous locations which prevent easy access and may require permits to work, multi-person teams etc.
    3. Assets such as pipes and cables will typically be buried, so cannot be accessed to check the data accuracy
    4. Due to the wide variations in types and ages of assets deployed, it can be difficult to ensure that samples of assets checked for accuracy represent a valid subset of the overall asset stock
    5. Relying on checking data when someone has to respond to a problem will not be representative of the full population of assets

    Although all these points indicate the difficulty of assessing the accuracy of asset data, these should not be used as excuses for not assessing your data accuracy. Without a valid assessment of accuracy there is a risk that resulting business decisions may be compromised.


    • Winston Chen says:

      Julian, thanks for your comment. I heard a story from an oil pipeline operator about how often a technician would drive far out to perform maintenance on an asset, only to realize on arrival that the data about the asset was wrong and the wrong equipment had been brought along. You’re right, physical assets present their own unique challenges.

  3. Ken O'Connor says:

    Hi Winston,

    Great post – well done. I really like the idea of empowering everyone in the organisation to flag data quality issues.

    Your post prompted me to write a new post about what I call the “Ryanair Data Entry Model”.

    Rgds Ken

  4. Sushil Kumra says:

    Measuring data quality is a challenging but not impossible task. There is no silver bullet. One has to define valid data values for each data element collected, so that one knows what to measure against. Descriptive data collection and validation is always a challenge. In descriptive data collection, drop-downs are often used to minimize keystrokes and improve data accuracy. But humans being human will make mistakes and select a wrong choice.
    To fix this problem, one needs to develop data validation based on the event context. If we are collecting data about a crime, as Dylan suggested, there are some data elements unique to a particular crime. For example, a pickpocketing location recorded as a house address definitely raises suspicion about whether the captured crime data is accurate. This validation needs to take place as data is being submitted for saving and storage. Developing context-based validation is a daunting task, but I believe it will be effective.
    Another simple way to measure data quality is by using a “Data Profiling” tool. One can determine what kinds of data quality issues are present and take appropriate measures to fix them.
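    The context-based validation described above could be sketched along the following lines; the crime types, field names, and plausibility checks are purely illustrative:

```python
# Illustrative sketch only: context-dependent validation, where required
# fields and plausibility checks vary with the event type.
CRIME_RULES = {
    "pickpocketing": {
        "required": {"victim", "time_of_day", "street", "approach"},
        # A pickpocketing reported at a private residence is suspicious.
        "check": lambda rec: rec.get("location_type") != "residence",
    },
    "burglary": {
        "required": {"property_address", "entry_method", "items_taken"},
        "check": lambda rec: True,
    },
}

def validate(record):
    """Return a list of validation errors; empty means the record passed."""
    rules = CRIME_RULES.get(record.get("crime_type"))
    if rules is None:
        return ["unknown crime type"]
    errors = sorted(f"missing field: {f}"
                    for f in rules["required"] if f not in record)
    if not errors and not rules["check"](record):
        errors.append("context check failed")
    return errors

report = {
    "crime_type": "pickpocketing",
    "victim": "J. Smith",
    "time_of_day": "14:30",
    "street": "Main St",
    "approach": "busy crowd",
    "location_type": "residence",  # the plausibility check should flag this
}
print(validate(report))  # ['context check failed']
```

    Running checks like these at submission time is what keeps the context-specific errors from ever reaching the stored data.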

    • Winston Chen says:

      Thanks Sushil for your comment. You’re absolutely right that event context is the key to solving the problem, but it is not easy. Context is a hard thing for computers to get — which makes automation hard.

    • John Evans says:

      This is a good question. Given there is no standard for this, you should impose your own policies on what you consider to be authoritative. I think there are two angles to this: one, are the search results from authoritative sources, and two, are the search results relevant to the search term? These are not necessarily the same.

      You can influence the relevance and authoritativeness of search results by using advanced search syntax to exclude sites you don’t believe are authoritative, as well as to exclude homonyms and to control context. For example, using Google, if you wanted to search the web for information on “jaguar” the animal, the search term “jaguar -car” would remove results related to the car brand, and would therefore increase relevance. If you did not consider Wikipedia authoritative, you could add “-site:wikipedia.org” to the search syntax, or if you did not consider .net domains authoritative, you could use “-site:*.net”, which would remove results from any site using a .net domain. There is also Google syntax to limit results to a particular site that you do consider authoritative.

      This is a simple example of how you might try to get more “accurate” search results, other than simply relying on the algorithms and indexes used by your favorite search engine.
