What’s the Root Cause of Bad Data?

Starting this week, I’ll be publishing a series of blogs on enterprise data governance. As always, I welcome your comments and feedback.

When it comes to data management, presentations and whitepapers all have a very consistent theme: Data is important, and we need to do something about it. The vendor landscape has changed. Technology fashions have changed. But the message remains the same, almost as if nobody were aware of the problem or had done anything about it.

Let’s look at the facts. How much have we spent on data management over the last 10 years? Gartner says that worldwide IT spend in 2008 was a whopping $3.4 trillion, an 8% growth over 2007. That’s roughly the GDP of Germany. Of that, about $40 billion was spent on data management software alone. Assuming typical labor-software-hardware ratios and an 8% growth rate, a rough answer to my question is $1.4 trillion. Clearly, we’ve done plenty about data. But are we doing the right thing?
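For the curious, here is one way the back-of-envelope math might work. The software share and growth rate below are my assumptions, not figures from Gartner; tweak them and the total moves, but it stays in the trillion-dollar range either way.

```python
# Back-of-envelope reconstruction of the 10-year estimate.
# Assumptions (mine, not from the post or Gartner): software is ~25% of
# total data-management spend (labor + hardware make up the rest), and
# spend grew 8% per year over the decade ending in 2008.
software_2008 = 40e9              # data-management software spend in 2008
software_share = 0.25             # assumed software fraction of total spend
growth = 1.08                     # assumed annual growth rate

total_2008 = software_2008 / software_share   # ~$160B total in 2008
# Sum spend for 2008 back through 1999; each earlier year is 8% smaller.
ten_year_total = sum(total_2008 / growth**k for k in range(10))
print(f"${ten_year_total / 1e12:.2f} trillion")  # → $1.16 trillion
```

Even with conservative assumptions, the decade's total lands on the same order of magnitude as the $1.4 trillion figure above.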

Near the end of 2008, the global financial system stood at the edge of the abyss. In a conference call with analysts, the CEO of a global banking giant was repeatedly asked to quantify mortgage-backed security holdings on the bank’s books. “I don’t have that information,” said the CEO over and over.

During the previous 10 years, this bank had spent a total of $37 billion on IT operations alone. We now know the bank was solvent. But at that moment, facing a collapsing stock and plummeting market confidence, the CEO couldn’t produce the one piece of data that could’ve saved his company and his job. After $37 billion spent. It was staggering.

That was not an isolated incident. Survey after survey points to chronic data problems in most organizations. $1.4 trillion hasn’t done the trick. There’s no reason to believe that spending more money doing the same things will make things any better. We need to rethink the problem and come up with a different approach. But to get to the right approach, we first need to identify the root cause.

Let’s take an example of a simple data quality problem. The finance department can’t send out an invoice because the customer’s billing address is missing. So Finance calls the salesperson. If the salesperson doesn’t have it on hand, someone will need to contact the customer. A few days later, the right billing address is unearthed and an invoice is sent. Stories like this are repeated every day, everywhere. Payment is delayed, affecting cash flow. Normal business processes break down, increasing cost. The economic impact is very real.

The obvious solution is to make sure that each sales rep enters a complete and accurate billing address when entering an order. The best way to tackle a data quality problem is to do it as far upstream as possible: at the point of entry, the moment someone captures the real world in bits and bytes. There is one problem: who will tell Sales?
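In code, point-of-entry enforcement can be as simple as refusing to save an order whose billing address is incomplete. This is a hypothetical sketch; the field names and order structure are illustrative, not from any real order-entry system.

```python
# Hypothetical point-of-entry check: surface missing billing fields at the
# moment the sales rep enters the order, instead of letting Finance discover
# the gap weeks later. Field names are illustrative assumptions.
REQUIRED_BILLING_FIELDS = ("street", "city", "postal_code", "country")

def validate_order(order: dict) -> list[str]:
    """Return a list of problems; an empty list means the order may be saved."""
    errors = []
    billing = order.get("billing_address") or {}
    for field in REQUIRED_BILLING_FIELDS:
        if not str(billing.get(field, "")).strip():
            errors.append(f"missing billing_address.{field}")
    return errors

order = {"customer": "Acme Corp",
         "billing_address": {"street": "1 Main St", "city": "Springfield"}}
print(validate_order(order))
# → ['missing billing_address.postal_code', 'missing billing_address.country']
```

The check is trivial; as the next paragraphs argue, the hard part is organizational: deciding who has the authority to make Sales live with the rejected save.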

The elephant in the room is that good data has a cost. It takes time and discipline to investigate, verify, and put good data in a system. And this piece of data — billing address — is not required for the Sales function. But finance needs it. And other business processes need it for operations or analysis. Data generated by one business function is consumed by multiple business functions downstream, often very distant from the point of entry. In other words, a large group of people benefit from good data, but they are usually not the same people who bear the cost of good data.

This poses an organizational and behavioral challenge: How do we make people accountable for good data that benefits others, most of whom they don’t even know about? Where does the authority come from? What are the positive and negative incentives?

Another challenge is that Sales is not the only group that can create, change or access customer data. Customer Service can, and so can Finance. An even larger group of people can see and report problems with data. If we assign sole ownership of customer data to sales, we absolve the rest of the organization from their responsibility.

Important data assets have multiple providers and multiple consumers who are often unaware of each other, and data quality is often not in the immediate interest of data providers. There is no transparency and accountability. This is the root cause of bad data. In the case of our global bank, traders and risk management staff in thousands of pockets throughout the globe are the providers of data that the CEO needed. When there’s no transparency and accountability, the aggregate data is untrustworthy. This is one of the key reasons that the big bank’s CEO lost his job, and its shareholders got nearly wiped out.

In my next blog, I’ll discuss various approaches to the problem and their merits.


This blog is part 1 of a multi-part series of blogs on the topic of Enterprise Data Governance. To read other posts from this series, please see below.

Part 2: Traditional Approach to Data Management Only Treats the Symptoms

Part 3: What do Environmental Policy and Data Governance Have in Common?

Part 4: Data Policies are the Instruments of Data Governance

Part 5: Data Governance Should be Formalized as a Business Process

Part 6: Send in the Yellow Jerseys: Organizing for Data Governance

Part 7: How to Set the Right Initial Scope for Data Governance?

Part 8: How to Build a Business Case for Data Governance?

11 replies
  1. Neil Raden says:
    Many years ago when I was trying to explain the concept of data integration for data warehousing to clients I used the phrases “locally consistent” and “globally consistent” to describe the situation you explained more clearly here. The problem is that systems that accept primary data are (hopefully) locally consistent in that the data makes sense for the purposes of that one system. It’s a little like Simpson’s Paradox though, because you put all of these locally consistent sets of data together and they are not globally consistent.

    How do you get a primary source system to arrange its data and rules to be globally consistent? That’s the heart of the problem.

    I’m looking forward to the rest of this series.


  2. Winston Chen says:

    Thanks for your comment. I couldn’t agree more. The most intractable problems in data management are caused by the conflict between local optimization and global optimization. The conflict arises in technical things like data models and rules. My blog points out the human behavioral conflict, between the short-term, local interest of the data providers versus the long-term, communal (global) interest of the data consumers. This is why governance is so important.

  3. Mike Wheeler says:

    Without saying it directly, you are both addressing the subject of data utilization across the end-to-end business process. As we look at data from a holistic Data Governance perspective, traditional validation techniques are inadequate, as they are event focused. Only by associating data with the overall business processes it interacts with can we truly begin to understand the policies that need to be defined to ensure the accuracy, timeliness, and value of those data assets.

  4. Lindsey Niedzielski says:

    Great post, Winston. I really like your analysis of data and how critical it is to analyze it in its beginning stage, at its entry point. We have a community for IM professionals (www.openmethodology.org) and have bookmarked this post for our users. Looking forward to reading your work in the future.
