In a recent webcast I did with Jim Harris, he talked about two views of data quality: provider centric and consumer centric. According to the provider centric view, data are just digital representations of real world things. If the representations are accurate, then you can use them for anything. In other words, data is good as long as data providers do their job right. The consumer centric view says data is good only if it’s fit for use, i.e., if it meets the declared needs of consumers.
In fact, beyond data quality, many of the heated arguments in data management are rooted in this philosophical debate. And it is not just a theoretical argument: the implications for data management practices are huge.
Let’s take the debate between Inmon and Kimball. The Inmon camp takes the provider centric view. Data can be stored in a use case neutral way. Their states are absolute, like those of objects in classical, Newtonian physics. So, if you cleanse your data properly and organize it based on its intrinsic properties (3NF) in a big data warehouse, you can meet the needs of any consumer.
The Kimball camp, on the other hand, takes the consumer centric view: Data’s value is in the eye of the beholder and therefore relative. So data should be managed based on its specific use. This gives rise to the star schema, which organizes data to answer a bounded set of questions for a bounded set of consumers. Within those boundaries, navigation is easy and intuitive, and queries come back fast.
Provider centric people believe there is a single version of the truth. Consumer centric people are skeptical. Provider centric people think of data as bullion that you can store in a vault; consumer centric people think of data more as employees, who are valuable only when they’re put to work on jobs that match their competencies.
This debate really comes down to a simple question on the nature of data: can data be consumer independent? In other words, is it possible for there to be a single set of data definitions and rules, which, when instantiated in a repository like an MDM system or a data warehouse, can serve any consumption need? If the answer is yes, we should go with the provider view. If the answer is no, we should go with the consumer view.
As much as I wish for a simple world, I think the answer is, unfortunately, yes and no and maybe. It depends on the kind of data.
Some data are plain, immutable facts:
- A point-of-sale transaction.
- A customer’s legal name.
- A click on a web site.
In general, these types of data represent real world events and physical objects, so they are indeed consumer neutral. We should manage them in a provider centric way and try to establish a single version of the truth.
But other types of data are not so absolute:
- Hierarchies. They are typically designed with a purpose in mind. A single hierarchy will never meet the needs of all consumers.
- Customer classification. Unlike legal name, it has no basis in the real world.
- Web sessions. If the user navigates away for 10 minutes and comes back, is it the same session? How about 2 hours?
These types of data are typically invented concepts, which, when created for one purpose, may be utterly useless for another. We need to manage them in a consumer centric way.
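The web session case is easy to make concrete. Here is a minimal Python sketch, assuming a session is defined purely by an inactivity timeout. The function name `count_sessions`, the sample clicks, and the three timeout values are all made up for illustration; real analytics tools may define sessions differently.

```python
from datetime import datetime, timedelta


def count_sessions(click_times, timeout):
    """Count sessions, starting a new one whenever the gap between
    two consecutive clicks is longer than `timeout`."""
    ordered = sorted(click_times)
    if not ordered:
        return 0
    sessions = 1
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > timeout:
            sessions += 1
    return sessions


# The clicks are plain facts: they happened at these times (hypothetical data).
clicks = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 5),    # 5 minutes after the previous click
    datetime(2024, 1, 1, 9, 20),   # 15 minutes after the previous click
    datetime(2024, 1, 1, 11, 30),  # 2 hours 10 minutes after the previous click
]

# The sessions are invented: each consumer's timeout yields a different count.
for label, timeout in [
    ("10-minute timeout", timedelta(minutes=10)),
    ("1-hour timeout", timedelta(hours=1)),
    ("3-hour timeout", timedelta(hours=3)),
]:
    print(label, "->", count_sessions(clicks, timeout), "session(s)")
```

The same four clicks come out as three sessions, two sessions, or one session depending on which timeout the consumer picks. The click timestamps can have a single version of the truth; the session count cannot.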
Between these extremes there are shades of gray. Data governance is critical in making these decisions in a collaborative way and expressing them as policies.