The purists’ view is: a data warehouse integrates and stores data sourced from operational systems. In other words, a data warehouse does not create new data: it only combines and repackages data created elsewhere for analysis.
But anyone who has ever designed and built a data warehouse would tell you that if you take the purist view, your data warehouse wouldn’t be very useful. There is a class of data absolutely essential for data warehousing and analytics that cannot be sourced anywhere. The only way is to create and maintain them as a part of the data warehouse or the analytical environment. In addition, this class of data is so business critical that some type of stewardship is a necessity.
What are they?
Cross References. To create an integrated view of your business, codes from different internal systems as well as external data feeds need to be mapped to each other. This is not limited to detail level customer and product records, but also the important but less glorious codes like order type, channel, gender code, etc. Creating and maintaining cross references require current and detailed business knowledge.
Associations. For analysis, the attributes available from transactional systems are never sufficient. New attributes are needed, and they’re often new associations such as simple roll-ups, or many-to-many tagging or classifications. For example, a manufacturer needs to closely monitor sales to 10 “strategic customers”; an airline needs to slice data based on a new way to segment frequent flyers. Occasionally, they are composite roll-ups. For example, a market segment defined by laundry products sold at grocery chains in Latin America – a composite of product, customer and geography. Senior-level business managers and analysts need to create and maintain this set of data.
Hierarchies. These are the classic multi-level roll-ups that are essential for making data easy to navigate for the end user. The problem is that different users need different hierarchies, even though the underlying data are the same. For example, supply chain wants package type above brand in a product hierarchy, while marketing wants it the other way around. You want to be able to cater to local needs, but you also don’t want hierarchies to proliferate. Again, stewardship is essential.
Transformation Rules. When you develop transformations, you want them to be as data driven as possible. Some examples are unit conversion factors, look-up tables, even formulas for metrics. How would you maintain the data that drives the transformation rules? Recently, a large cable operator licensed our MDM product to do exactly this in front of their EDW.
Obviously, this class of data is critical to any data warehouse. They require a lot of business knowledge to maintain. And they need data stewardship. Every data warehouse has this problem. When I built data warehouses the old fashioned way before Kalido, we used to write simple forms that sit on top of the data warehouse so my business users have a way to manage them. The cost was typically not accounted for in the project budget. And I wished we had something with a flexible model, validation rules and workflow, but that would be way too expensive.
Kalido’s MDM product supports this use case, but often, we’re reluctant to use the term MDM when talking to customers with this problem, because using MDM for this generates more controversy than it clarifies. This class of data falls outside the conventional mold of master data. We at Kalido used to call them reference data, but that term is becoming adopted for industry standard codes sourced externally. Some people call them hierarchies — since there is a segment of software products for managing hierarchies — but clearly this class of data is far broader than simple hierarchies.
We need a name of this class of data. Do you have a suggestion?