This seems so obvious that it hardly warrants mentioning, right? So why have I visited so many companies over my career where data marts, or even entire warehouses, were built using reference data from multiple sources where the data integration is ad-hoc and hardwired?
I can think of several reasons that relate to realities of business intelligence and where it diverges from its own best practice.
- First is the tendency to start with the facts and work outward when designing marts and analytics. It’s easy to understand: starting at the center and working outward is the way a lot of people—including data warehouse designers—think. (And is also probably why I will never forget that Paul Lynde was the original center square on the game show Hollywood Squares.) But the subtle message is that facts are the foundation of a data warehouse. (They’re usually not.) And, perhaps, items on the periphery—the highest levels of aggregation—are a little less important. (They’re definitely not.)
- Second, many marts and warehouses start with a single (or at least dominant) data source and add others later. When new data sources are added, the integration has often been scripted using ETL tools and focuses on a ‘hardwired’ stitching of the new data into the existing data set, rather than a general purpose and flexible integration.
Those design tendencies, and the ‘hardwired’ integration they spawn, contribute a great deal to the shaky foundations of many data warehouses that lead to their premature demise. Even in successful warehouses, this rarely takes more than five years.
Tackling the reference data integration issues up-front builds trust in the data as well as keeps the design emphasis focused on how the data will be used once it has been added to the warehouse. And even though it may seem like you are making slow progress while spending that extra bit of time getting the reference data integration right, that’s okay; because the transactions are the fastest part of a warehouse to add, especially when the reference data is reliable.
When the warehouse project begins with a data integration effort using a master data management solution, it enables the business to start owning and governing the integrated data set. Otherwise, the business will likely focus its maintenance efforts on the central data source and leave the integrated data set to the technical crew. This way, not only is the business building the trust in the integrated data, but also developing responsibility for maintaining it—which increases the value of the warehouse to them. Performing tasks such as manipulating hierarchies and adding new data ensures that business users get exactly what they want out of the warehouse, and they can change it themselves when needed.
Tackling the reference data integration issues up-front also ensures that the warehouse is not dependent on any one source. After all, what happens to a warehouse built on a brittle, hardwired foundation when its primary data source is slated for obsolescence? What is chaos, if not doom, for the hardwired warehouse is barely a bump into the road for a warehouse built atop a firm master data foundation.