Building an Agile Data Warehouse – Part 2


If you accept the concepts of Agile Data Warehousing, you will quickly discover that the traditional toolset (modeling tools, ETL tools, etc.) brings little to the table to advance an agile methodology.  As we saw in part one, acquiring a pre-built model can help focus some of the requirements gathering, but it does little to enable the implementation of an agile methodology.  There are, however, a number of tools in the marketplace that do attempt to enable an agile data warehouse methodology.  These tools fall into one of two broad types: tools that employ, for lack of a better name, a “hyper-normalized” data structure and ones that employ a “hyper-generic” structure.

First, let’s define what we mean by each of these labels.  By “hyper-normalized” we mean tools that take data structures beyond third normal form and instead employ something like a data-vault type of structure.  This offers a number of attractive features to support agility.  For example, in these tools the data attributes are often extracted into “satellite” tables, so that change history can be tracked at the attribute level without the complexity of type 2 slowly changing dimension tables.  Foreign keys are also removed from the data, or “hub,” tables and moved to link tables.  This allows changes to data relationships to be implemented without requiring any changes to the hub tables themselves.
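As a rough illustration (not tied to any particular product), a data-vault style schema for employees might look like the sketch below.  The table and column names are hypothetical and chosen only to show the hub/satellite/link split described above.

```python
import sqlite3

# Illustrative data-vault style schema: hubs carry only business keys,
# satellites carry the history-tracked attributes, links carry relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hub: one row per business key; no descriptive attributes, no foreign keys
    CREATE TABLE hub_employee (
        employee_hk  INTEGER PRIMARY KEY,      -- surrogate key
        employee_bk  TEXT NOT NULL UNIQUE,     -- business key
        load_date    TEXT NOT NULL
    );

    -- Satellite: attributes split out so change history is kept per attribute group
    CREATE TABLE sat_employee_name (
        employee_hk  INTEGER NOT NULL REFERENCES hub_employee(employee_hk),
        load_date    TEXT NOT NULL,
        full_name    TEXT,
        PRIMARY KEY (employee_hk, load_date)
    );

    -- Link: relationships live here, so adding a relationship never alters a hub
    CREATE TABLE link_employee_department (
        employee_hk   INTEGER NOT NULL,
        department_hk INTEGER NOT NULL,
        load_date     TEXT NOT NULL,
        PRIMARY KEY (employee_hk, department_hk, load_date)
    );
""")
```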

Alternatively, in a “hyper-generic” solution, the principles of hyper-normalization are taken to their logical conclusion.  All of the “hub” data is moved to a single table (this is possible because hyper-normalization produces hub tables that all have nearly identical column structures).  However, this also means that we need a way to identify which rows belong to which data category, in effect a list of the “hub” tables.  This “table of tables” can be thought of as a hub table itself, so it can also be stored in this single table, which we will call the “thing table.”  Since we now have only the one “thing table,” we also need only one link table, which becomes a list of recursive links.  Any change to the warehouse, whether a data update or a structural change, is reduced to a simple row update in one or two tables.  The illustration accompanying this article is an example of the logical model of a hyper-generic data store.
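To make the idea concrete, here is a minimal, hypothetical sketch of such a store: one “thing” table that also holds the catalogue of categories, plus one recursive link table.  All names and sample rows are illustrative only.

```python
import sqlite3

# Minimal hyper-generic store: one "thing" table holds every entity AND the
# catalogue of categories (the "table of tables"); one link table holds every
# relationship.  All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thing (
        thing_id    INTEGER PRIMARY KEY,
        type_id     INTEGER REFERENCES thing(thing_id),  -- the row describing this row's category
        name        TEXT NOT NULL,
        valid_from  TEXT NOT NULL
    );

    CREATE TABLE link (
        parent_id   INTEGER NOT NULL REFERENCES thing(thing_id),
        child_id    INTEGER NOT NULL REFERENCES thing(thing_id),
        valid_from  TEXT NOT NULL,
        PRIMARY KEY (parent_id, child_id, valid_from)
    );
""")

# Categories and instances are both just rows; a model change is a row insert.
conn.executemany("INSERT INTO thing VALUES (?, ?, ?, ?)", [
    (1, None, "Entity Type", "2014-01-01"),   # root of the "table of tables"
    (2, 1,    "Employee",    "2014-01-01"),   # a category
    (3, 1,    "Department",  "2014-01-01"),   # another category
    (4, 2,    "Jane Smith",  "2014-01-01"),   # an instance of Employee
    (5, 3,    "Finance",     "2014-01-01"),   # an instance of Department
])
conn.execute("INSERT INTO link VALUES (4, 5, '2014-01-01')")  # Jane works in Finance
```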

Both of these “hyper” approaches have as their goal the reduction of change complexity and the “collateral damage” that might occur in a change.  One of the problems faced by warehouse administrators is that a seemingly small change to the warehouse can often have effects on objects and processes unrelated to the change itself.  As a result, extensive planning is needed prior to a change, and extensive regression testing is required afterwards.  Both hyper-normalized and hyper-generic approaches narrow the effect of changes and minimize the impact on the larger warehouse.

However, both approaches also have their own issues.  These “hyper” solutions appear simply to trade the maintenance problem for a querying problem.  Take as an example a simple query against a 3NF structure consisting of employees, departments and job codes.  With foreign keys in the employee table, this query requires a simple three-table join.  In the hyper-normalized approach, the same query requires a minimum of an eight-table join, and may require a join of dozens of tables, depending on the number of attribute tables involved.  Querying a hyper-generic structure requires just as many joins, with the difference that many of them are recursive joins back to the “thing table.”  Since most users already find 3NF structures hard to understand, queries of this complexity raise significant usability issues, and such complex joins raise scalability issues as well.  Therefore, both of these approaches require the creation of a separate, more de-normalized structure for query support.  How this conversion to a query and reporting layer occurs is a key difference between the two approaches.
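The contrast is easiest to see side by side.  Below is a hypothetical version of that employee/department/job-code question, first against a 3NF schema and then against the thing/link structure sketched earlier; the exact joins would vary by implementation.

```python
# Hypothetical SQL for the same question against the two structures; table and
# column names are illustrative, not from any real product.

# 3NF: foreign keys in the employee table make this a simple three-table join.
query_3nf = """
    SELECT e.name, d.name AS department, j.name AS job_code
    FROM employee   e
    JOIN department d ON d.department_id = e.department_id
    JOIN job_code   j ON j.job_code_id   = e.job_code_id
"""

# Hyper-generic: every hop is another join back to the thing/link tables, and
# most joins are recursive (thing to thing) in order to filter on the type rows.
query_hyper_generic = """
    SELECT emp.name, dept.name AS department, job.name AS job_code
    FROM thing emp
    JOIN thing emp_type  ON emp_type.thing_id  = emp.type_id  AND emp_type.name  = 'Employee'
    JOIN link  l1        ON l1.parent_id       = emp.thing_id
    JOIN thing dept      ON dept.thing_id      = l1.child_id
    JOIN thing dept_type ON dept_type.thing_id = dept.type_id AND dept_type.name = 'Department'
    JOIN link  l2        ON l2.parent_id       = emp.thing_id
    JOIN thing job       ON job.thing_id       = l2.child_id
    JOIN thing job_type  ON job_type.thing_id  = job.type_id  AND job_type.name  = 'Job Code'
"""
```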

The major advantage of a hyper-generic approach is that the de-normalization process can be standardized, and even automated, since the hyper-generic data structure is totally consistent regardless of the data being stored.  In a hyper-normalized structure, the table structure is still implementation specific: there is a hub table for each category of information, plus a large number of satellite and link tables, all of which vary with the actual data being stored.  This means that the ETL to both load and query this structure varies as well, and is therefore subject to errors when the model is updated.  In the hyper-generic approach the central warehouse structure is constant regardless of the information model.  This permits the construction of fixed code to load the data and to extract it into a more conventional model for BI.
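As a rough sketch of what such fixed code could look like, the routine below continues the hypothetical thing/link example from earlier (it is not any real product's implementation).  It rebuilds a flat reporting view for any category, driven entirely by the model rows rather than by hand-written, table-specific ETL.

```python
import sqlite3

def build_reporting_view(conn: sqlite3.Connection, category: str) -> str:
    """Rebuild a flat reporting view for one category of the hyper-generic store.

    Because the store always has the same two tables, this one routine works for
    any category, old or new; the information model (rows in `thing`) is the
    only thing that varies.  Purely illustrative.
    """
    row = conn.execute("SELECT thing_id FROM thing WHERE name = ?", (category,)).fetchone()
    if row is None:
        raise ValueError(f"unknown category: {category}")
    type_id = row[0]

    view = "rpt_" + category.lower().replace(" ", "_")
    conn.execute(f"DROP VIEW IF EXISTS {view}")
    conn.execute(f"""
        CREATE VIEW {view} AS
        SELECT t.thing_id, t.name,
               rel_type.name AS related_category,
               rel.name      AS related_name
        FROM thing t
        LEFT JOIN link  l        ON l.parent_id       = t.thing_id
        LEFT JOIN thing rel      ON rel.thing_id      = l.child_id
        LEFT JOIN thing rel_type ON rel_type.thing_id = rel.type_id
        WHERE t.type_id = {type_id}
    """)
    return view
```

Called as build_reporting_view(conn, "Employee"), it produces a queryable rpt_employee view, and the very same call works unchanged after the information model grows.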

Hyper-normalized products do attempt to solve this and other inherent problems, namely that designing the hyper-normalized physical model is hard and that translating it into a BI-accessible form is extremely difficult.  Regardless of the tools provided, the fact remains that either querying the hyper-normalized structure directly, or converting it to a more conventional structure via ETL, leaves the user with the problem of managing the changes required in the ‘extraction’ phase.

This is not an issue in hyper-generic tools like Kalido.  By taking advantage of the fact that the core data structure for every Kalido warehouse is identical, regardless of the information model implemented, the transformation of the hyper-generic structure into a more traditional query structure is totally automated.  The information model drives the structure of the query and reporting schema and the product is able to both create that structure and populate it using totally standard code that is consistent across all warehouses, everywhere.

In fact, this means that the reporting schema (star, snowflake, 3NF or something else) can be reconstructed on demand.  This allows users not only to change the data model, but also to change the degree of normalization in the reporting schema.  The totally consistent core structure of the hyper-generic approach allows the entire conversion to a reporting schema to be automated, “solving” both the maintenance and the querying problem.  The result is an enormous gain in agility, allowing changes to be implemented and results obtained in days, rather than weeks.
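To be clear, the sketch below is not Kalido's implementation; it simply continues the hypothetical thing/link example to show the principle: a model change is just new rows in the generic store, after which the reporting layer is regenerated rather than hand-modified.

```python
# A new category of data arrives: no ALTER TABLE, no new ETL code, just rows.
# (Continues the hypothetical conn and build_reporting_view from the earlier sketches.)
conn.executemany("INSERT INTO thing VALUES (?, ?, ?, ?)", [
    (10, 1,  "Cost Center", "2014-06-01"),   # new category in the "table of tables"
    (11, 10, "CC-100",      "2014-06-01"),   # an instance of the new category
])
conn.execute("INSERT INTO link VALUES (4, 11, '2014-06-01')")  # Jane charged to CC-100

# Regenerate the reporting layer on demand; the same fixed code handles the
# brand-new category exactly as it handles the old ones.
build_reporting_view(conn, "Employee")
build_reporting_view(conn, "Cost Center")
```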

Both the hyper-normalized and hyper-generic approaches are somewhat radical; traditional 3NF data structures and the tools that support them have been around for a long time.  Even so, there are nearly a thousand warehouses in operation today that employ one of these approaches.  Traditional tools work well enough in a traditional waterfall methodology, but truly leveraging the advantages of an agile methodology requires new tools built to support agility.  Part of the agility advantage comes from flexibility and the ability to manage change.  But the ability to automate processes, and thereby significantly reduce the time required to plan, implement and test changes, is where the real payoff from an agile methodology is realized.  If automation is the key, then hyper-generic is the preferable approach.
