Much Ado About Loading

Recent chatter in the data warehousing industry has again raised the topic of data loading performance. With each new version of a database used for warehousing, the vendor trumpets their performance, trying to one-up their database competitor. There can be no doubt that physical movement of data from file to database is a critical component of the load, but is it really where the bulk of the time is spent on the nightly refresh? This raises the question: what does it mean to load the warehouse?

Here at Kalido we think a load is much, much more than blasting data into rows in a table. Instead, we think of a load as the execution of all the processes required to integrate data into a warehouse. Our load processing leverages the power of the database so we benefit from advances in database processing speeds – indeed with our cascaded warehouse management capability on the Netezza platform and our recent port to Teradata, we’ve seen very significant advances in performance compared to the traditional RDBMS platforms that Kalido supports. But because so much more is accomplished and automated during the load process, a data load for Kalido is a much bigger deal compared to what happens in a traditional approach.

Here is a sample list of things that Kalido handles during loading:

  • Collect stats on the staging table.
  • Find duplicates in inbound data. Redundancy needs to be eliminated to ensure integrity.
  • Generate new surrogate keys for new records (“OIDs” in Kalido terminology).
  • Find all code changes. Unfortunately, some systems re-use codes, which is difficult to manage in “traditional warehouses.”
  • Process invalid subtypes. The built in notion of inheritance in the Kalido business model ensures that, for instance, a ship-to address must have a physical address, while a bill-to address must have valid bank account details.
  • Lookup surrogate keys for existing data items, i.e. those records that already exist in the warehouse, but have since been updated.
  • Lookup parents  and make sure these parents are valid at the point in time they are being referenced to avoid a case such as a sale occurring prior to a customer account being authorized – or worse with versioning implemented – this data would be “missing” from your report.
  • Delta detection. Sometimes the same file is sent though more than once, or there is no easy way of identifying changed records in source.
  • Update of staging table to either flag successfully loaded records, for subsequent archival, or remove these from staging table altogether.
  • Update staging table to mark records that do not conform to the quality rules defined in the model – these are automatically suspended and violations reported and summarized in the suspense interface.
  • Process reference data and subtype counts.
  • Update metadata, keeping track of the volume of data loaded to each object and which load updated which item, so you have a bullet-proof audit-trail and can recover easily, should an error be identified in the data later on.
  • Evaluate recursive data for circular references. Process constraint-based rules criteria – really useful if you are defining market segments that are the intersection of the products you sell, via the channels that sell them and groups of key customers – but you don’t have the segment stamped on the transaction.
  • Replace historic states – to support those situations where the business changes their mind about the effective date of something such as when a promotion started or ended.
  • Evaluate temporal constraints for time hierarchies – to avoid the case where year-to-date this month overlaps with year-to-date last month, otherwise you’ll get erroneous results in your reports.
  • Support variant hierarchies for “what-if” analysis – say if you want to load in next year’s “proposed” territory structure, so you could see how this year’s revenue would be broken out.

This is a partial list. And it is also important to note that this is configured automatically and driven from the business model, so there is a potentially large savings in manual labor during set up to consider. There is clearly a little more to this than rapidly pumping in data from a file into a table.

The moral of the story is, not all loading routines are alike! Before you compare raw load processing speeds, find out what is going on during the load. Then factor in the effort to accomplish those tasks and add that to your total load performance measurement on a like-for-like basis. Even if we ignore the work saved in ETL programming needed to perform the steps listed above, you may find that the act of moving the data is blazingly fast, but you still have to do all the ETL processing to service the business need. This is what we mean by “loading.”

1 reply

Trackbacks & Pingbacks

  1. […] This post was mentioned on Twitter by Merv Adrian, Kalido. Kalido said: Much Ado About Loading […]

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply