Readers of our corporate blog and followers of the Kalido message know we have been promoting “No ETL” in the context of data warehousing. Inevitably this provokes outrage from ETL tool vendors at the notion that you don’t need ETL. So let me be clear: we believe you will always need the concept of Extract, Transform and Load. However, most ETL tools automate very little of the data integration work itself – they simply provide a mechanism for a human to build it manually. Our approach is to keep the concept of ETL – extract data from multiple sources, transform it, and load it into the data warehouse tables – but to automate the process. With Kalido, you don’t always need a separate ETL tool to do it.
An article I read this week expressed this vendor outrage and offered three steps to freeing yourself from the ETL burden. The article focuses on ETL processing workloads as the main problem and recommends Hadoop as an alternative place to run those workloads. I’d like to suggest that while batch processing significant amounts of data is definitely an issue for loading the data warehouse, there are other, more time-consuming problems that are the real burden – not only to building a data warehouse rapidly, but also to maintaining it over time: adding new sources of data and keeping it a reflection of the way your organization operates today.
While it may be perfectly valid to consider using Hadoop to reduce the time spent on processing ETL programs for the data warehouse, there is still one other very real, very big burden: you still need to write thousands of ETL programs to build and maintain the data warehouse. How can you do this faster? How can you reduce the effort, the steps, the number of jobs? (I spoke to someone at a recent data warehouse event who said they were working on reducing the 5,000 ETL jobs that manage their EDW. Yikes!) If your goal is to become more agile in responding to the business – and it should be – then Hadoop by itself isn’t going to reduce the steps required or the number of jobs needed. When you load customer and account data, someone still has to tell the load job to perform lookups to verify that the account is valid, that the customer name is valid, and so on. If Hadoop can process this faster, great. But every time you re-org or a requirement changes, how many ETL jobs does a human have to review, re-code, test and debug? Our survey of TDWI attendees showed that 63.2% take over a month to handle a warehouse change due to a re-org, and half of these take more than three months. It’s a good bet that the vast majority of that time is spent spelunking through ETL code.
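To make the lookup step concrete, here is a minimal Python sketch of the kind of validation logic a developer would otherwise hand-code into each load job. The field names (`account_id`, `customer_name`) and reference sets are hypothetical illustrations, not anything from a specific tool:

```python
def validate_rows(rows, valid_accounts, valid_customers):
    """Split incoming rows into loadable rows and rejects by looking up
    each account and customer against reference sets."""
    loadable, rejects = [], []
    for row in rows:
        errors = []
        if row.get("account_id") not in valid_accounts:
            errors.append("unknown account")
        if row.get("customer_name") not in valid_customers:
            errors.append("unknown customer")
        # Rows with any lookup failure are routed to a reject pile
        # for manual review instead of being loaded.
        (rejects if errors else loadable).append({**row, "errors": errors})
    return loadable, rejects
```

The point is not that this code is hard to write once; it is that variants of it must be written, reviewed, and re-tested for every source feed, every time a reference table or requirement changes.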
How do you increase consistency, which in turn reduces operational costs? This is where real money is spent. Our survey of TDWI Conference attendees further showed that nearly 26% – and sometimes more than that – spend more than one million dollars per year to support their data warehouse. In Step 2 of the article, one cited example is to use Hadoop for slowly changing dimensions processing. But that does not eliminate any of the expense of having a developer design SCD handling individually, object by object, nor the even greater expense of modifying that manual logic when something changes. This is exactly the sort of thing that is ripe for automation.
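To illustrate the per-object logic involved, here is a minimal Python sketch of Type 2 slowly-changing-dimension handling – expiring the current row and inserting a new version when a tracked attribute changes. The row layout and the tracked `segment` attribute are hypothetical; in a hand-coded ETL environment, a developer repeats a variant of this for every dimension:

```python
from datetime import date

def apply_scd_type2(dim_rows, incoming, today):
    """Type 2 SCD: when a tracked attribute changes, expire the current
    dimension row and append a new current version.
    dim_rows: list of dicts with keys 'key', 'segment', 'valid_from',
    'valid_to', 'current'; incoming: dict mapping key -> new segment."""
    # Iterate over a snapshot so appended rows are not re-scanned.
    for row in list(dim_rows):
        if not row["current"]:
            continue
        new_segment = incoming.get(row["key"])
        if new_segment is not None and new_segment != row["segment"]:
            row["valid_to"] = today      # expire the old version
            row["current"] = False
            dim_rows.append({            # insert the new version
                "key": row["key"],
                "segment": new_segment,
                "valid_from": today,
                "valid_to": None,
                "current": True,
            })
    return dim_rows
```

Running this on Hadoop instead of an ETL server may make it faster, but someone still has to design, code, and maintain it for each dimension – which is the cost an automated approach removes.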
The recommended approach in the article sounds to me like throwing more software at the problem, and with the limitations the author points out in Step 3, it isn’t even a complete set of software. Doesn’t this blunt the entire argument? Move your ETL processes to Hadoop, but use several other tools to load tables, ingest logs, develop transformations, and handle security, monitoring and administration – that sounds like adding to the burden, not subtracting from it.
The activity of ETL still very much applies in data warehousing, as well as in many other initiatives such as ERP instance consolidations, system migrations, and so on. But in the area of data warehousing, we’ve got a better mousetrap. Our automated approach takes much of the human labor out of coding the repetitive tasks that must otherwise be handled manually in ETL jobs. That is where the real burden lies.