I just read a recent article in Information Management entitled “Who in the World Needs a Data Warehouse?” On seeing the title, I immediately thought it would be an article about how companies can simply load their data into memory to avoid building a data warehouse entirely. As I suspected, that is the author’s proposed option. The article listed a variety of issues with building a data warehouse, including capturing requirements, scope creep, ETL design complexity, data integration, and data quality. In several prior posts, I discussed ways to handle requirements capture and ETL through business modeling and automation. In this post I want to discuss data quality. The article says “data that is housed in the data warehouse is often either incorrect or inconsistent.” If in-memory analytics is the answer to this problem, I ask: what’s the difference if you load incorrect and inconsistent data into memory? How does simply moving it off disk fix the problem? The answer is: it doesn’t!
At the very highest level, the inconsistent and incorrect data in the warehouse stems from data quality issues with master data. Fields are missing, or fields for the same thing are named differently in different data sources. There may be multiple names and addresses for the same customer, or there may be conflicting product codes. And these issues can be huge – for example, one Kalido customer is managing a complex dataset comprising more than 500,000 product names and SKUs from over 5,000 suppliers and more than 100 retail customers.
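To make the problem concrete, here is a minimal sketch of the kind of inconsistency described above: the same customer recorded under slightly different spellings in two source systems. The systems, names, and `normalize` helper are all invented for illustration; they are not any particular vendor's logic.

```python
import re

def normalize(record):
    """Build a crude match key by lowercasing and stripping punctuation."""
    key = f"{record['name']} {record['address']}".lower()
    return re.sub(r"[^a-z0-9 ]", "", key).strip()

# Two rows that a naive load would treat as two different customers.
crm_row = {"name": "ACME Corp.", "address": "12 Main St"}
erp_row = {"name": "Acme Corp",  "address": "12 main st."}

# After normalization, both rows collide on the same match key,
# revealing that they likely describe one customer.
print(normalize(crm_row) == normalize(erp_row))  # True
```

Real matching is far harder than this (nicknames, abbreviations, transposed address parts), which is exactly why dedicated cleansing and matching tools exist; but even this toy example shows that the duplication problem lives in the data itself, not in where the data is stored.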
One obvious way to address part of the data quality problem is to use a tool, such as those provided by Kalido partner Trillium Software, that can cleanse records before they are loaded. But another area that needs to be addressed is the use of a master data management tool. These tools frequently offer data matching capabilities along with data stewardship and workflow to make the data quality process sustainable. Some MDM tools focus only on customer data, some focus only on product data, and a few are usable for any data domain – financial, employee, supplier, location, KPI, etc. There is a good chance your data warehouse will include data from many different domains, so for a data warehousing project it makes sense to use a multi-domain capable MDM tool as a strategy for ensuring clean, consistent, and accurate data is loaded into your warehouse.
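The matching-plus-stewardship workflow mentioned above can be sketched in a few lines. This is a hypothetical survivorship step, not the behavior of any specific MDM product: once records have been matched as the same customer, keep one value per field in a “golden record” and route any conflicting fields to a data steward for resolution.

```python
def merge(records):
    """Build a golden record from matched duplicates; flag conflicts."""
    golden, conflicts = {}, []
    fields = {f for r in records for f in r}
    for field in sorted(fields):
        # Ignore missing values; a field conflicts only if two
        # records carry different non-empty values for it.
        values = {r[field] for r in records if r.get(field)}
        if len(values) > 1:
            conflicts.append(field)       # steward must resolve
        golden[field] = sorted(values)[0] if values else None
    return golden, conflicts

# Two source rows already matched as the same customer (invented data).
rows = [
    {"name": "Acme Corp", "phone": None,       "segment": "Retail"},
    {"name": "Acme Corp", "phone": "555-0100", "segment": "Wholesale"},
]
golden, conflicts = merge(rows)
print(golden["phone"])   # 555-0100 (the one non-missing value survives)
print(conflicts)         # ['segment'] routed to data stewardship
```

The point of the sketch is the workflow, not the merge rule: filling gaps is automatic, but genuinely conflicting values need a human steward, which is why sustainable data quality requires the stewardship and workflow features these tools provide.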
In-memory tools are great for rapid user access performance, but they are not a cure-all for the issues outlined in the Information Management article. What in-memory tools offer in ease of use and speed for gathering information and presenting it to users, they lack in resolving consistency and accuracy in the incoming master data from operational sources. Companies with multiple data sources and big data quality problems instead need a solution that ensures the master data is correct and consistent. When you combine a master data management capability with a robust and agile data warehouse, you can address many of the issues noted in the article. For any BI application fed from multiple data sources, companies are going to need this type of architecture behind their in-memory tools because of the potentially wide range of data sources, and therefore the increased propensity to run into poor-quality master data.
If you are suffering from master data quality problems in your current data warehouse, please don’t think they will magically disappear when you move the data to an in-memory solution. You must address your MDM issues. An MDM solution integrated with a data warehouse is an attractive option to evaluate.