Last week Kalido hosted a webinar entitled “Data Scientist: Your Must-Have Business Investment NOW.” The panelists (Carla Gentry, David Smith and Gregory Piatetsky) spoke eloquently about the role of the Data Scientist and the types of analysis they perform.
In case you missed the webinar, a Data Scientist is someone who looks at data with an eye for trends and patterns. They focus on predicting future results rather than summarizing past performance, as traditional business intelligence does. Data Scientists slice and dice data to uncover insights hidden within it. Listening to the webinar made me wonder: where do Data Scientists get their data?
This is not a simple question. Data Science depends upon data that has a number of key qualities. For example, it needs to be accurate. In fact, there was some discussion about how bad data can make the Data Scientist’s job more difficult. Sometimes, what looks like a trend is actually an artifact of bad data. Data Scientists always have to comb through their analysis, looking for any data anomaly that can skew results.
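To see how a single anomaly can skew results, here is a minimal sketch in Python. The numbers and the -999 sentinel value are illustrative assumptions, not data from the webinar; the point is that one bad record can drag a summary statistic far from the truth until it is filtered out.

```python
from statistics import mean

# Hypothetical daily sales figures; -999 is a "missing data" sentinel
# that slipped through and is being treated as a real value.
daily_sales = [120.0, 115.0, 130.0, -999.0, 125.0, 118.0]

raw_mean = mean(daily_sales)                    # skewed by the sentinel
cleaned = [x for x in daily_sales if x >= 0]    # comb out the anomaly
clean_mean = mean(cleaned)

print(round(raw_mean, 1))    # -65.2 -- looks like a collapse in sales
print(round(clean_mean, 1))  # 121.6 -- the actual picture
```

One stray sentinel turned an ordinary week into an apparent disaster, which is exactly the kind of false "trend" the panelists warned about.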
So we might think that data warehouses would be a good source of data for their activities. A well-designed data warehouse contains only valid and complete data. However, Data Science also depends on data that is current, reflecting the very latest developments in the business. Traditional data warehouses have difficulty staying fully current because of the latency introduced by the extensive error checking they require. But even if we can feed the very latest data into the warehouse, there is still another problem: will the data the Data Scientist needs even be available in the data warehouse at all?
By the very nature of their work, Data Scientists can't really predict what data they will need. Their analysis is inherently iterative: each question they answer often leads to more questions. Without well-established requirements, it would appear that a data warehouse can't possibly serve as a foundation for Data Science.
This is, of course, where an agile data warehouse comes in. If we can implement a data warehouse that is quick to build and easy to change, then we can keep pace with the shifting needs of the Data Scientist. We can deploy new data elements, revise the data relationships and include new attributes, turning these changes around in hours, rather than days or even weeks.
How can we build such an agile data infrastructure? Only by automating the basic, mundane tasks involved. We need a way to get from logical model to loaded data table, ready for queries, without involving data architects, ETL programmers, and DBAs. Do such tools exist? Yes, they do. Attend my webinar next week, “Rapid Data Integration Tools and Methods,” to hear about them, how they can help provide Data Scientists with the data they need, and how they can ensure the success of your data warehouse in general.
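The “logical model to loaded table” idea can be sketched in a few lines of Python. The model format, type names, and `model_to_ddl` function below are purely illustrative assumptions, not any vendor's actual tool; the point is that once attributes live in a model, the DDL can be generated rather than hand-written by a DBA.

```python
# A minimal sketch of model-driven automation: generate a CREATE TABLE
# statement directly from a logical model description, so adding a new
# attribute is a model change, not a hand-coded schema change.
# The model layout and type mappings here are illustrative assumptions.

TYPE_MAP = {"string": "VARCHAR(255)", "integer": "INTEGER", "date": "DATE"}

def model_to_ddl(table: str, attributes: dict) -> str:
    """Render a CREATE TABLE statement from {attribute_name: logical_type}."""
    cols = ",\n  ".join(f"{name} {TYPE_MAP[t]}" for name, t in attributes.items())
    return f"CREATE TABLE {table} (\n  {cols}\n);"

ddl = model_to_ddl("customer", {"customer_id": "integer",
                                "name": "string",
                                "signup_date": "date"})
print(ddl)
```

Adding a new data element for the Data Scientist then means adding one entry to the attribute dictionary and regenerating, which is how turnaround drops from weeks to hours.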