One of my kids’ favorite books is The Three Little Wolves and the Big Bad Pig, which upends the classic tale of the three little pigs and the big bad wolf. There are good lessons in it for data management.
Everywhere we hear talk of Big Data, both about handling the explosive growth of digital data and about harnessing that data to obtain insights. Thus far, the technology deployed to handle Big Data has emphasized storage density and query performance. For storage density, high-capacity disk drives and compression have led to rapid declines in price-per-terabyte, a common metric for evaluating analytical databases. To increase query performance, vendors are racing to take advantage of new hardware such as flash memory, and of software techniques such as MapReduce and columnar storage. Bigger and faster, in other words, is how we’re taking on Big Data.
This reminds me of the three little wolves’ failed attempts to keep the Big Bad Pig at bay. They build stronger and stronger houses with more and more advanced materials: from bricks, to concrete, and finally to a fortress made of iron bars, armor plates, and 37 heavy metal padlocks.
Let’s try to get a better understanding of the nature of Big Bad Data. Curt Monash wrote about the difference between machine-generated data and human-generated data. (For the purpose of this discussion, we’ll stick with structured data.) Event logs, sensor readings, call details, stock trades, and RFID and barcode scans, for example, are machine generated. Their growth is “limited by capital budget and Moore’s law”. Human-generated data, on the other hand, like customer records, orders, and product information, requires human fingers on keyboards. As such, its growth is constrained by the dexterity of our fingers and the size of the technology-enabled population. Curt rightly asserts that machine-generated data is driving the Big Data era, because machines’ ability to generate data is growing a lot faster than humans’.
We keep Big Data in databases because we want to analyze it. Otherwise, why incur the storage cost? This is where the paramount importance of human-generated data comes in. Take bar-code scans at a cash register. The cash register generates a record consisting of the digits in the bar code, a timestamp, the ID of the cash register, the quantity, and the amount. And there are lots and lots of these narrow records. However, there’s little useful information one can glean from this data alone. To get insights from the machine-generated data, we need to marry it with human-generated data. What type of product does the bar code represent, and who’s the manufacturer? On which aisle and shelf is the product stacked? Who is the employee at the cash register? What’s the geographical location of the store? These are all human generated. Human-generated data provides context, without which machine-generated data cannot be harnessed for insight.
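The dependency described above is, at heart, a join: the narrow machine-generated scan records only become meaningful once they are matched against a human-maintained product table. Here is a minimal sketch in Python; all field names, barcodes, and values are hypothetical, chosen purely to illustrate the point.

```python
# Machine-generated records: narrow, voluminous, and context-free on their own.
# (Hypothetical example data.)
scans = [
    {"barcode": "0123456789012", "ts": "09:15:00", "register": 7, "qty": 2, "amount": 5.98},
    {"barcode": "0987654321098", "ts": "09:16:30", "register": 7, "qty": 1, "amount": 3.49},
]

# Human-generated reference data: entered and curated by people.
products = {
    "0123456789012": {"name": "Sparkling Water 12-pack", "manufacturer": "Acme Beverages", "aisle": "4B"},
    "0987654321098": {"name": "Whole Wheat Bread", "manufacturer": "Daily Bakery", "aisle": "2A"},
}

def enrich(scan):
    """Join one raw scan with its human-entered product context."""
    context = products.get(scan["barcode"])
    if context is None:
        # Poor-quality master data leaves the scan uninterpretable.
        return {**scan, "name": "UNKNOWN"}
    return {**scan, **context}

enriched = [enrich(s) for s in scans]
```

Note that when the human-generated side is missing or wrong, the machine-generated record degrades to an unusable row, which is exactly the argument for data quality made below.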
Let’s turn back to the three little wolves. In the end they tame the big bad pig by softening him with a house built of fragrant flowers. To tame big bad data, we need to make sure that human-generated data is of high quality. And to get human beings to enter high-quality data, we need to change hearts and minds, creating a culture of data quality. This is the top objective of data governance.
Where the allegory breaks down is that we cannot just build houses of flowers. We also need big and fast boxes, because the data explosion is real. But big and fast boxes alone are not enough. We incur the cost of storing Big Data because we want to do something with it. And without high-quality human-generated data – which is Small Data by comparison – we wouldn’t keep Big Data for all the tea leaves in our china teapot.