Interest in unstructured data is high these days. Whether it takes the more “traditional” forms, such as video and other large bit-stream objects, or newer ones, such as clickstream data from the web, many people are devoting a great deal of time and effort to building and managing systems that collect and store it. Indeed, the explosion of this category of data is, at least in part, what is making “big data” so big.
However, it is easy to forget that unstructured data must have structure added to it in order to be useful. Simply collecting a lot of data that contains what visitors to a website clicked on is not terribly useful unless I can start to analyze that data by things like what task the click performed, where clicks on links took the user, and who the user was in the first place. This means classification – classification of the links, the web pages, the users, where they came from, etc.
What is interesting is that, while some raw information may be obtainable (URLs, information offered by the user or stored in cookies, etc.), how to classify it is often neither present in nor inferable from the captured data. For example, if we want to understand what interests visitors to a website, we need to classify the internal links they follow, the places they came from, and the places they leave for. Simply presenting a list of URLs in a report is far less useful than grouping them by internal vs. external, by topic, or by whether they are competitors, search engines or partners. This requires input from people, who must review the data, create the classifications, and group the captured data. In other words, we need people to add structural context.
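As a rough illustration of what that structural context might look like, here is a minimal Python sketch that groups raw click URLs using lookup tables maintained by people. Everything in it is an assumption made for the example: the site host, the external domains, the topic names, and the function names `classify` and `summarize` are hypothetical, and a real system would more likely keep these tables in a database that business users can edit.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical host and classification tables. None of this can be inferred
# from the clicks themselves; people have to supply and maintain it.
SITE_HOST = "www.example.com"
EXTERNAL_CATEGORIES = {
    "www.google.com": "search engine",
    "www.bing.com": "search engine",
    "www.rival.example": "competitor",
    "www.reseller.example": "partner",
}
INTERNAL_TOPICS = {
    "products": "product pages",
    "support": "support",
}


def classify(url):
    """Return an (internal/external, category) pair for one captured URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == SITE_HOST:
        # Use the first path segment as a crude stand-in for the page topic.
        first_segment = parsed.path.strip("/").split("/")[0]
        return ("internal", INTERNAL_TOPICS.get(first_segment, "unclassified"))
    return ("external", EXTERNAL_CATEGORIES.get(host, "unclassified"))


def summarize(urls):
    """Count clicks per classification instead of listing raw URLs."""
    return Counter(classify(u) for u in urls)


if __name__ == "__main__":
    clicks = [
        "https://www.example.com/products/widgets",
        "https://www.example.com/support/contact",
        "https://www.google.com/search?q=widgets",
        "https://www.rival.example/pricing",
        "https://unknown.example/page",
    ]
    for (scope, category), count in sorted(summarize(clicks).items()):
        print(scope, category, count)
```

The "unclassified" bucket matters as much as the named ones: it is the queue of captured data that still needs a person to review it and decide where it belongs.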
This points to a simple fact: managing big data (even structured big data) differs from any other kind of data management in only one critical respect, the number of rows. A business-oriented model for management and storage is crucial. This model should not simply collect every bit of information that might be needed; it must collect all commercially significant data in a structure that lends itself to analysis by the business, using criteria and metrics determined to be relevant to the business (for more on Business Relevance see Mike Wheeler’s blog post). In other words, it is not enough to know what information you have; you also need to know how to use it to run the business.
This is where I have seen so many projects falter. They collect lots of data and build huge databases that require farms of servers to store and process, yet they still cannot answer even basic business questions. The problem is not that the data isn’t collected; it is that the system lacks the structure required to turn the data into useful information. The analytical structure must be created and managed in close alignment with the commercial goals of the enterprise. Ideally, business users should be empowered to create and manage the classifications and values themselves; if that is not possible, they need, at a minimum, to be closely involved in defining them and in periodically reviewing and updating them.
Grouping, sorting and classifying information is how we find patterns and gain insight into our business and the markets in which it operates. Without a well-defined and relevant analytical structure, even the most complete and robust collection of data, structured, unstructured, or both, is nearly useless.