datascience@berkeley Blog

Bridging the Divide between Unstructured and Structured Data

The following is a guest post from Jim Harris of OCDQ Blog.

“When we first started digitizing our world in the 20th century,” Chris Taylor explained in his Wired article “What’s the Big Deal With Unstructured Data?,” “we first went after the low-hanging fruit of transactional data—accounting. It was a quick win to transfer spreadsheets of information into neat columns and rows.” The data in those neat columns and rows is what’s referred to as structured data.

Consider, for example, the transactional data involved when you purchase a smartphone. The data that describes the smartphone includes a manufacturer name, model name, display size, and sale price. The data that describes you includes your name, credit card number, billing address, and phone number.

These attributes split the data into columns, enabling, for example, the manufacturer name to be easily distinguished from the model name. The associated data values populate the rows under those columns. Rules can be applied to ensure that formats are consistent (e.g., phone numbers always have a hyphen separating the area code from the local number) and values are validated (e.g., a billing address is verified for postal delivery). Structuring data into columns and rows, which are often stored in relational database tables, makes data easier to use for tasks such as querying (e.g., using a billing address to geographically segment customers) and aggregating (e.g., calculating total sales by the model name to determine the best-selling smartphones).
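To make the columns-and-rows idea concrete, here is a minimal sketch in Python. The sample rows, the column names, and the phone-number pattern are all assumptions made up for illustration; a real system would keep this data in relational database tables and apply its own validation rules.

```python
import re
from collections import defaultdict

# Assumed format rule: area code, hyphen, then the local number (e.g. 415-555-0100).
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical purchase rows; every name and value here is invented for the example.
SALES = [
    {"customer": "Ann", "manufacturer": "Acme", "model": "A1",
     "sale_price": 199.0, "phone": "415-555-0100", "state": "CA"},
    {"customer": "Bob", "manufacturer": "Acme", "model": "A1",
     "sale_price": 199.0, "phone": "212-555-0101", "state": "NY"},
    {"customer": "Cy", "manufacturer": "Bolt", "model": "B7",
     "sale_price": 349.0, "phone": "4155550102", "state": "CA"},
]


def is_valid_phone(phone):
    """Return True when the phone number matches the assumed format rule."""
    return PHONE_PATTERN.fullmatch(phone) is not None


def total_sales_by_model(rows):
    """Aggregate: sum the sale price for each model name."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["model"]] += row["sale_price"]
    return dict(totals)


def customers_by_state(rows):
    """Query: segment customers geographically by the state in their billing address."""
    segments = defaultdict(list)
    for row in rows:
        segments[row["state"]].append(row["customer"])
    return dict(segments)
```

Because every row carries the same named columns, each of these tasks is only a few lines long. The third sample row is deliberately missing its hyphens, so the format rule rejects it.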

In contrast, unstructured data, as Darin Stewart explained in his Gartner blog post “Big Content: The Unstructured Side of Big Data,” “does not conform to a specific, pre-defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly into database tables.” More importantly, until relatively recently most of this human-generated and people-oriented content was not digitized, meaning it was not recorded as data in any format. Our conversations, for example, used to be verbal exchanges with no record of what was said or by whom. Nowadays, many of our conversations are conducted via data exchanges and recorded in a variety of formats such as emails, voicemails, text messages, and social networking status updates. Other examples of unstructured data include documents, blog posts, images, videos, and web search histories.

As Intel explained in its free paper “Big Data 101: Unstructured Data Analytics,” Apache Hadoop has proven in recent years to be a cost-effective, fast, and massively scalable tool for handling unstructured data. Hadoop is open-source software for reliable, scalable, distributed computing, and commercial versions are also available from several vendors.

Hadoop is commonly grouped with a category of tools referred to as NoSQL, a name that differentiates them from relational databases. Unlike relational databases, which require a predefined and predictable format, NoSQL data stores and the Hadoop Distributed File System (HDFS) can hold data in any format. In that sense, a better term for unstructured data might be unpredictably structured data. This flexibility has made such data stores one of the most popular ways for organizations to collect unstructured data from a variety of sources.
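A small sketch of what “unpredictably structured” can look like in practice follows. It is plain Python rather than Hadoop code, and the three sample records are invented for illustration. Each record is valid on its own, but no two share the same set of fields, so there is no single table layout to load them into up front.

```python
import json
from collections import Counter

# Hypothetical records collected from different sources, one JSON document per line.
RAW_RECORDS = [
    '{"type": "email", "from": "ann@example.com", "body": "Is my order shipped?"}',
    '{"type": "tweet", "handle": "@ann_sf", "text": "Poor 4G coverage."}',
    '{"type": "voicemail", "duration_seconds": 42}',
]


def field_counts(lines):
    """Count how often each field name appears across all records."""
    counts = Counter()
    for line in lines:
        counts.update(json.loads(line).keys())
    return counts
```

Here only the "type" field appears in every record. Everything else has to be discovered when the data is read, not declared before it is stored.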

Unstructured data is being collected because of its potential business value. “Content from the social stream,” for example, as Stewart explained, “can be a direct line into the hearts and minds of customers. Blogs, tweets, comments, and ratings are a reflection of the current state of public sentiment at any given point in time.”

However, in order to capitalize on this potential business value, organizations must bridge the divide between unstructured and structured data.

Returning to the earlier example, imagine that after your smartphone purchase you begin posting status updates on Twitter and Facebook complaining about the poor 4G coverage area of your mobile service provider. Assuming the provider detected those status updates and determined that their sentiment was negative, how is it supposed to know this feedback came from an actual customer, let alone a specific customer? And there’s the rub.

“Extraction, transformation, and loading (ETL) techniques from unstructured data sources will still need to be written in order to make the data into usable and business-ready forms,” Bill Schoonmaker explained in his Forbes article “Unstructured Data Can Create Chaos.” “The data will need to be tied to valid business entities such as users, clients, or customers. Without the knowledge of what a piece of unstructured data is directly tied to, it will be difficult, if not impossible, to derive any real value from it.”
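As a rough illustration of tying unstructured data to business entities, the following Python sketch runs a toy version of those steps. Everything in it is an assumption made for the example: the sample posts, the customer table keyed by social media handle, and the keyword list, which is a crude stand-in for real sentiment analysis.

```python
# Crude stand-in for sentiment analysis: a post is negative if it uses one of these words.
NEGATIVE_WORDS = {"poor", "terrible", "dropped", "awful"}

# Hypothetical structured customer table, keyed by a known social media handle.
CUSTOMERS = {"@ann_sf": {"customer_id": 101, "name": "Ann"}}

# Hypothetical status updates extracted from the social stream.
POSTS = [
    {"handle": "@ann_sf", "text": "Poor 4G coverage again today."},
    {"handle": "@stranger", "text": "Terrible signal downtown!"},
    {"handle": "@ann_sf", "text": "Love the new phone."},
]


def is_negative(text):
    """Transform step: flag a post as negative using the keyword list."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return bool(words & NEGATIVE_WORDS)


def link_posts(posts, customers):
    """Tie each post to a customer record when the handle is known."""
    linked, unlinked = [], []
    for post in posts:
        record = {
            "handle": post["handle"],
            "text": post["text"],
            "negative": is_negative(post["text"]),
        }
        customer = customers.get(post["handle"])
        if customer is None:
            # Sentiment is known, but not whose sentiment it is.
            unlinked.append(record)
        else:
            record["customer_id"] = customer["customer_id"]
            linked.append(record)
    return linked, unlinked
```

Calling link_posts(POSTS, CUSTOMERS) returns two lists. The post from the unknown handle lands in the unlinked list: its negative sentiment was detected, but with no customer record to attach it to, the provider cannot act on it. In practice the hard part is the lookup itself, since few organizations hold a reliable mapping from social media handles to customer accounts.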