Big Data: Structure and Quality
May 16, 2012 by Jim Harris in Big Data
It’s Wednesday on Big Data Week, and Jim Harris expands upon the definition of Big Data by looking at quality and structure…
In a previous post, I noted, as many others also have, that Big Data is about more than just data volume. Its other two most commonly cited characteristics – variety and velocity – further complicate the big data challenge. But before I continue, please permit me to greatly oversimplify traditional data management as a two-step process:
- Structure
- Qualify
The easiest example of step one is the relational model, which has dominated the data management industry since the 1980s, fostering the long-held belief that data has to be structured before it can be used. Step two reflects the equally long-held belief, at least among data quality professionals, that data also has to be qualified before it can be used (by verifying completeness, validity, accuracy, etc.).
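To make the two steps concrete, here is a minimal Python sketch of "structure, then qualify" applied to a couple of raw text lines. The `Customer` schema, the field names, and the specific checks are all assumptions invented for illustration; a real data quality process would use far richer rules, and checking accuracy would need a trusted reference source that this sketch does not have.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Customer:
    """Hypothetical schema; the fields are assumptions for this example."""
    name: str
    email: str
    age: Optional[int]


def structure(raw: str) -> Customer:
    """Step 1: impose a fixed schema on a raw comma-separated line."""
    name, email, age = (field.strip() for field in raw.split(","))
    return Customer(name=name, email=email,
                    age=int(age) if age.isdigit() else None)


def qualify(customer: Customer) -> List[str]:
    """Step 2: list the quality problems found (empty list means it passes)."""
    problems = []
    if not customer.name:
        problems.append("name is missing")            # completeness check
    if "@" not in customer.email:
        problems.append("email is not valid")         # crude validity check
    if customer.age is None or not 0 < customer.age < 120:
        problems.append("age is missing or implausible")
    return problems


raw_lines = [
    "Ada Lovelace, ada@example.com, 36",
    ", not-an-email, abc",
]
records = [structure(line) for line in raw_lines]

# Only records that pass every check are considered usable.
usable = [record for record in records if not qualify(record)]
print(len(usable), "of", len(records), "records are usable")
```

Every record has to pass through both functions before anything downstream can touch it, which is exactly the methodical, and therefore slower, path described here.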
These two steps require a methodical approach that is slower than the velocity of big data, which refers to not only how fast data is being produced, but also how fast data must be processed to meet demand. And the biggest increase in volume comes from the variety of big data, which consists mostly of unstructured or semi-structured data. So, from my perspective, most of the big data angst is about the fear that traditional data management techniques cannot effectively and efficiently structure and qualify big data before it has to be used.
Different Uses, Different Approaches
We must acknowledge that some big data use cases differ considerably from traditional use cases, requiring us to reevaluate how we structure big data and how we assess the quality of big data.
An excellent example is sentiment analysis, which analyzes large amounts of largely unstructured data in an attempt to understand how customers think and feel about products and services. By its very nature, determining the sentiments your customers have requires a different data management approach than, for example, determining the number of duplicate customer records you have.
In his book The Secret Life of Pronouns: What Our Words Say About Us, social psychologist and language expert James Pennebaker shared insights from his groundbreaking research in computational linguistics – in essence, counting the frequency of words we use – to show that our language carries secrets about, among other things, our thoughts and feelings.
“Sociolinguists,” Pennebaker explained, “focus on broad social dimensions such as gender, race, social class, and power. Their approach is qualitative, involving recording and analyzing conversations on a case-by-case basis. It is slow, painstaking work. Over the course of a year, a good sociolinguist may analyze only a few interactions. Whereas the qualitative approach is powerful at getting an in-depth understanding of a small group of interactions, the methods are not designed to get an accurate picture of an entire society or culture. This is where computer-based text analysis methods can help. By analyzing the blogs of hundreds of thousands of people, for example, the computer-based methods can quickly determine the nature of gender differences as a function of age, class, native language, region, and other domains. In other words, a relatively slow but careful qualitative approach can give us an in-depth view of a small group of people; a computer-based quantitative approach provides a broader social and cultural perspective. The two methods, then, complement each other in ways that the two research camps often fail to appreciate.”
So, the loosely structured, quantitative approach counts and categorizes the individual words in a very large data set to assess a general, but broad, aggregated sentiment (e.g., positive, negative, or neutral). That is very different from the highly structured, qualitative approach, which evaluates the complete sentences and paragraphs in a very small data set to assess a more specific, but narrow, detailed sentiment (i.e., providing more comprehensive and contextual feedback).
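As a rough illustration of the word-counting approach, the Python sketch below labels each piece of text positive, negative, or neutral by counting matches against two tiny word lists, then aggregates the labels. The word lists and the sample reviews are made up for this example; this is not Pennebaker's method or any particular product's algorithm, just the simplest version of the idea.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny sentiment lexicons (assumptions for the example).
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "broken", "slow"}


def word_counts(text: str) -> Counter:
    """Lowercase the text and count each word; no grammar or context is kept."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sentiment(text: str) -> str:
    """Compare how often positive and negative words appear."""
    counts = word_counts(text)
    positive_hits = sum(counts[word] for word in POSITIVE)
    negative_hits = sum(counts[word] for word in NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


reviews = [
    "I love this phone, the camera is great",
    "Terrible battery and the screen arrived broken",
    "It is a phone",
]

# Aggregate the per-review labels into one broad picture.
overall = Counter(sentiment(review) for review in reviews)
print(overall)
```

Because only individual words are counted, a phrase like "not great" would still score as positive. That is the trade-off in miniature: the method scales easily to a very large data set, but it gives up the context that a careful qualitative reading would catch.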
Another excellent example of a data management solution that relies on a loosely structured, quantitative approach is Internet search engines, which, to greatly oversimplify, rank their results in large part according to the frequency with which the keywords in your search query appear on websites. Of course, as we all know, this doesn’t always guarantee the highest quality search results, but it does enable us to very quickly search a very large number of websites from a wide variety of sources.
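The keyword-frequency idea can be sketched in a few lines of Python. This is a toy under stated assumptions: the page labels and page text are invented, and real search engines combine many more signals than raw term counts, so treat it as an illustration of the principle rather than a description of any actual engine.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query: str, pages: Dict[str, str]) -> List[str]:
    """Order pages by how often the query's keywords appear in them."""
    terms = tokenize(query)
    scores = {}
    for label, text in pages.items():
        counts = Counter(tokenize(text))
        scores[label] = sum(counts[term] for term in terms)
    # Drop pages with no matches, then sort from highest to lowest score.
    matches = [label for label in scores if scores[label] > 0]
    return sorted(matches, key=lambda label: scores[label], reverse=True)


# Hypothetical pages, keyed by a made-up label.
pages = {
    "page-a": "big data is big",
    "page-b": "data quality matters",
    "page-c": "the relational model",
}

print(rank("big data", pages))
```

Nothing here checks whether a page is accurate or trustworthy, only how often the words match, which is why this kind of ranking is fast across a huge number of sources but cannot guarantee quality.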
Discussing Structure and Quality
Big data discussions often turn into debates due to the misperception that big data always requires sacrificing structured data quality in favor of unstructured or semi-structured data quantity. But the reality is that sometimes one of these approaches will be more applicable for certain use cases, and other times, these approaches will complement each other in ways that data management professionals may fail to appreciate, or perhaps just reflexively refuse to accept.
In my opinion, in order to move the big data discussion forward, and, more importantly, enable our organizations to develop strategies for using big data to solve business problems, we have to stop fiercely defending our traditional data management perspectives about structure and quality.




