David Loshin, Cowbell, and the Myth of (Completely) Unstructured Data

Mar 01, 2012 by Phil Simon in Data Management

Most people don’t think of a book as data.

I do.

In fact, while I was editing 101 Lightbulb Moments in Data Management, I used a very unorthodox, very structured approach to manage the hundreds of posts from the roundtablers on data-oriented topics. Why? The short answer: it made sense. The longer answer: Ultimately, I wanted the book to be balanced in a number of ways, including:

  • number of posts per contributor
  • number of posts by topic
  • number of posts by contributor by topic

I strived to represent each of the contributors and topics as equally as possible. While I consider myself friendly with each member of this forum, I didn’t want someone complaining to me that Jim Harris’s 18 posts trumped his or her 14 posts. I also didn’t want the data quality section, for instance, to contain 100 pages while there were only 20 pages on an equally important topic like data governance.

And, I’ll admit it. I didn’t want to tick off Dylan Jones.

You don’t either.

Ever.

Also, I wanted to tell the folks at DataFlux exactly where we were light. I’d frequently tell Scott Batchelor that we needed more MDM content or “more Loshin.” (In case you were wondering, there’s no such thing as “enough Loshin.”)

It’s like cowbell.

Structure This!

Now, I’m a big pivot table guy (link may not work due to SOPA protests). I exported all WordPress posts from this site into a flat file, imported it into Excel, saved it as a proper workbook, and dutifully kept track of where we were at any given point. I could always tell you exactly how many lightbulb moments had been chosen, how many Jill Dyché had contributed (and about what), and even how many posts contained Rush references (just about all of them).

Rush is also like cowbell.

I seriously doubt that too many people have thought of editing a book in this manner, but I stand by my methods. You see, to me, just about everything is data. A book is no exception. Yes, the data is typically unstructured, but that doesn’t mean that it has to stay that way. Nor does it mean that you can’t apply a little structure.
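For anyone who would rather script it than pivot it, here is a minimal sketch of that kind of bookkeeping in Python. It is not the actual workbook: the column names (`author`, `topic`, `title`), the sample rows, and the helper names `pivot_counts` and `light_topics` are all hypothetical, chosen only to illustrate counting posts by contributor, by topic, and by contributor by topic.

```python
import csv
import io
from collections import Counter

# Hypothetical flat-file export: one row per post. The column names and rows
# are made up for illustration; they are not taken from the real export.
SAMPLE_EXPORT = """author,topic,title
Contributor A,Data Quality,Post 1
Contributor A,Data Governance,Post 2
Contributor B,Data Quality,Post 3
Contributor B,MDM,Post 4
Contributor A,Data Quality,Post 5
"""


def pivot_counts(flat_file_text):
    """Count posts per author, per topic, and per (author, topic) pair."""
    by_author = Counter()
    by_topic = Counter()
    by_author_topic = Counter()
    for row in csv.DictReader(io.StringIO(flat_file_text)):
        by_author[row["author"]] += 1
        by_topic[row["topic"]] += 1
        by_author_topic[(row["author"], row["topic"])] += 1
    return by_author, by_topic, by_author_topic


def light_topics(by_topic, minimum):
    """Return topics with fewer than `minimum` posts, i.e. where content is light."""
    return sorted(topic for topic, count in by_topic.items() if count < minimum)


if __name__ == "__main__":
    by_author, by_topic, by_author_topic = pivot_counts(SAMPLE_EXPORT)
    print(dict(by_author))
    print(dict(by_topic))
    print(light_topics(by_topic, minimum=2))
```

The three counters correspond to the three kinds of balance listed earlier, and `light_topics` is one way to answer the "where are we light?" question without opening a spreadsheet.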

No, I didn’t count how many times Jim Harris riffed on a song or the average length of David Loshin’s posts–although I could have. That would have been overkill. Still, being able to simply answer questions made managing the whole project much, much easier than it would otherwise have been.

Simon Says

I’d argue that most people would benefit from approaching their unstructured data in a similar manner. To wit, there’s no such thing as completely unstructured data. It’s a myth, a spook story.

You can always assign times, dates, and handles to tweets. You can use semantic analysis on blog posts and web pages.
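To make that concrete, here is a rough sketch of pulling a few structured fields out of a single tweet. Everything in it is an assumption for illustration: the function name `structure_tweet`, the field names, and the idea that you already have the author handle, the text, and an ISO 8601 timestamp in hand. It uses only the Python standard library and does not talk to any real Twitter API.

```python
import re
from datetime import datetime

# Simple patterns for @handles and #hashtags. They are deliberately naive
# (for example, they would also match the "@example" in an email address).
HANDLE_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def structure_tweet(author, text, posted_at_iso):
    """Turn one free-text tweet into a flat record of structured fields."""
    posted_at = datetime.fromisoformat(posted_at_iso)
    return {
        "author": author,
        "date": posted_at.date().isoformat(),
        "time": posted_at.time().isoformat(),
        "mentions": HANDLE_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "word_count": len(text.split()),
    }


if __name__ == "__main__":
    # Hypothetical tweet, invented for this example.
    record = structure_tweet(
        "example_user",
        "More cowbell, @another_user! #dataquality",
        "2012-03-01T09:30:00",
    )
    print(record)
```

Once each tweet is a record like this, it can go into the same kind of flat file and pivot table as anything else.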

Just because data is initially unstructured doesn’t mean that it has to stay that way.

Period.

Feedback

What say you?

Tell DataFlux your “Data Disaster Story” and receive a free copy of 101 Lightbulb Moments in Data Management.

4 Responses to “David Loshin, Cowbell, and the Myth of (Completely) Unstructured Data”

  1. Jim Harris

    Mar 01, 2012

    Well structured blog post, Phil.

    Yes, data management always needs a little more Cowbell, a little more Loshin, and a lot more Rush :-)

    Unstructured data represents the largest segment of the rising data volumes we are seeing today. I definitely agree that completely unstructured data is a myth, and I think the disruptive paradigm shift that we face is reevaluating how much structure has to be imposed on data before it can be used.

    Historically, data had to be structured (and cleansed, transformed, integrated, etc.) before it was used. Not only is that approach becoming less practical because of how much data we are dealing with, but the reality is that data doesn’t always need a high degree of structure in order to be useful.

    And I think books are an excellent example of deriving value from somewhat unstructured data, which is why I used data management books to discuss deriving value from unstructured data in my video post DQ-View: Data Is as Data Does.

    Exit the Cowbell Warrior

  2. marc smith

    Mar 01, 2012

    with all due respect to David, Cowbell is like Rush.

  3. David Loshin

    Mar 02, 2012

    Somewhere around 250 words is the average.

  4. Phil Simon

    Mar 02, 2012

    Thanks for the comments, guys. If a book can be made more structured, then I think just about anything can.
