Most people don’t think of a book as data.
In fact, while I was editing 101 Lightbulb Moments in Data Management, I used a very unorthodox, very structured approach to manage the hundreds of posts from the roundtablers on data-oriented topics. Why? The short answer: it made sense. The longer answer: ultimately, I wanted the book to be balanced in a number of ways, including:
- number of posts per contributor
- number of posts by topic
- number of posts by contributor by topic
I strove to represent each of the contributors and topics as equally as possible. While I consider myself friendly with each member of this forum, I didn’t want someone complaining to me that Jim Harris’s 18 posts trumped his or her 14 posts. I also didn’t want the data quality section, for instance, to contain 100 pages while there were only 20 pages on an equally important topic like data governance.
And, I’ll admit it. I didn’t want to tick off Dylan Jones.
You don’t either.
Also, I wanted to tell the folks at DataFlux exactly where we were light. I’d frequently tell Scott Batchelor that we needed more MDM content or “more Loshin.” (In case you were wondering, there’s no such thing as “enough Loshin.”)
It’s like cowbell.
Now, I’m a big pivot table guy (link may not work due to SOPA protests). I exported all WordPress posts from this site into a flat file, imported it into Excel, saved it as a proper workbook, and dutifully kept track of where we were at any given point. I could always tell you exactly how many lightbulb moments had been chosen, how many Jill Dyché had contributed (and about what), and even how many posts contained Rush references (just about all of them).
Rush is also like cowbell.
I seriously doubt that too many people have thought of editing a book in this manner, but I stand by my methods. You see, to me, just about everything is data. A book is no exception. Yes, the data is typically unstructured, but that doesn’t mean that it has to stay that way. Nor does it mean that you can’t apply a little structure.
No, I didn’t count how many times Jim Harris riffed on a song or the average length of David Loshin’s posts, although I could have. That would have been overkill. Still, being able to answer simple questions quickly made managing the whole project much, much easier than it would otherwise have been.
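For anyone who would rather script this than click through Excel, here is a minimal sketch of the same three counts (posts per contributor, per topic, and per contributor by topic). It is not the workbook described above. The column names, contributor labels, and rows are all made up for illustration, and a real WordPress export will look different.

```python
import csv
import io
from collections import Counter

# Hypothetical flat-file export: one row per post. The column names and
# the rows themselves are invented for this example.
EXPORT = """contributor,topic,title
Contributor A,data quality,Post 1
Contributor A,data governance,Post 2
Contributor B,data quality,Post 3
Contributor B,MDM,Post 4
Contributor C,data quality,Post 5
"""


def count_posts(rows):
    """Return post counts by contributor, by topic, and by (contributor, topic)."""
    by_contributor = Counter()
    by_topic = Counter()
    by_both = Counter()
    for row in rows:
        by_contributor[row["contributor"]] += 1
        by_topic[row["topic"]] += 1
        by_both[(row["contributor"], row["topic"])] += 1
    return by_contributor, by_topic, by_both


rows = list(csv.DictReader(io.StringIO(EXPORT)))
by_contributor, by_topic, by_both = count_posts(rows)

# List topics from heaviest to lightest to see where the book is thin.
for topic, n in by_topic.most_common():
    print(f"{topic}: {n}")
```

The `by_both` counter plays the role of the pivot table: each key is a contributor and topic pair, so a pair that never shows up is a gap worth asking someone to fill.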
I’d argue that most people would benefit from approaching their unstructured data in a similar manner. To wit, there’s no such thing as completely unstructured data. It’s a myth, a spook story.
You can always assign times, dates, and handles to tweets. You can use semantic analysis on blog posts and web pages.
Just because data is initially unstructured doesn’t mean that it has to stay that way.
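To make the tweet example concrete, here is a rough sketch of pulling a handle, a timestamp, and hashtags out of a line of free text. The raw line format ("@handle YYYY-MM-DD HH:MM text") and the `structure` function are assumptions for illustration only. Real tweets arrive through an API with these fields already separated, and semantic analysis of blog posts is a much bigger job than a regular expression.

```python
import re
from datetime import datetime

# Assumed raw format (hypothetical): "@handle YYYY-MM-DD HH:MM free text"
LINE = re.compile(
    r"^@(?P<handle>\w+)\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+"
    r"(?P<text>.*)$"
)


def structure(raw):
    """Turn one raw line into a dict of fields, or None if it does not match."""
    match = LINE.match(raw)
    if match is None:
        return None
    text = match.group("text")
    return {
        "handle": match.group("handle"),
        "posted_at": datetime.strptime(match.group("stamp"), "%Y-%m-%d %H:%M"),
        "hashtags": re.findall(r"#(\w+)", text),
        "text": text,
    }


# Made-up sample line, not a real tweet.
record = structure("@example_user 2012-01-18 14:05 Everything is data #dataquality")
print(record["handle"], record["posted_at"].date(), record["hashtags"])
```

Once every line is a record like this, the counting and pivoting described earlier apply to tweets just as they do to blog posts.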
What say you?