As a member of an expert panel on predictions for the upcoming year and the adoption of innovation at the recent DataFlux IDEAS conference, I was struck by how often two specific themes cropped up: “big data” and “Hadoop.” Those terms are the buzzwords du jour, and technology-related news items and press releases are chock-a-block with announcements about support for “big data” in general and “Hadoop” in particular. The challenge for most of us is to separate the media hype and PR noise from what is really relevant to our own daily activities.
Everyone is talking about big data as if the ever-increasing rate of data volume growth were a big surprise, especially as new sources spout never-ending streams of “data.” The quotation marks are my attempt at a droll sense of humor suggesting that, yes, a billion Twitter users are generating streams of data, but how much of that translates into actionable content and information? And of course (because like many others I have scanned the headlines filtered through various news aggregation sites that allow me to draw conclusions without having to read anything beyond the headlines), I know that the solution to the big data challenge is our new friend Hadoop, whose rapidly materializing ubiquity just demonstrates its potential value.
But what do we really use Hadoop for? And how does it address big data? And what does that mean from a data quality perspective? That is the thought-stream for the next set of entries.
By the way, here is another aside – fellow data roundtable blogger and expert-panel participant Rich Murnane commented on his measurement of the number of DICE postings asking for experience in Hadoop as opposed to ones looking for more mundane skills, such as SQL Server. He noted a rough 10% growth in Hadoop-related positions, while others stayed pretty level over a short time frame. I had two conflicting thoughts: either that means there are a lot of new Hadoop projects on the horizon, or the pool of skilled Hadoop programmers is really small.
More information on those postings might have provided more insight. A quick thought experiment: If I have 10 job postings, does that mean that there are 10 open positions? Not necessarily, since multiple recruiters may have been approached to fill the same position. A review of the content of the job postings would shed more light on that question. For example, if two postings are posted by recruiters and not by the hiring company, AND the locations are the same, AND the job descriptions are the same or similar, then those two postings are probably for the same position.
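That rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Posting` fields, the helper names, and the 0.8 similarity threshold are all hypothetical, and the character-level comparison from the standard library's `difflib` is a stand-in for whatever text analysis one would really use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Posting:
    # Hypothetical fields; a real job feed would need its own parsing.
    poster: str
    posted_by_recruiter: bool
    location: str
    description: str


def similar(a: str, b: str, threshold: float) -> bool:
    """True if two descriptions are the same or close, by character overlap."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def probably_same_position(p: Posting, q: Posting, threshold: float = 0.8) -> bool:
    """Both from recruiters AND same location AND same or similar description."""
    return (
        p.posted_by_recruiter
        and q.posted_by_recruiter
        and p.location.strip().lower() == q.location.strip().lower()
        and similar(p.description, q.description, threshold)
    )


def estimate_open_positions(postings: list[Posting], threshold: float = 0.8) -> int:
    """Merge postings that probably describe the same job and count the groups."""
    parent = list(range(len(postings)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(postings)):
        for j in range(i + 1, len(postings)):
            if probably_same_position(postings[i], postings[j], threshold):
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(postings))})
```

The pairwise comparison is fine for a handful of postings but grows quadratically with their number, which is exactly why counting at scale becomes a different kind of problem.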
And how could I figure out, on a grand scale, the *real* number of open Hadoop positions in relation to the number of Hadoop job postings? I could subject the data to text analysis. Using Hadoop, of course.