As we continue Big Data Week at the Data Roundtable, David Loshin steps in to examine the benefits of analysis platforms like Hadoop and the problems with latency …
I will cut to the chase here: the current hysteria regarding the all-encompassing benefits of big data analytics, along with its catchy 3 (or 4) “V” formula of “Volume, Variety, Velocity” (and “Value”), is essentially predicated on the expectation that given an operational environment, a parallel file system, and a parallelizing programming environment, one can accomplish analyses in a much shorter time frame. This compressed “time to information” would enable real-time decision-making (or at least “faster” decision-making), and so each data management tool vendor is working as hard as it can to demonstrate that its tools can be aligned with Hadoop and thereby support “big data.”
There are benefits to providing a commodity-based high-performance computing platform for algorithmic implementation. And as long as the massive volumes of data are available, these high-performance computing engines (such as those developed using Hadoop) should perform reasonably well. The bottleneck, though, is the data.
Using a big data analysis platform such as one built using Hadoop, you benefit from the inherent parallelization of execution, but the hidden cost is the latency associated with data access. Accessing data from disk is slow enough, but imagine trying to pump petabytes through limited network bandwidth to stream the data into the analytical platform. The latency associated with data motion is often glossed over when reporting application execution times, but I wonder whether those fantastic execution times would look as good if you added in the latency associated with the data access.
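To make that concern concrete, a quick back-of-envelope calculation helps. The one-petabyte volume and the 10 Gb/s link speed below are illustrative assumptions, not measurements from any particular system:

```python
# Back-of-envelope cost of data motion. The 1 PB volume and the
# 10 Gb/s link are illustrative assumptions, not measurements.

DATA_BYTES = 1e15            # 1 petabyte of input data
LINK_BITS_PER_SEC = 10e9     # a single 10 Gb/s network link

transfer_seconds = DATA_BYTES * 8 / LINK_BITS_PER_SEC
print(f"{transfer_seconds / 86400:.1f} days of pure data transfer")
# prints "9.3 days of pure data transfer" -- before any analysis runs
```

Over nine days of wall-clock time spent just moving the bytes, and that is with the link fully saturated and no disk seek overhead counted at all.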
The challenge is that the latency problem gets worse as the data volumes grow, so if you are developing a big data application and testing it using a reasonably sized data set, you might not even notice the tax that the data loading process is assessing. However, you must consider the scalability issues: if your parallel environment hasn’t been fitted with equally scalable networking and I/O channels, you will eventually feel the pinch. In fact, MapReduce is not immune to the latency issue, since each transition between the Map and Reduce phases will, by necessity, shuffle data across the system’s network.
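For readers who want to see where that network cost lives, here is a toy, single-process sketch of the MapReduce data flow; the word-count example and function names are illustrative, not any framework’s API. In a real cluster, the shuffle step is the one that moves every intermediate key-value pair across the network:

```python
from collections import defaultdict

def map_phase(records):
    # Runs locally on each node, near the data.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # In a real cluster, THIS is the step that ships every
    # intermediate (key, value) pair across the network so that
    # all values for a given key land on the same reducer node.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    for key, values in grouped:
        yield (key, sum(values))

records = ["big data big latency", "data motion costs"]
print(dict(reduce_phase(shuffle(map_phase(records)))))
# {'big': 2, 'data': 2, 'latency': 1, 'motion': 1, 'costs': 1}
```

The map and reduce steps scale out beautifully; the shuffle in the middle is where the scalable networking and I/O channels either keep up or become the pinch point.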
Since data access and exchange latency is the limiting factor for big data performance, it has the potential to be the cloud that rains on the big data parade. Alleviating the latency bottleneck is therefore likely to become a key issue for anyone who wants to tackle really big data. That means high-performance data integration.
Yes, even though the term “data integration” is not as sexy or mesmerizing as “big data,” it might be the name of the technology that enables high-performance analytics. And accommodating the massive volumes implies a number of key considerations for companies providing data integration technology. Here are some things to think about:
- Optimizing communication channels to provide pipelined streaming of data (see the sketch after this list);
- Embedding computation within the communication network (remember “active networks” research?);
- Data federation and virtualization;
- High-speed virtual data caching;
- Query optimization prior to “pushing down” to the source;
- Integrated event stream processing within the integration layers;
- Dynamic data realignment (to take advantage of alternate record layouts and orientations);
- Bulk data loading;
- Data replication.
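To make the first item on that list concrete, here is a minimal sketch of pipelined streaming, assuming a bounded in-memory buffer between a reader stage and a compute stage; the chunk source and stage names are hypothetical placeholders:

```python
import threading, queue

# A minimal sketch of pipelined streaming: data transfer and
# computation overlap instead of running strictly in sequence.
# fetch_chunks stands in for a network or disk read; the chunk
# contents and sizes here are placeholders.

def fetch_chunks(out_q, n_chunks=10):
    for i in range(n_chunks):
        chunk = f"chunk-{i}"        # stand-in for an I/O read
        out_q.put(chunk)
    out_q.put(None)                 # sentinel: no more data

def process_chunks(in_q):
    while (chunk := in_q.get()) is not None:
        # analysis runs on chunk i while chunk i+1 is still in flight
        print("processed", chunk)

pipeline = queue.Queue(maxsize=2)   # bounded buffer applies backpressure
reader = threading.Thread(target=fetch_chunks, args=(pipeline,))
reader.start()
process_chunks(pipeline)
reader.join()
```

The point of the bounded queue is that the consumer never waits for the entire data set to arrive; it pays the transfer latency for only the first chunk, and the rest of the data motion hides behind the computation.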
That is just a short laundry list. But I am pretty convinced that solutions for high-performance computation without a strategy for high-performance data movement are bound to be bound by the latency inherent in data access.