Chapter 5Processing Streaming Data

Now that data is flowing through a data collection system, it must be processed. The original use case for both Kafka and Flume specified Hadoop as the processing system. Hadoop is, of course, a batch system. Although it is very good at what it does, it is hard to achieve processing rates with latencies shorter than about 5 minutes.

The primary source of this limit on the rate of batch processing is startup and shutdown cost. When a Hadoop job starts, a set of input splits is first obtained from the input source (usually the Hadoop Distributed File System, known as HDFS, but potentially other locations). Input splits are parceled into separate mapper tasks by the Job Tracker, which may involve starting new virtual machine instances on the worker nodes. Then there is the shuffle, sort, and reduce phase.

Although each of these steps is fairly small, they add up. A typical job start time requires somewhere between 10 and 30 seconds of “wall time,” depending on the nature of the cluster. Hadoop 2 actually adds more time to the total because it needs to spin up an Application Manager to manage the job. For a batch job that is going to run for 30 minutes or an hour, this startup time is negligible and can be completely ignored for performance tuning. For a job that is running every 5 minutes, 30 seconds of start time represents a 10 percent loss of performance.

Real-time processing frameworks, the subject of this chapter, get around this setup and ...

Get Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.