Chapter 4. Handling Time
One crucial difference between programming applications for a stream processor and a batch processor is the need to explicitly handle time. Let us take a very simple application: counting. We have a never-ending stream of events coming in (e.g., tweets, clicks, transactions), and we want to group the events by a key, and periodically (say, every hour) output the count of the distinct events for each key. This is the proverbial application for “big data” that is analogous to the infamous word-counting example for MapReduce.
Counting with Batch and Lambda Architectures
Even if this seems simple, counting is surprisingly difficult at scale and in practice, and, of course, appears everywhere. Other analytics, such as aggregations or operations on Online Analytical Processing (OLAP) cubes, are simple generalizations of counting. Using a traditional batch-processing architecture, we would implement this as shown in Figure 4-1.
Figure 4-1. Implementing continuous applications using periodic batch jobs. Data is continuously sliced into files, possibly on an hourly basis, and batch jobs are run with these files as input, giving an impression of a continuous processing of incoming data.
In this architecture, a continuous data ingestion pipeline creates files (typically stored in a distributed file store such as Hadoop Distributed File System [HDFS] or MapR-FS) ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access