Chapter 2. The Emergence of Streaming

Fast-forward to the last few years. Now imagine a scenario where Google still relies on batch processing to update its search index. Web crawlers constantly provide data on web page content, but the search index is updated only periodically, say once an hour.

Suppose a major news story breaks and someone does a Google search for information about it, assuming they will find the latest updates on a news website. They will find nothing if the next index update that reflects those changes is up to an hour away. Meanwhile, suppose that Microsoft Bing does incremental updates to its search index as changes arrive, so Bing can serve results for breaking-news searches. Obviously, Google is at a big disadvantage.

I like this example because indexing a corpus of documents can be implemented very efficiently and effectively with batch-mode processing, but a streaming approach offers the competitive advantage of timeliness. Couple this scenario with problems that are more obviously “real time,” like location-aware mobile apps and detecting fraudulent financial activity as it happens, and you can see why streaming is so hot right now.
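To make the contrast concrete, here is a minimal, hypothetical sketch (in no way how Google or Bing actually build their indexes) of the two approaches to maintaining an inverted index: a periodic full rebuild versus folding in each document as it arrives. All names and data are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index_batch(corpus: Dict[str, str]) -> Dict[str, Set[str]]:
    """Rebuild the entire inverted index from scratch (the hourly batch job)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def update_index_incremental(index: Dict[str, Set[str]],
                             doc_id: str, text: str) -> None:
    """Fold one new or changed document into the existing index as soon as
    the crawler delivers it (the streaming approach)."""
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)


if __name__ == "__main__":
    corpus = {"doc1": "local election results", "doc2": "sports scores today"}
    index = build_index_batch(corpus)

    # Breaking news arrives; the incremental update makes it searchable
    # immediately, instead of waiting for the next full rebuild.
    update_index_incremental(index, "doc3", "breaking news major storm")
    print(sorted(index["breaking"]))  # ['doc3']
```

The point of the sketch is the latency difference, not the data structure: both paths produce the same index, but the incremental path makes new documents searchable as soon as they arrive rather than after the next scheduled rebuild.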

However, streaming imposes significant new operational challenges that go far beyond just making batch systems run faster or more frequently. While batch jobs might run for hours, streaming jobs might run for weeks, months, even years. Rare events like network partitions, hardware failures, and data spikes become inevitable when jobs run that long.