Chapter 5. Recipe: Pick Failure-Recovery Strategies

Idea in Brief

Most streaming applications move data through multiple processing stages. In many cases, events land in a queue and are then read by downstream components. Those components might write new data back to a queue as they process it, or they might stream data directly to their own downstream components. Building a reliable data pipeline requires designing failure-recovery strategies.

With multiple processing stages connected, eventually one stage will fail, become unreachable, or otherwise become unavailable. When this occurs, the other stages continue to receive data. When the failed component comes back online, it typically must recover some previous state and then begin processing new events. This recipe discusses where it should resume processing.
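For example, a common approach is for each stage to periodically checkpoint the identifier of the last event it has fully processed and, on restart, to resume reading from that checkpoint. The following is a minimal sketch of that idea, not a prescription from this recipe: it assumes events carry a monotonically increasing sequence number and uses a local file as the checkpoint store, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stage that resumes from a durable checkpoint after a restart. */
public class CheckpointedStage {
    private final Path checkpointFile;
    private long lastProcessedSeq;

    public CheckpointedStage(Path checkpointFile) throws IOException {
        this.checkpointFile = checkpointFile;
        // On startup, recover the last checkpointed sequence number (or start before 0).
        this.lastProcessedSeq = Files.exists(checkpointFile)
                ? Long.parseLong(Files.readString(checkpointFile, StandardCharsets.UTF_8).trim())
                : -1L;
    }

    /** Sequence number from which to resume reading the input queue. */
    public long resumeFrom() {
        return lastProcessedSeq + 1;
    }

    /** Process one event, then persist the checkpoint so a crash replays at most this event. */
    public void process(long seq, String payload) throws IOException {
        if (seq <= lastProcessedSeq) {
            return; // already applied before the crash; skip the duplicate
        }
        // ... apply the business logic for `payload` here ...
        lastProcessedSeq = seq;
        Files.writeString(checkpointFile, Long.toString(seq), StandardCharsets.UTF_8);
    }
}
```

Because the checkpoint is written only after an event has been applied, a crash between those two steps causes that event to be replayed on restart; handling such replays safely is where idempotency comes in.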

Note

The idempotency recipe discusses a specific technique to achieve exactly-once semantics.
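As a quick illustration of the general idea (a simplified sketch, not the specific technique from that recipe), a sink can remember the identifiers of events it has already applied so that a replayed event has no additional effect. A production version would persist the set of seen identifiers or derive idempotency from the target store itself; the names below are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified idempotent sink: applying the same event twice has the same effect as applying it once. */
public class IdempotentSink {
    private final Set<String> appliedEventIds = new HashSet<>();
    private long total;

    /** Apply an event at most once, even if an upstream stage redelivers it after a failure. */
    public void apply(String eventId, long amount) {
        if (!appliedEventIds.add(eventId)) {
            return; // duplicate delivery; ignore
        }
        total += amount;
    }

    public long total() {
        return total;
    }
}
```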

Additionally, in processing pipelines that are horizontally scaled, where each stage has multiple servers or processes running in parallel, a subset of the servers within the cluster can fail. In this case, the failed server needs to be recovered, or its work needs to be reassigned, as sketched below.
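One simple way to think about reassignment is to treat the stage's input as a set of partitions and to redistribute a failed worker's partitions across the workers that are still alive. The sketch below uses a round-robin assignment with hypothetical worker names; real systems usually delegate this to a coordinator, such as a queue's consumer-group or leader-election mechanism.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified round-robin assignment of input partitions across live workers. */
public class PartitionAssigner {

    public static Map<String, List<Integer>> assign(List<String> liveWorkers, int partitionCount) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String worker : liveWorkers) {
            assignment.put(worker, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            // Spread partitions evenly over whichever workers are currently alive.
            String worker = liveWorkers.get(p % liveWorkers.size());
            assignment.get(worker).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // With workers {A, B, C} and 6 partitions, each worker owns two partitions.
        System.out.println(assign(List.of("A", "B", "C"), 6));
        // If worker C fails, rerunning the assignment over the survivors redistributes its partitions.
        System.out.println(assign(List.of("A", "B"), 6));
    }
}
```

Rerunning the assignment whenever cluster membership changes is essentially what a rebalance does; whichever worker takes over a partition then resumes it from that partition's checkpoint.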

There are a few factors that complicate these problems and lead to different trade-offs.

First, it is usually uncertain which event was the last one processed. It is typically not technologically feasible, for example, to run a two-phase commit over event processing across all pipeline components. Typically, unreliable communication ...
