Chapter 24. Checkpointing

The act of checkpointing consists of regularly saving the information necessary to restart a stateful streaming application without losing information and without requiring the reprocessing of all of the data seen up to that point.

Checkpointing merits particular attention when dealing with stateful Spark Streaming applications. Without checkpointing, restarting a stateful streaming application would require us to rebuild the state up to the point where the application previously stopped. In the case of a window operation, that rebuild might involve hours of data, which in turn would require substantial intermediate storage. The more challenging case is when we are implementing arbitrary stateful aggregation, as we saw in Chapter 22. Without checkpoints, even a simple stateful application, like counting the number of visitors to each page of a website, would need to reprocess all data ever seen to rebuild its state to a consistent level, a challenge that might range from very difficult to impossible, because the necessary data might no longer be available in the system.
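To make the page-visit example concrete, here is a minimal sketch in plain Python. It does not use the Spark Streaming API. It only illustrates the idea of periodically saving the stream position together with the accumulated state, so that a restart resumes from the last checkpoint instead of replaying every event. The function names, the JSON file format, and the `checkpoint_every` and `stop_after` parameters are all assumptions made for this illustration.

```python
import json
import os
import tempfile
from collections import Counter


def save_checkpoint(path, offset, counts):
    """Persist the stream position and the state built so far."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "counts": counts}, f)
    os.replace(tmp, path)  # rename into place so readers never see a partial file


def load_checkpoint(path):
    """Return (offset, counts), or an empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, Counter()
    with open(path) as f:
        data = json.load(f)
    return data["offset"], Counter(data["counts"])


def count_visits(events, path, checkpoint_every=3, stop_after=None):
    """Count visits per page, resuming from the last checkpoint if present.

    Returns (counts, number of events processed during this run).
    `stop_after` is a hypothetical knob that simulates a crash.
    """
    offset, counts = load_checkpoint(path)
    processed = 0
    for i in range(offset, len(events)):
        if stop_after is not None and processed == stop_after:
            break  # simulated failure: state since the last checkpoint is lost
        counts[events[i]] += 1
        processed += 1
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(path, i + 1, dict(counts))
    return counts, processed


events = ["/home", "/docs", "/home", "/pricing", "/home", "/docs", "/home"]
path = os.path.join(tempfile.mkdtemp(), "visits.ckpt")

# First run stops after 4 events; the last checkpoint covers only the first 3.
_, first = count_visits(events, path, stop_after=4)

# The restart resumes at offset 3 and reprocesses 4 events, not all 7.
counts, second = count_visits(events, path)
```

In this sketch the event processed after the last checkpoint is replayed on restart, but the result stays consistent because the counts are restored from the same checkpoint as the offset. The trade-off is visible in `checkpoint_every`: a smaller value means less replay after a failure, at the cost of more frequent writes.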

However, checkpoints are not free. The checkpoint operation imposes additional requirements on the streaming application, both in the storage needed to maintain the checkpoint data and in the impact that this recurring operation has on the performance of the application.

In this chapter, we discuss the necessary considerations to set up and use checkpointing ...
