Chapter 17

Recovery

17.1 Introduction

In this chapter, we study methods for fault tolerance using checkpointing. A checkpoint can be local to a process or global in the system. A global checkpoint is simply a global state that is stored on the stable storage so that in the event of a failure the entire system can be rolled back to the global checkpoint and restarted. To record a global state, one could employ methods presented in Chapter 9. These methods, called coordinated checkpointing, can be efficiently implemented. However, there are two major disadvantages of using coordinated checkpoints:

1. There is the overhead of computing a global snapshot. When a coordinated checkpoint is taken, processes are forced to take their local checkpoints whenever the algorithm for coordinated checkpoint requires it. It is better for this decision to be local because then a process is free to take its local checkpoint whenever it is idle or the size of its state is small.

2. In case of a failure, the entire system is required to roll back. In particular, even those processes that never communicated with the process that failed are also required to roll back. This results in wasted computation and slow recovery.

An alternative method is to let processes take their local checkpoints at their will. During a failure-free mode of computation, this will result in an overhead on computation lower than that for coordinated checkpointing. In case of a failure, a suitable set of local checkpoints is ...

Get Concurrent and Distributed Computing in Java now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.