Chapter 12

Fault-Tolerant Systems

12.1 Introduction

A fault is the manifestation of an unexpected behavior, and fault tolerance is a mechanism that masks or restores the expected behavior of a system following the occurrence of faults. Attention to fault tolerance, or dependability, has increased sharply in recent years because we increasingly depend on computers to perform critical as well as noncritical tasks. The growing scale of such systems also contributes indirectly to the rising number of faults. Advances in hardware engineering can make individual components more dependable, but they cannot eliminate faults altogether. Poor system design and behavioral patterns such as mobility can also contribute to failures.

Historically, ...
