Chapter 17

Self-Stabilizing Systems

17.1 Introduction

In large-scale distributed systems, failures and perturbations are expected events rather than catastrophic exceptions. External intervention to restore normal operation or to reconfigure the system is difficult, and this difficulty will only increase in the future. Therefore, the means of recovery have to be built into the system itself.

Fault-tolerance techniques can be divided into two broad classes: masking and nonmasking. Certain types of applications call for masking tolerance, where the effect of a failure is completely invisible to the application; these include safety-critical systems, some real-time systems, and certain sensitive database applications in the financial world. For others, nonmasking tolerance ...
