Chapter 17

Self-Stabilizing Systems

17.1 Introduction

In large-scale distributed systems, failures and perturbations are expected events rather than catastrophic exceptions. External intervention to restore normal operation or to reconfigure the system is difficult, and the difficulty will only increase. Recovery mechanisms therefore have to be built into the system itself.

Fault-tolerance techniques can be divided into two broad classes: masking and nonmasking. Certain types of applications call for masking tolerance, where the effect of a failure is completely invisible to the application; these include safety-critical systems, some real-time systems, and certain sensitive database applications in the financial world. For others, nonmasking tolerance ...

