In large-scale distributed systems, failures and perturbations are expected events and not catastrophic exceptions. External intervention to restore normal operation or to perform a system configuration is difficult, and it will only get worse in the future. Therefore, means of recovery have to be built in.
Fault-tolerance techniques can be divided into two broad classes: masking and nonmasking. Certain types of applications call for masking type of tolerance, where the effect of the failure is completely invisible to the application; these include safety–critical systems, some real-time systems, and certain sensitive database applications in the financial world. For others, nonmasking tolerance ...