Chapter 7

System Recovery

7.1 Causes of System Failure

A critical requirement for most TP systems is that they be up all the time; in other words, that they be highly available. Such systems are often called “24 by 7” (or 24 × 7), since they are intended to run 24 hours per day, 7 days per week. To define this concept more carefully, we say that a system is available if it is running correctly and yielding the expected results. The availability of a system is the fraction of time that the system is available. Thus, a highly available system is one that, most of the time, is running correctly and yielding the expected results.
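The definition of availability as a fraction of time can be illustrated with a small calculation. The sketch below assumes availability is measured over a fixed observation period; the function name `availability` and the sample figures (a 30-day month with 3 hours of unavailability) are hypothetical and chosen only for illustration.

```python
def availability(available_hours: float, total_hours: float) -> float:
    """Return the fraction of the observation period during which
    the system was running correctly and yielding expected results."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= available_hours <= total_hours:
        raise ValueError("available_hours must lie between 0 and total_hours")
    return available_hours / total_hours


# Hypothetical example: a 24 x 7 system observed for 30 days,
# during which it was unavailable for a total of 3 hours.
total = 30 * 24            # 720 hours in the observation period
available = total - 3      # 717 hours of correct operation
print(f"{availability(available, total):.4%}")  # prints 99.5833%
```

Under these assumed figures, even three hours of unavailability in a month leaves the system available well over 99 percent of the time, which shows why availability targets for such systems are usually stated as fractions very close to 1.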

Availability is reduced by two factors. One is the rate at which the system fails. By fails, we mean the system gives the wrong

