Chapter 5. Detection Patterns

The first phase of fault tolerance is detection. Faults, and the errors that they cause, must be detected before any recovery or mitigating action can be taken to tolerate their presence in the system. Waiting and letting unknown latent faults activate and cause errors that result in failures is not fault tolerant.

The patterns in this chapter help detect the presence of errors or failures and the faults that caused them. They provide a number of mechanisms to monitor the system and to detect whether it is behaving erroneously. Two pairs of concepts drive detection at execution time: errors versus failures, and a priori knowledge versus comparison of redundant elements (see Figure 27).

A priori detection uses constraints that are known in advance about the system to determine whether some deviation from correct behavior exists. The range of results to be considered includes system states, results, and any side effects. If nothing is known in advance about the range of valid results, this method cannot work.
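As an illustration only, a priori detection can be sketched as a set of range checks applied to observed state. The names below (RangeConstraint, detect_errors, temperature_c, queue_depth) and the numeric limits are hypothetical and do not come from the book; the sketch assumes the valid range of each value is known before the system runs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """A constraint known in advance: the named value must lie in [low, high]."""
    name: str
    low: float
    high: float

    def violated_by(self, value: float) -> bool:
        # NaN fails every comparison, so it is reported as a violation too.
        return not (self.low <= value <= self.high)


def detect_errors(state: dict[str, float],
                  constraints: list[RangeConstraint]) -> list[str]:
    """Return the names of all constraints that the observed state violates.

    A value that is missing from the state counts as a violation, because
    nothing can be said about its correctness.
    """
    errors = []
    for constraint in constraints:
        if constraint.name not in state or constraint.violated_by(state[constraint.name]):
            errors.append(constraint.name)
    return errors


# Hypothetical limits, chosen only for this example.
CONSTRAINTS = [
    RangeConstraint("temperature_c", -40.0, 125.0),
    RangeConstraint("queue_depth", 0.0, 1000.0),
]

if __name__ == "__main__":
    observed = {"temperature_c": 300.0, "queue_depth": 5.0}
    print(detect_errors(observed, CONSTRAINTS))
```

The check can only be as good as the constraints supplied to it: a value with no known valid range cannot be covered this way.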

Much of the fault tolerant programming literature has focused on the second method, that of comparing redundant results. Redundancy (3) in Chapter 4 discussed introducing redundant elements into the system. Many purpose-built systems with custom hardware use redundant hardware to execute the same program and compare the results. The comparison might be done in real time by hardware matchers that look at internal, partial computation ...
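A software analogue of comparing redundant results might look like the following sketch. It is not the hardware matching described above: it simply runs the same computation on several redundant elements and reports whether their results agree. The function name detect_mismatch and the sample replicas are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence


def detect_mismatch(replicas: Sequence[Callable[[int], int]],
                    value: int) -> tuple[bool, list[int]]:
    """Run every redundant replica on the same input and compare the results.

    Returns (mismatch, results). mismatch is True when at least one replica
    disagrees with the others, which signals an error somewhere without
    saying which replica is at fault.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    results = [replica(value) for replica in replicas]
    mismatch = any(result != results[0] for result in results[1:])
    return mismatch, results


def healthy_square(x: int) -> int:
    """A replica that computes the intended result."""
    return x * x


def faulty_square(x: int) -> int:
    """A replica with an injected fault, used only to demonstrate detection."""
    return x * x + 1


if __name__ == "__main__":
    print(detect_mismatch([healthy_square, healthy_square], 4))
    print(detect_mismatch([healthy_square, faulty_square], 4))
```

Unlike an a priori check, this comparison needs no advance knowledge of what a correct result looks like, only that the redundant elements should agree with one another.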
