3 Imperfect Fault Coverage

Many systems, especially those used in life‐critical or mission‐critical applications such as aerospace, flight controls, nuclear plants, data storage systems, and communication systems, are fault‐tolerant systems (FTSs) [1, 2]. An FTS can continue to perform its function correctly even in the presence of software errors or hardware failures [3, 4]. Developing an FTS typically requires some form of redundancy together with an automatic reconfiguration and recovery mechanism that restores the system function when a component fails. The mechanism itself (involving fault detection, fault location, fault isolation, and fault recovery) is often not perfect; it can fail such that the system cannot adequately detect, locate, isolate, or recover from a component fault. Such an uncovered component fault may propagate through the system and cause the failure of the entire system or subsystem despite the presence of adequate redundancy. This behavior is referred to as imperfect coverage (IPC) [5–7]. Since 1969, the IPC concept has been widely recognized as a significant concern in the reliability field.
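To make the effect of imperfect coverage concrete, the following is a minimal numerical sketch that is not taken from the text. It assumes a redundant system of two independent, identical units, each with reliability r, and a single coverage factor c, understood here as the conditional probability that a unit fault is successfully detected, located, isolated, and recovered from. Under these assumptions, any uncovered fault brings the whole system down. The symbols r and c and the function names are illustrative choices, not notation from the source.

```python
def ideal_parallel_reliability(r: float) -> float:
    """Reliability of two independent redundant units with perfect coverage.

    The system works as long as at least one unit works.
    """
    return 1.0 - (1.0 - r) ** 2


def ipc_parallel_reliability(r: float, c: float) -> float:
    """Reliability of the same two-unit system under imperfect coverage.

    Assumed model (illustrative, not from the source): each unit works with
    probability r; a failed unit is covered with probability c; any uncovered
    failure causes the entire system to fail.
    """
    if not (0.0 <= r <= 1.0 and 0.0 <= c <= 1.0):
        raise ValueError("r and c must both lie in [0, 1]")
    both_work = r * r
    # Exactly one unit fails (two ways), the fault is covered,
    # and the surviving unit carries the system.
    one_fails_covered = 2.0 * r * (1.0 - r) * c
    return both_work + one_fails_covered


if __name__ == "__main__":
    r = 0.9  # hypothetical unit reliability
    for c in (1.0, 0.99, 0.9, 0.5, 0.0):
        print(f"c = {c:4.2f}  R_system = {ipc_parallel_reliability(r, c):.4f}")
```

With c = 1 the expression reduces to the ideal parallel formula. In this simplified model the redundant pair is more reliable than a single unit only when c exceeds 0.5, and at c = 0 it behaves like a series system with reliability r².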

Consider a hot standby server system with a primary server and a standby server. The standby server is switched online and takes over operation when the primary server malfunctions. Under ideal circumstances, the entire system functions correctly as long as at least one of the two servers functions correctly. However, in ...
