Chapter 6. Fault Tolerance
Such a strategist was the king that he had a contingency plan for his contingency plan, and even, if circumstances required, a contingency plan for his contingency plan’s contingency plan.
Frank Beddor, Seeing Redd
The Principle: Anticipate and address the potential compromise and failure of system elements and security controls.
Key Question: What happens if this fails?
Related Concepts: Resilience, Failsafe Defaults, Defense in Depth, Revocability, Incident Response, Business Continuity and Disaster Recovery, Murphy’s Law
Fault Tolerance is the Principle of operating on the assumption that systems, processes, and people will fail, and then taking steps to address those contingencies. By anticipating failure, Fault Tolerance keeps the practitioner one step ahead, with backup plans ready, so that mission-critical functions continue even when individual elements fail.
Failures are a part of life. You should enter each scenario asking, “What happens when this fails?” and “What can I do about it?” Modern information systems operate in extremely volatile environments and, as such, must treat some amount of failure as inevitable. But that doesn’t mean those failures need to be crippling, or even more than a minor inconvenience. Fault Tolerance is about building the mindset of always preparing for the worst, so that even when failures occur, the critical work keeps getting done.
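To make the two questions concrete, here is a minimal sketch in Python of one way to pair each operation with a backup plan. It is an illustration under stated assumptions, not a prescribed implementation: the helper `first_success` and the functions `read_from_primary` and `read_from_replica` are hypothetical names invented for this example, and the simulated outage stands in for whatever failure a real system might encounter.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_success(plans: Sequence[Callable[[], T]], safe_default: T) -> T:
    """Try each plan in order and return the first result.

    If every plan fails, return safe_default so the caller still
    gets a usable (if degraded) answer instead of a crash.
    """
    for plan in plans:
        try:
            return plan()
        except Exception as exc:
            # "What happens when this fails?" We note it and move on.
            print(f"{plan.__name__} failed: {exc}")
    # "What can I do about it?" Fall back to a known-safe value.
    return safe_default


# Hypothetical data sources, assumed purely for illustration.
def read_from_primary() -> str:
    raise ConnectionError("primary unreachable")  # simulated outage


def read_from_replica() -> str:
    return "record from replica"


result = first_success(
    [read_from_primary, read_from_replica],
    safe_default="service degraded",
)
print(result)
```

The ordering of the list is the contingency plan: the primary is tried first, the replica second, and the safe default is the plan for when both plans fail. A real system would also need to decide which exceptions are worth catching and how failures are reported, which this sketch deliberately leaves out.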