Chapter 7. Fault Tolerance
Perfectionism is a common trait among developers. We strive to build software that is resilient against failure, and often we don’t want to admit that failures are a fact of life. The reality is that no matter how much we strive for perfection, there will always be elements that are out of our control. Even in the parts we can control, bugs will creep into the system. No system is perfect. We try to anticipate and predict every possible failure, but the truth is that the possibilities are endless. Even if we could build our software to be perfect, we must plan for hardware failures. And if we do plan for hardware failures, we must consider how network partitions affect our system. What happens when a hurricane wipes out our datacenter? Even when every aspect of our software and hardware is behaving perfectly, an external dependency outside of our control can still fail.
A more productive approach is to accept that failures will occur. Instead of trying to handle every possible case, would it not be better to ensure that we can recover properly when the unexpected occurs? This will put us in a stronger position overall. Rather than trying to see the future and build a perfect system, we should try to build a system that is smart enough to deal with the unexpected. Such a system recognizes that failure is a fact of life and embraces it rather than trying to ignore it.
Handling failure is, in many ways, another aspect ...