Chapter 13. Designing for fault-tolerance
This chapter covers
- What fault-tolerance is and why you need it
- Using redundancy to remove single points of failure
- Retrying on failure
- Using idempotent operations to make retry on failure safe
- AWS service guarantees
Failure is inevitable for hard disks, networks, power, and so on. Fault-tolerance deals with that problem. A fault-tolerant system is built for failure. If a failure occurs, the system isn’t interrupted, and it continues to handle requests. If your system has a single point of failure, it’s not fault-tolerant. You can achieve fault-tolerance by introducing redundancy into your system and by decoupling the parts of your system in such a way that one side doesn’t rely on the uptime of ...