Chapter 7. Thinking About Resilience
Justin Li
In resilient systems, important variables stay in their desired state even when other variables leave their normal state. For example, many animals are able to avoid dying from minor cuts. When skin is cut, unprotected blood-carrying tissue is exposed, yet blood loss quickly trends back to zero as a clot forms. Improving a system’s resilience makes dependent variables describing that system more independent.
Networked systems are often required to respond quickly, a requirement expressed as a desired state like this: 99th percentile latency below one second. Ideally, this holds all the way to the required limits of the system, for instance, a peak request rate of 100,000 per second. We want to ensure that the latency variable isn’t too dependent on the request rate variable.
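To make the latency state concrete, here is a minimal sketch of checking it against a batch of measurements. It assumes the nearest-rank definition of a percentile and latencies recorded in seconds; the function names and the sample data are hypothetical, not taken from any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def meets_latency_target(latencies_s, limit_s=1.0, pct=99):
    """True if the pct-th percentile latency is below limit_s seconds."""
    return percentile(latencies_s, pct) < limit_s


# Hypothetical measurements: 990 fast requests and 10 slow ones.
latencies = [0.05] * 990 + [2.5] * 10
print(percentile(latencies, 99), meets_latency_target(latencies))
```

Evaluating the same check at several request rates, up to the required peak, would show how strongly the latency variable depends on the request rate variable.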
Here are ways we improve resilience:
- Load reduction
  - Throttling, load shedding/prioritization, queuing, load balancing
- Latency reduction
  - Caching, regional replication
- Load adaptation
  - Autoscaling, overprovisioning
- Resilience (specifically)
  - Timeouts, circuit breakers, bulkheads, retries, failovers, fallbacks
- Meta-techniques
  - Improving tooling, perhaps to scale up or fail over faster; especially impactful in cases when slow humans are in a system’s critical path
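As an illustration of two of the techniques listed, a circuit breaker combined with a fallback, here is a minimal sketch. The class, its parameters (`max_failures`, `reset_after_s`), and the helper `fetch_with_fallback` are hypothetical names chosen for this example, and real implementations usually add thread safety and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback_value):
    """Return fetch() via the breaker, or fallback_value on any failure."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback_value
```

While the circuit is open, callers fail fast instead of waiting on a struggling dependency, which keeps their own latency closer to independent of that dependency's state.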
Some of these tools are not usually associated with resilience (they are general optimization techniques), but ...