Chapter 7. Minimize Blast Radius
On April 26, 1986, one of the worst nuclear accidents in human history occurred at the Chernobyl nuclear power plant in Ukraine. Ironically, the disaster was triggered by a resiliency exercise: an experimental attempt to verify a redundant power source for coolant pumps. While most of us don’t work on projects where the stakes are as high as a nuclear power plant coolant system, each chaos experiment has the potential to cause a production outage. The professional responsibility of the chaos engineer is to understand and mitigate production risks. A well-designed system for experimentation will prevent big production outages by causing only a few customers a small amount of pain.
Unfortunately, we have run experiments that were supposed to impact only a small percentage of users, but cascading failures ended up affecting more users than intended. In these cases, we had to perform an emergency stop of the experiment. While we never want to cause such a situation, the ability to contain and halt the experiment prevented a much larger crisis. In many ways, our experiments are looking for the unknown and unforeseen repercussions of failure, so the trick is to shed light on these vulnerabilities without accidentally blowing everything up. We call this “minimizing the blast radius.”
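To make the two controls described above concrete (limiting an experiment to a small percentage of users, and stopping it when the damage spreads), here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular platform: the names `in_experiment`, `Experiment`, `should_inject`, `record`, and `halt` are hypothetical, as are the thresholds. Users are bucketed by a hash of their ID so the same user is always in or out of the experiment, and outcomes are recorded for all traffic, not just the experiment group, so that a cascading failure outside the group can also trip the stop.

```python
import hashlib


def in_experiment(user_id: str, percent: float, salt: str = "exp-1") -> bool:
    """Deterministically place roughly `percent` percent of users in the experiment.

    Hypothetical helper: hashes the user ID into one of 10,000 buckets.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


class Experiment:
    """Hypothetical experiment wrapper with an automatic emergency stop."""

    def __init__(self, percent: float, max_error_rate: float, min_requests: int = 100):
        self.percent = percent                # intended blast radius, in percent of users
        self.max_error_rate = max_error_rate  # abort threshold across ALL traffic
        self.min_requests = min_requests      # do not judge on too small a sample
        self.requests = 0
        self.errors = 0
        self.halted = False

    def should_inject(self, user_id: str) -> bool:
        """Inject failure only for users in the experiment group, and never after a halt."""
        return not self.halted and in_experiment(user_id, self.percent)

    def record(self, ok: bool) -> None:
        """Record the outcome of any request, inside or outside the experiment group."""
        self.requests += 1
        if not ok:
            self.errors += 1
        if (self.requests >= self.min_requests
                and self.errors / self.requests > self.max_error_rate):
            self.halt()

    def halt(self) -> None:
        """Emergency stop: no further failures are injected."""
        self.halted = True


if __name__ == "__main__":
    exp = Experiment(percent=1.0, max_error_rate=0.05)
    group = sum(exp.should_inject(f"user-{i}") for i in range(10_000))
    print(f"{group} of 10000 users would see injected failure")
```

The choice to measure the error rate over all traffic is deliberate in this sketch: an experiment scoped to 1% of users can still degrade a shared dependency, and a monitor that watches only the experiment group would miss exactly that kind of spread.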
The most confidence-inspiring and risky experiments impact all production traffic. Chaos experiments should take careful, measured risks that build upon each ...