Chapter 9. Chaos and Operations
If chaos engineering were just about surfacing evidence of system weaknesses through Game Days and automated chaos experiments, then life would be less complicated. Less complicated, but also much less safe!
In the case of Game Days, you can achieve much of this safety by executing the Game Day against a sandbox environment and by ensuring that everyone (participants, observers, and external parties) is aware the Game Day is happening.1
The challenge is harder with automated chaos experiments. Automated experiments could be executed by anyone, at any time, and possibly against any system.2 There are two main categories of operational concern when it comes to your automated chaos experiments (Figure 9-1):
- Control: You or other members of your team may want to seize control of a running experiment. For example, you may want to shut it down immediately, or you may just want to be asked whether a particularly dangerous step in the experiment should be executed or skipped.
- Observation: You want your experiment to be debuggable as it runs in production. You should be able to see which experiments are currently running and what step each has just executed, and then trace that back to how other elements of your system are executing in parallel.
Figure 9-1. The control and observation operational concerns of a running automated chaos experiment
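To make the two concerns concrete, here is a minimal sketch of an experiment runner that exposes both. It is not the API of any particular chaos tool: the names `Step`, `ExperimentRun`, `confirm`, `stop_requested`, and `events` are hypothetical and chosen only for illustration. Control appears as a stop flag checked before every step and a callback that is asked about dangerous steps; observation appears as an ordered trace of what the run has just done.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One step of a hypothetical experiment."""
    name: str
    action: Callable[[], None]
    dangerous: bool = False


class ExperimentRun:
    """Illustrative runner; every name here is an assumption, not a real tool's API."""

    def __init__(self, title: str, steps: List[Step],
                 confirm: Optional[Callable[[Step], bool]] = None):
        self.title = title
        self.steps = steps
        # Control: asked before each dangerous step; the default answer is "skip".
        self.confirm = confirm or (lambda step: False)
        # Control: setting this event stops the run before its next step.
        self.stop_requested = threading.Event()
        # Observation: ordered trace of what the run has done so far.
        self.events: List[str] = []

    def _emit(self, message: str) -> None:
        self.events.append(f"{self.title}: {message}")

    def run(self) -> str:
        self._emit("started")
        for step in self.steps:
            if self.stop_requested.is_set():
                self._emit(f"aborted before '{step.name}'")
                return "aborted"
            if step.dangerous and not self.confirm(step):
                self._emit(f"skipped '{step.name}'")
                continue
            step.action()
            self._emit(f"executed '{step.name}'")
        self._emit("completed")
        return "completed"


# Example: the dangerous step is skipped because nobody approved it.
demo = ExperimentRun(
    "kill-one-replica",
    [
        Step("check steady state", lambda: None),
        Step("terminate instance", lambda: None, dangerous=True),
        Step("check steady state again", lambda: None),
    ],
)
demo.run()
for line in demo.events:
    print(line)
```

In a real setup the `confirm` callback might prompt an operator in a chat channel, and the `events` trace would more likely be shipped to your central logging or tracing system so that it can be correlated with the rest of the system's activity. Both are left as plain in-process objects here to keep the sketch self-contained.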
There ...