Chapter 12. The Experiment Selection Problem (and a Solution)
It is hard to imagine a large-scale, real-world system that does not involve the interaction of people and machines. When we design such a system, often the hardest (and most important) part is figuring out how best to use the two different kinds of resources. In this chapter, I make the case that the resiliency community should rethink how it leverages humans and computers as resources. Specifically, I argue that the problem of developing intuition about system failure modes using observability infrastructure, and ultimately discharging those intuitions in the form of chaos experiments, is a role better played by a computer than by a person. Finally, I provide some evidence that the community is ready to move in this direction.
Independent of (and complementary to) the methodologies discussed in the rest of the book is the problem of experiment selection: choosing which faults to inject into which system executions. As we have seen, choosing the right experiments can mean identifying bugs before our users do, as well as learning new things about the behavior of our distributed system at scale. Unfortunately, due to the inherent complexity of such systems, the number of possible distinct experiments that we could run is astronomical: it grows exponentially with the number of communicating instances. For example, suppose we wanted to exhaustively test the effect of every possible combination of ...