Chapter 8. Designing Experiments

Now that we’ve covered the principles, let’s talk about the nitty-gritty of designing your Chaos Engineering experiments. Here’s an overview of the process:

  1. Pick a hypothesis

  2. Choose the scope of the experiment

  3. Identify the metrics you’re going to watch

  4. Notify the organization

  5. Run the experiment

  6. Analyze the results

  7. Increase the scope

  8. Automate

1. Pick a Hypothesis

The first thing you need to do is decide what hypothesis you’re going to test, a topic we covered in Chapter 4. Perhaps you recently had an outage that was triggered by timeouts when accessing one of your Redis caches, and you want to ensure that your system isn’t vulnerable to timeouts in any of its other caches. Or perhaps you’d like to verify that your active-passive database configuration fails over cleanly when the primary database server encounters a problem.
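To make the cache-timeout hypothesis concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a real Redis client: `FlakyCache`, `fetch`, `load_from_database`, and `hypothesis_holds` are hypothetical names, and the timeout is simulated by comparing an injected latency against the client's timeout rather than by actually waiting. The hypothesis it encodes is "a read still returns the correct value when the cache times out."

```python
from dataclasses import dataclass


class CacheTimeout(Exception):
    """Raised when the simulated cache cannot answer within the client timeout."""


@dataclass
class FlakyCache:
    """Hypothetical stand-in for a cache client (not a real Redis API)."""

    injected_latency_s: float = 0.0  # the fault we inject in the experiment

    def get(self, key, timeout_s):
        # Simulate a timeout without sleeping: if the injected latency
        # exceeds what the caller is willing to wait, the call fails.
        if self.injected_latency_s > timeout_s:
            raise CacheTimeout(key)
        return None  # always a miss, to keep the sketch small


def load_from_database(key):
    """Hypothetical source of truth behind the cache."""
    return f"value-for-{key}"


def fetch(key, cache, timeout_s=0.05):
    """Read through the cache, treating a timeout like a cache miss."""
    try:
        cached = cache.get(key, timeout_s)
    except CacheTimeout:
        cached = None  # degrade gracefully instead of failing the request
    return cached if cached is not None else load_from_database(key)


def hypothesis_holds(cache):
    """Hypothesis: reads still return the right value when the cache times out."""
    try:
        return fetch("user:42", cache) == "value-for-user:42"
    except Exception:
        return False
```

Calling `hypothesis_holds(FlakyCache(injected_latency_s=1.0))` exercises the fault path. If `fetch` did not catch `CacheTimeout`, the same call would report that the hypothesis fails, which is exactly the kind of vulnerability the experiment is meant to surface.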

Don’t forget that your system includes the humans who maintain it. Human behavior is critical in mitigating outages. Consider an organization that uses a messaging app such as Slack or HipChat to communicate during an incident. The organization may have a contingency plan for when the messaging app itself is down during an outage, but how well do the on-call engineers know that plan? Running a chaos experiment is a great way to find out.

2. Choose the Scope of the Experiment

Once you’ve chosen what hypothesis you want to test, the next thing ...
