Now that we’ve covered the principles, let’s talk about the nitty-gritty of designing your Chaos Engineering experiments. Here’s an overview of the process:
Pick a hypothesis
Choose the scope of the experiment
Identify the metrics you’re going to watch
Notify the organization
Run the experiment
Analyze the results
Increase the scope
The first thing you need to do is decide what hypothesis you’re going to test, which we covered in Chapter 4. Perhaps you recently had an outage that was triggered by timeouts when accessing one of your Redis caches, and you want to ensure that your system isn’t vulnerable to timeouts in any of its other caches. Or perhaps you’d like to verify that your active-passive database configuration fails over cleanly when the primary database server encounters a problem.
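To make the cache-timeout hypothesis concrete, here is a minimal sketch of how it could be stated and checked in code. Everything in it is an illustrative assumption rather than part of any real system: `FlakyCache` is a hypothetical in-memory stand-in for a cache client such as Redis, `read_profile` is a hypothetical request handler, and the "database" is a plain dictionary. The hypothesis under test is that requests keep succeeding even when every cache lookup times out.

```python
import random


class CacheTimeout(Exception):
    """Raised by the stand-in cache when a lookup times out."""


class FlakyCache:
    """Hypothetical in-memory stand-in for a cache client such as Redis.

    Each lookup times out with probability `timeout_rate`.
    """

    def __init__(self, data, timeout_rate, seed=0):
        self._data = dict(data)
        self._timeout_rate = timeout_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def get(self, key):
        if self._rng.random() < self._timeout_rate:
            raise CacheTimeout(key)
        return self._data.get(key)


def read_profile(user_id, cache, database):
    """Read through the cache, falling back to the database on a timeout."""
    try:
        cached = cache.get(user_id)
    except CacheTimeout:
        cached = None  # treat a timeout like a cache miss
    if cached is not None:
        return cached
    return database[user_id]


def run_experiment(timeout_rate, requests=1000):
    """Return the fraction of requests that succeed under injected timeouts."""
    database = {i: f"profile-{i}" for i in range(10)}
    cache = FlakyCache(database, timeout_rate)
    successes = 0
    for n in range(requests):
        user_id = n % 10
        try:
            if read_profile(user_id, cache, database) == f"profile-{user_id}":
                successes += 1
        except Exception:
            pass  # a failed request counts against the hypothesis
    return successes / requests


# Hypothesis: the success rate matches the baseline even when every
# cache lookup times out.
baseline = run_experiment(timeout_rate=0.0)
under_fault = run_experiment(timeout_rate=1.0)
print(f"baseline={baseline:.2%} under_fault={under_fault:.2%}")
```

If `read_profile` let the timeout propagate instead of catching it, the success rate under fault would drop below the baseline and the hypothesis would be refuted. In a real experiment the fault would be injected into a running system, not a stand-in, and the success rate would come from production metrics.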
Don’t forget that your system includes the humans who are involved in maintaining it. Human behavior is critical in mitigating outages. Consider an organization that uses a messaging app such as Slack or HipChat to communicate during an incident. The organization may have a contingency plan for handling an incident when the messaging app itself is down, but how well do the on-call engineers know that plan? Running a chaos experiment is a great way to find out.
Once you’ve chosen what hypothesis you want to test, the next thing ...