Chapter 3. Planning and Running a Manual Game Day

How many times have you heard comments like the following after a production incident?

“We were not prepared for that!”

“Dashboards were lighting up, we didn’t know where to look...”

“Alarms were going off, and we didn’t know why...”

The absolute worst time to try to learn about the weaknesses in your sociotechnical system is during an incident. There’s panic; there’s stress; there may even be anger. It’s hardly the time to be the person who suggests, “Shall we just step back a moment and ask ourselves how this all happened?” “It’s a little too late, it’s happening!” would be the reply, if you’re lucky and people are feeling inordinately polite.

Chaos engineering has the single goal of helping you collect evidence of system weaknesses before those weaknesses become incidents. While no guarantees can be made that you’ll ever find every one of the multitude of compound potential weaknesses in a system, or even just all the catastrophic ones, it is good engineering sense to be proactive about exploring your system’s weaknesses ahead of time.

So far you have hypotheses of how your system should respond in the event of turbulent conditions. The next step is to grab some tools and start breaking things in production, right? Wrong!

The cheapest way to get started with chaos engineering requires no tools. It requires your effort, your time, your team’s time, and ideally the time of anyone who has a stake in your system’s reliability. ...
