Chapter 2. Practicing Incident Response Readiness (Preparedness)
We have covered the stages of managing an incident and the incident management lifecycle. Now let’s discuss how to practice incident management so that you are ready when a real incident strikes.
Disaster Role-Playing and Incident Response Exercises
Testing and practicing incident response readiness increases resilience. We recommend running disaster role-playing exercises with your team to train for incident response. At Google, we often refer to this as Wheel of Misfortune.1 One way to do this is to re-create scenarios from real production incidents you encountered in the past.
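As a minimal sketch of how a team might pick a past incident to replay, the snippet below draws a scenario at random and prints a briefing that hides the root cause from the person in the hot seat. The `Scenario` fields, the function names, and the sample scenario are all hypothetical and chosen for illustration; they do not describe any tooling used at Google.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A past incident rewritten as a role-playing exercise (hypothetical shape)."""
    title: str
    symptoms: str     # what the responder is told up front
    root_cause: str   # known only to the facilitator


def pick_scenario(scenarios, rng=None):
    """Spin the wheel: return one scenario chosen uniformly at random."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    if rng is None:
        rng = random.Random()
    return rng.choice(list(scenarios))


def briefing(scenario):
    """Text handed to the responder; deliberately omits the root cause."""
    return f"Scenario: {scenario.title}\nSymptoms: {scenario.symptoms}"


if __name__ == "__main__":
    # Illustrative placeholder entry, not a real incident.
    past_incidents = [
        Scenario(
            title="Elevated error rate after a config push",
            symptoms="HTTP 500s climbing in one region",
            root_cause="Malformed config rolled out without a canary",
        ),
    ]
    print(briefing(pick_scenario(past_incidents)))
```

Keeping the root cause in the record but out of the briefing lets the facilitator answer questions as the exercise unfolds without giving away the ending.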
There are tangible benefits to running regular incident response exercises. In the early days of Google’s Disaster Resilience Testing (DiRT) program, some tests were deemed too risky to run. Over the years, by focusing on the areas exposed by those too-risky-to-run tests, many of these risks have been addressed so thoroughly that the tests are now automated and considered uninteresting.
Getting to that point wasn’t immediate or painless. It took time and a lot of effort from several teams, but we’ve been able to reduce significant risks in the global system to “just another automated test that runs periodically.”2
Regular Testing
There are tangible benefits to regular testing. For years, Google has been running DiRT tests to find and remediate problems with our production systems. As teams ...