Chapter 9. Game Days
A habit that is easy yet dangerous to fall into is to build recovery plans and disaster plans, and then shove them in a drawer and ignore them until they are needed.
If you do that, the plans will almost certainly be incorrect or out of date by the time you need them. Stale plans also invite other problems that can make them impractical or impossible to implement successfully.
For this reason, you should test your recovery/disaster plans on a regular basis. Regularly testing these plans and other risk mitigations should become part of your company culture.
One model for testing these plans is to run Game Days. On a Game Day, you deliberately inject a specific failure mode into your system and watch how your operators and engineers respond, including how they carry out any recovery/disaster plans. After the Game Day, a postmortem review uncovers issues with your plans and the changes that need to be made. Making those changes keeps your plans fresh, up to date, and ready to be used when a real problem occurs.
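The Game Day cycle described above (inject one failure, observe the response, review afterward) can be sketched as a small record-keeping helper. This is only an illustration: the `GameDay` class, its field names, and the example failure mode and plan steps are all hypothetical and are not taken from any particular tool or from this book.

```python
from dataclasses import dataclass, field


@dataclass
class GameDay:
    """Hypothetical record of one Game Day exercise."""
    failure_mode: str              # the single failure being injected
    plan_steps: list[str]          # steps from the written recovery plan
    observed_steps: list[str] = field(default_factory=list)  # what responders did

    def record(self, step: str) -> None:
        """Note an action the operators or engineers actually took."""
        self.observed_steps.append(step)

    def postmortem(self) -> dict[str, list[str]]:
        """Compare the written plan with what actually happened."""
        skipped = [s for s in self.plan_steps if s not in self.observed_steps]
        unplanned = [s for s in self.observed_steps if s not in self.plan_steps]
        return {"skipped": skipped, "unplanned": unplanned}


# Example values are made up for illustration.
game_day = GameDay(
    failure_mode="primary database unavailable",
    plan_steps=["page on-call", "promote replica", "update connection string"],
)
game_day.record("page on-call")
game_day.record("restart application servers")
game_day.record("promote replica")

findings = game_day.postmortem()
```

Steps that were skipped, and steps that responders took but the plan never mentioned, are both candidates for plan updates in the postmortem review.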
Staging Versus Production Environments
You might be wondering whether you should test recovery plans on a staging environment or on your live production application. This is a tough question and it does not have a simple answer. Let’s take a closer look at each of these options:
- Staging/test environments
Testing recovery plans ...