Chapter 9. Game Days

It is easy, and dangerous, to build recovery and disaster plans, shove them in a drawer, and ignore them until they are needed.

If you do that, it is almost guaranteed that by the time you need the plans, they will be incorrect or out of date. Stale plans also leave room for other problems to creep in, making the plans impractical or impossible to implement successfully.

For this reason, you should test your recovery and disaster plans on a regular basis. Regularly testing these plans, along with your other risk mitigations, should become part of your company culture.

One model for testing these plans is to run Game Days. A Game Day is an exercise in which you deliberately introduce a specific failure mode into your system and watch how your operators and engineers respond to it, including how they carry out any recovery or disaster plans. After the Game Day, a postmortem review uncovers issues with your plans and the changes that need to be made. Making these changes keeps your plans fresh, up to date, and ready to be used when a real problem occurs.
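To make the Game Day loop concrete, here is a minimal sketch of its three parts: introduce a failure, walk through the recovery plan, and collect findings for the postmortem. It is an illustration only, not a tool described in this book. The system is modeled as a plain dictionary, and every name in it (`PlanStep`, `run_game_day`, `kill_primary`, `page_dba`, `promote_replica`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PlanStep:
    """One documented step of a recovery plan (hypothetical structure)."""
    description: str
    action: Callable[[dict], bool]  # returns True if the step worked as documented


@dataclass
class GameDayResult:
    failure_mode: str
    recovered: bool
    findings: List[str] = field(default_factory=list)  # input for the postmortem


def run_game_day(system: dict, failure_mode: str,
                 inject: Callable[[dict], None],
                 plan: List[PlanStep]) -> GameDayResult:
    """Inject a failure, execute each plan step, and record what went wrong."""
    inject(system)
    findings = []
    for step in plan:
        try:
            ok = step.action(system)
        except Exception as exc:
            findings.append(f"step raised {exc!r}: {step.description}")
            continue
        if not ok:
            findings.append(f"step did not work as documented: {step.description}")
    return GameDayResult(failure_mode, system.get("healthy", False), findings)


# Toy scenario (all values are made up for illustration).
def kill_primary(system: dict) -> None:
    system["primary_up"] = False
    system["healthy"] = False


def page_dba(system: dict) -> bool:
    # Simulates an out-of-date step: the pager entry the plan refers to is gone.
    return "dba_pager" in system


def promote_replica(system: dict) -> bool:
    if not system.get("replica_up"):
        return False
    system["primary_up"] = True
    system["healthy"] = True
    return True


system = {"primary_up": True, "replica_up": True, "healthy": True}
plan = [
    PlanStep("Page the on-call DBA", page_dba),
    PlanStep("Promote the replica to primary", promote_replica),
]
result = run_game_day(system, "primary database down", kill_primary, plan)

for finding in result.findings:
    print(finding)
```

In this toy run the system recovers, but the exercise still surfaces a stale step in the plan. That finding is the kind of item a postmortem review would turn into a plan update.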

Staging Versus Production Environments

You might be wondering whether you should test recovery plans in a staging environment or against your live production application. This is a tough question without a simple answer. Let’s take a closer look at each of these options:

Staging/test environments

Testing recovery plans ...
