Chapter 40. Test Your Infrastructure with Game Days

Fernando Duran

“You don’t have a backup until you have performed a restore” is a good aphorism. In the same way, your service or infrastructure is not fully resilient until you have tried breaking it and recovering.

A game day is a planned rehearsal exercise in which a team tries to recover from an incident. It tests your readiness and reliability in the face of an emergency in a production environment.

The motivation is for the teams and the code to be ready when incidents occur, so the test incident should resemble a real-life incident. Running these experiments in production environments and in an automated way is called chaos engineering.

There are several items to consider when preparing for a game day. Most importantly, you need to decide whether you are running the exercise in production. This is the ideal, since no staging or test environment is ever really going to be the same as production. On the other hand, you have to comply with your SLAs, obtain approval, and warn customers if needed. If you have never done a game day, or if the target system has never been tested for disruption, then start with a test environment.
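The decision about where to run the exercise can be written down as a small checklist. The following is only a minimal sketch of the reasoning in the previous paragraph, not a prescribed procedure: the `GameDayContext` class, its field names, and the `choose_environment` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class GameDayContext:
    """Hypothetical checklist of facts gathered before a game day."""
    team_has_run_game_day: bool          # has the team done a game day before?
    system_tested_for_disruption: bool   # has the target system been disrupted before?
    production_approved: bool            # has approval for a production run been obtained?
    sla_compliant: bool                  # can the exercise run without breaching SLAs?
    customers_warned_if_needed: bool     # have customers been warned, where required?


def choose_environment(ctx: GameDayContext) -> str:
    """Return "production" or "test" following the guidance in the text."""
    # First-time teams or never-disrupted systems start in a test environment.
    if not ctx.team_has_run_game_day or not ctx.system_tested_for_disruption:
        return "test"
    # Production is the ideal, but only when every precondition is satisfied.
    if ctx.production_approved and ctx.sla_compliant and ctx.customers_warned_if_needed:
        return "production"
    # Assumption: fall back to a test environment if any precondition is missing.
    return "test"
```

For example, an experienced team with approval, SLA headroom, and customers informed would get "production", while a team running its first game day would get "test" regardless of the other answers.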

Another decision is whether you want the procedure to be planned and triggered by an adversarial red team—in this ...
