Chapter 5. Run Experiments in Production

In our field, the idea of doing software verification in a production environment is generally met with derision. “We’ll test it in prod” is a form of gallows humor, which translates to “we aren’t going to bother verifying this code properly before we deploy it.”

A commonly held tenet of classical testing is that it’s better to identify bugs as far away from production as possible. For example, it’s better to identify a bug in a unit test than in an integration test. The reasoning is that the farther away you are from a full deployment in the production environment, the easier it will be to identify the reason for the bug and fix it. If you’ve ever had to debug a failed unit test, a failed integration test, and a bug that manifested only in production, the wisdom in this approach is self-evident.

When it comes to Chaos Engineering, the strategy is reversed: you want to run your experiments as close to the production environment as possible. The ideal implementation runs all experiments directly in the production environment.

When we do traditional software testing, we’re verifying code correctness. We have a good sense of how functions and methods are supposed to behave, and we write tests to verify the behavior of these components.
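To make that contrast concrete, here is a minimal sketch of the kind of test traditional verification relies on. The function `apply_discount` and its rounding rule are hypothetical, invented purely for illustration. The point is only that the expected behavior is known in advance and can be asserted directly.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: discount a price, rounding down to whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        # We know exactly what the function should return for this input.
        self.assertEqual(apply_discount(2000, 25), 1500)

    def test_rejects_invalid_percent(self):
        # We also know exactly how it should fail.
        with self.assertRaises(ValueError):
            apply_discount(2000, 150)


if __name__ == "__main__":
    unittest.main()
```

A test like this runs in isolation, far from production, and a failure points at a small, well-understood piece of code.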

When we run Chaos Engineering experiments, we are interested in the behavior of the overall system. The code is an important part of the system, but there’s a lot more to our system than just code. ...
