Chaos Engineering

by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri

August 2017

Intermediate to advanced

71 pages

1h 22m

English

O'Reilly Media, Inc.

How Does Chaos Engineering Differ from Testing?
It’s Not Just for Netflix
Prerequisites for Chaos Engineering

Understanding Complex Systems
Example of Systemic Complexity
Takeaway from the Example

Experimentation
Advanced Principles

Characterizing Steady State
Forming Hypotheses

State and Services
Input in Production
Other People’s Systems
Agents Making Changes
External Validity
Poor Excuses for Not Practicing Chaos
I’m pretty sure it will break!
If it does break, we’re in big trouble!
Get as Close as You Can

Automatically Executing Experiments
Automatically Creating Experiments

1. Pick a Hypothesis
2. Choose the Scope of the Experiment
3. Identify the Metrics You’re Going to Watch
4. Notify the Organization
5. Run the Experiment
6. Analyze the Results
7. Increase the Scope
8. Automate

Sophistication
Adoption
Draw the Map

Resources

Content preview from Chaos Engineering

Chapter 5. Run Experiments in Production

In our field, the idea of doing software verification in a production environment is generally met with derision. “We’ll test it in prod” is a form of gallows humor, which translates to “we aren’t going to bother verifying this code properly before we deploy it.”

A commonly held tenet of classical testing is that it’s better to identify bugs as far away from production as possible. For example, it’s better to identify a bug in a unit test than in an integration test. The reasoning is that the farther away you are from a full deployment in the production environment, the easier it will be to identify the reason for the bug and fix it. If you’ve ever had to debug a failed unit test, a failed integration test, and a bug that manifested only in production, the wisdom in this approach is self-evident.

When it comes to Chaos Engineering, the strategy is reversed: you want to run your experiments as close to the production environment as possible. The ideal implementation runs all experiments directly in the production environment.

When we do traditional software testing, we’re verifying code correctness. We have a good sense of how functions and methods are supposed to behave, and we write tests to verify the behavior of these components.
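As a minimal sketch of what verifying code correctness looks like, consider the unit test below. The `apply_discount` function is hypothetical and does not come from the book; its name and behavior are assumptions chosen only to show a test that pins down how a single function is supposed to behave.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, rounded down to whole cents.

    Hypothetical function, used only for illustration.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


def test_apply_discount() -> None:
    # Each assertion checks one behavior we already expect from the function.
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 50) == 499  # integer division rounds down

    # Invalid input should be rejected rather than silently accepted.
    try:
        apply_discount(1000, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")


test_apply_discount()
```

A test like this exercises one component in isolation, far from production, with inputs and expected outputs that we know in advance.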

When we run Chaos Engineering experiments, we are interested in the behavior of the overall system. The code is an important part of the system, but there’s a lot more to our system than just code. ...