Chaos Engineering

by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri

August 2017

Intermediate to advanced

71 pages

1h 22m

English

O'Reilly Media, Inc.

How Does Chaos Engineering Differ from Testing?
It’s Not Just for Netflix
Prerequisites for Chaos Engineering
Understanding Complex Systems
Example of Systemic Complexity
Takeaway from the Example
Experimentation
Advanced Principles
Characterizing Steady State
Forming Hypotheses
State and Services
Input in Production
Other People’s Systems
Agents Making Changes
External Validity
Poor Excuses for Not Practicing Chaos
I’m pretty sure it will break!
If it does break, we’re in big trouble!
Get as Close as You Can
Automatically Executing Experiments
Automatically Creating Experiments

1. Pick a Hypothesis
2. Choose the Scope of the Experiment
3. Identify the Metrics You’re Going to Watch
4. Notify the Organization
5. Run the Experiment
6. Analyze the Results
7. Increase the Scope
8. Automate

Sophistication
Adoption
Draw the Map
Resources

Content preview from Chaos Engineering

Chapter 4. Vary Real-World Events

Every system, from simple to complex, is subject to unpredictable events and conditions if it runs long enough. Examples include increases in load, hardware malfunctions, deployments of faulty software, and the introduction of invalid data (sometimes known as poison data). We cannot exhaustively enumerate every event or condition we might want to consider, but the common ones fall into the following categories:

Hardware failures
Functional bugs
State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
Network latency and partition
Large fluctuations in input (up or down) and retry storms
Resource exhaustion
Unusual or unpredictable combinations of interservice communication
Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
Race conditions
Downstream dependency malfunctions

Perhaps most interesting are combinations of the events listed above that together cause adverse systemic behaviors.
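To make the point about combinations concrete, here is a minimal sketch that counts how many distinct combinations the ten categories above produce. The short labels and the `event_combinations` helper are our own illustrative names, not something from the book, and real events within a category vary far more than a single label suggests.

```python
from itertools import combinations

# Short labels for the ten categories listed above (wording is ours).
EVENT_CATEGORIES = [
    "hardware failure",
    "functional bug",
    "state transmission error",
    "network latency or partition",
    "input fluctuation or retry storm",
    "resource exhaustion",
    "unusual interservice communication",
    "Byzantine failure",
    "race condition",
    "downstream dependency malfunction",
]


def event_combinations(categories, size=2):
    """Return every unordered combination of `size` distinct categories."""
    return list(combinations(categories, size))


pairs = event_combinations(EVENT_CATEGORIES)
triples = event_combinations(EVENT_CATEGORIES, size=3)
print(len(pairs), len(triples))  # 45 pairs and 120 triples
```

Even at this coarse level the count grows quickly with the combination size, which is one way to see why exhaustive enumeration is not practical.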

It is not possible to prevent threats to availability, but it is possible to mitigate them. In deciding which events to induce, estimate the frequency and impact of the events and weigh them against the costs and complexity. At Netflix, we turn off machines because instance termination happens frequently in the wild and the act of turning off a server is cheap and easy. We simulate regional failures even though to do so is costly and complex, because ...
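One way to picture the trade-off between frequency, impact, and cost is a simple scoring heuristic. The sketch below is an assumption on our part, not the method used at Netflix: the `CandidateEvent` type, the `priority` formula, and every number in it are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEvent:
    name: str
    frequency: float  # estimated occurrences per year (made-up values)
    impact: float     # estimated severity, 1 (minor) to 10 (severe)
    cost: float       # estimated cost and complexity to induce, 1 to 10


def priority(event):
    """Hypothetical heuristic: frequency times impact, divided by cost."""
    return event.frequency * event.impact / event.cost


def rank(events):
    """Order candidate events from highest to lowest priority score."""
    return sorted(events, key=priority, reverse=True)


# Placeholder estimates, not measurements from any real system.
candidates = [
    CandidateEvent("instance termination", frequency=50, impact=3, cost=1),
    CandidateEvent("regional failure", frequency=0.5, impact=10, cost=9),
    CandidateEvent("dependency latency spike", frequency=12, impact=5, cost=3),
]

for event in rank(candidates):
    print(f"{event.name}: {priority(event):.2f}")
```

A single ratio like this is a crude tool. With these placeholder numbers the rare, expensive event scores lowest, yet a team may still decide that such an experiment is worth running, so a score is better treated as an input to the discussion than as the decision itself.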