Chaos Engineering

by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri

August 2017

Intermediate to advanced

71 pages

1h 22m

English

O'Reilly Media, Inc.

How Does Chaos Engineering Differ from Testing?
It’s Not Just for Netflix
Prerequisites for Chaos Engineering

Understanding Complex Systems
Example of Systemic Complexity
Takeaway from the Example

Experimentation
Advanced Principles

Characterizing Steady State
Forming Hypotheses

State and Services
Input in Production
Other People’s Systems
Agents Making Changes
External Validity
Poor Excuses for Not Practicing Chaos
I’m pretty sure it will break!
If it does break, we’re in big trouble!
Get as Close as You Can

Automatically Executing Experiments
Automatically Creating Experiments

1. Pick a Hypothesis
2. Choose the Scope of the Experiment
3. Identify the Metrics You’re Going to Watch
4. Notify the Organization
5. Run the Experiment
6. Analyze the Results
7. Increase the Scope
8. Automate

Sophistication
Adoption
Draw the Map

Resources

Content preview from Chaos Engineering

Part I. Introduction

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Principles of Chaos

If you’ve ever run a distributed system in production, you know that unpredictable events are bound to happen. Distributed systems contain so many interacting components that the number of things that can go wrong is enormous. Hard disks can fail, the network can go down, a sudden surge in customer traffic can overload a functional component—the list goes on. All too often, these events trigger outages, poor performance, and other undesirable behaviors.

We’ll never be able to prevent all possible failure modes, but we can identify many of the weaknesses in our system before they are triggered by these events. When we do, we can fix them, preventing those future outages from ever happening. We can make the system more resilient and build confidence in it.

Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems, and builds confidence in the operational behavior of those systems.

Using Chaos Engineering may be as simple as manually running kill -9 on a box inside your staging environment to simulate the failure of a service. Or it can be as sophisticated as automatically designing and carrying out experiments in a production environment against ...
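
To make the simple end of that spectrum concrete, here is a minimal sketch of the kill -9 idea in Python. It is an illustration under stated assumptions, not a method taken from the book: the "service" is a throwaway child process that only sleeps, the function names start_stand_in_service and kill_dash_nine are hypothetical, and the code assumes a POSIX system where SIGKILL is available.

```python
import signal
import subprocess
import sys


def start_stand_in_service() -> subprocess.Popen:
    """Start a throwaway child process that plays the role of a service.

    Assumption: a real experiment would target an actual service process
    in a staging environment; this child only sleeps for 60 seconds.
    """
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def kill_dash_nine(proc: subprocess.Popen) -> int:
    """Send SIGKILL (the signal behind `kill -9`) and return the exit status."""
    proc.send_signal(signal.SIGKILL)
    # Wait for the operating system to reap the child, with a safety timeout.
    return proc.wait(timeout=5)


if __name__ == "__main__":
    service = start_stand_in_service()
    status = kill_dash_nine(service)
    # On POSIX, a return code of -N means the child was ended by signal N.
    print("ended by SIGKILL:", status == -signal.SIGKILL)
```

Killing the process is only the injection step. The experiment itself lies in observing how the rest of the system behaves once the process is gone, which this sketch does not attempt to model.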