book

Chaos Engineering

by Casey Rosenthal, Nora Jones

April 2020

Intermediate to advanced

305 pages

8h 45m

English

O'Reilly Media, Inc.



Content preview from Chaos Engineering

Chapter 12. The Experiment Selection Problem (and a Solution)

Peter Alvaro

It is hard to imagine a large-scale, real-world system that does not involve the interaction of people and machines. When we design such a system, often the hardest (and most important) part is figuring out how best to use the two different kinds of resources. In this chapter, I make the case that the resiliency community should rethink how it leverages humans and computers as resources. Specifically, I argue that the problem of developing intuition about system failure modes using observability infrastructure, and ultimately discharging those intuitions in the form of chaos experiments, is a role better played by a computer than by a person. Finally, I provide some evidence that the community is ready to move in this direction.

Choosing Experiments

Independent from (and complementary to) the methodologies discussed in the rest of the book is the problem of experiment selection: choosing which faults to inject into which system executions. As we have seen, choosing the right experiments can mean identifying bugs before our users do, as well as learning new things about the behavior of our distributed system at scale. Unfortunately, due to the inherent complexity of such systems, the number of possible distinct experiments that we could run is astronomical—exponential in the number of communicating instances. For example, suppose we wanted to exhaustively test the effect of every possible combination of ...
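To make the scale of that search space concrete, here is a minimal sketch under a deliberately simplified fault model that is an assumption of this example, not the chapter's own elided scenario: each communicating instance either crashes or stays up during an experiment. Under that assumption, every non-empty subset of instances is a distinct experiment, so n instances yield 2^n - 1 candidates. The function names and the service names are hypothetical.

```python
from itertools import combinations


def count_fault_combinations(num_instances: int) -> int:
    """Count the non-empty sets of instances that could be crashed together.

    Assumes a crash-only fault model: each instance is either up or down.
    """
    return 2 ** num_instances - 1


def enumerate_fault_combinations(instances):
    """Yield every non-empty subset of instances as a tuple."""
    for size in range(1, len(instances) + 1):
        yield from combinations(instances, size)


if __name__ == "__main__":
    # Hypothetical three-instance system: small enough to list exhaustively.
    services = ["api", "cache", "db"]
    for faults in enumerate_fault_combinations(services):
        print("crash:", faults)

    # The count doubles with each added instance, so listing stops being
    # practical well before realistic system sizes.
    for n in (10, 20, 50):
        print(n, "instances:", count_fault_combinations(n), "experiments")
```

Even this crash-only model ignores timing, partial failures, and which system execution the faults are injected into, so the real space of experiments is larger still.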