Chaos Engineering

by Casey Rosenthal, Nora Jones

April 2020

Intermediate to advanced

305 pages

8h 45m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book · O’Reilly Online Learning · How to Contact Us · Acknowledgments
Introduction: Birth of Chaos
Management Principles as Code · Chaos Monkey Is Born · Going Big · Formalizing the Discipline · Community Is Born · Fast Evolution
I. Setting the Stage
1. Encountering Complex Systems
Contemplating Complexity · Encountering Complexity · Example 1: Mismatch Between Business Logic and Application Logic · Example 2: Customer-Induced Retry Storm · Example 3: Holiday Code Freeze · Confronting Complexity · Accidental Complexity · Essential Complexity · Embracing Complexity
2. Navigating Complex Systems
Dynamic Safety Model · Economics · Workload · Safety · Economic Pillars of Complexity · State · Relationships · Environment · Reversibility · Economic Pillars of Complexity Applied to Software · The Systemic Perspective
3. Overview of Principles
What Chaos Engineering Is · Experimentation Versus Testing · Verification Versus Validation · What Chaos Engineering Is Not · Breaking Stuff · Antifragility · Advanced Principles · Build a Hypothesis Around Steady-State Behavior · Vary Real-World Events · Run Experiments in Production · Automate Experiments to Run Continuously · Minimize Blast Radius · The Future of “The Principles”
II. Principles in Action
4. Slack’s Disasterpiece Theater
Retrofitting Chaos · Design Patterns Common in Older Systems · Design Patterns Common in Newer Systems · Getting to Basic Fault Tolerance · Disasterpiece Theater · Goals · Anti-Goals · The Process · Preparation · The Exercise · Debriefing · How the Process Has Evolved · Getting Management Buy-In · Results · Avoid Cache Inconsistency · Try, Try Again (for Safety) · Impossibility Result · Conclusion
5. Google DiRT: Disaster Recovery Testing
Life of a DiRT Test · The Rules of Engagement · What to Test · How to Test · Gathering Results · Scope of Tests at Google · Conclusion
6. Microsoft Variation and Prioritization of Experiments
Why Is Everything So Complicated? · An Example of Unexpected Complications · A Simple System Is the Tip of the Iceberg · Categories of Experiment Outcomes · Known Events/Unexpected Consequences · Unknown Events/Unexpected Consequences · Prioritization of Failures · Explore Dependencies · Degree of Variation · Varying Failures · Combining Variation and Prioritization · Expanding Variation to Dependencies · Deploying Experiments at Scale · Conclusion
7. LinkedIn Being Mindful of Members
Learning from Disaster · Granularly Targeting Experiments · Experimenting at Scale, Safely · In Practice: LinkedOut · Failure Modes · Using LiX to Target Experiments · Browser Extension for Rapid Experimentation · Automated Experimentation · Conclusion
8. Capital One Adoption and Evolution of Chaos Engineering
A Capital One Case Study · Blind Resiliency Testing · Transition to Chaos Engineering · Chaos Experiments in CI/CD · Things to Watch Out for While Designing the Experiment · Tooling · Team Structure · Evangelism · Conclusion
III. Human Factors
9. Creating Foresight
Chaos Engineering and Resilience · Steps of the Chaos Engineering Cycle · Designing the Experiment · Tool Support for Chaos Experiment Design · Effectively Partnering Internally · Understand Operating Procedures · Discuss Scope · Hypothesize · Conclusion
10. Humanistic Chaos
Humans in the System · Putting the “Socio” in Sociotechnical Systems · Organizations Are a System of Systems · Engineering Adaptive Capacity · Spotting Weak Signals · Failure and Success, Two Sides of the Same Coin · Putting the Principles into Practice · Build a Hypothesis · Vary Real-World Events · Minimize the Blast Radius · Case Study 1: Gaming Your Game Days · Communication: The Network Latency of Any Organization · Case Study 2: Connecting the Dots · Leadership Is an Emergent Property of the System · Case Study 3: Changing a Basic Assumption · Safely Organizing the Chaos · All You Need Is Altitude and a Direction · Close the Loops · If You’re Not Failing, You’re Not Learning
11. People in the Loop
The Why, How, and When of Experiments · The Why · The How · The When · Functional Allocation, or Humans-Are-Better-At/Machines-Are-Better-At · The Substitution Myth · Conclusion
12. The Experiment Selection Problem (and a Solution)
Choosing Experiments · Random Search · The Age of the Experts · Observability: The Opportunity · Observability for Intuition Engineering · Conclusion
IV. Business Factors
13. ROI of Chaos Engineering
Ephemeral Nature of Incident Reduction · Kirkpatrick Model · Level 1: Reaction · Level 2: Learning · Level 3: Transfer · Level 4: Results · Alternative ROI Example · Collateral ROI · Conclusion
14. Open Minds, Open Science, and Open Chaos
Collaborative Mindsets · Open Science; Open Source · Open Chaos Experiments · Experiment Findings, Shareable Results · Conclusion
15. Chaos Maturity Model
Adoption · Who Bought into Chaos Engineering · How Much of the Organization Participates in Chaos Engineering · Prerequisites · Obstacles to Adoption · Sophistication · Putting It All Together
V. Evolution
16. Continuous Verification
Where CV Comes From · Types of CV Systems · CV in the Wild: ChAP · ChAP: Selecting Experiments · ChAP: Running Experiments · The Advanced Principles in ChAP · ChAP as Continuous Verification · CV Coming Soon to a System Near You · Performance Testing · Data Artifacts · Correctness
17. Let’s Get Cyber-Physical
The Rise of Cyber-Physical Systems · Functional Safety Meets Chaos Engineering · FMEA and Chaos Engineering · Software in Cyber-Physical Systems · Chaos Engineering as a Step Beyond FMEA · Probe Effect · Addressing the Probe Effect · Conclusion
18. HOP Meets Chaos Engineering
What Is Human and Organizational Performance (HOP)? · Key Principles of HOP · Principle 1: Error Is Normal · Principle 2: Blame Fixes Nothing · Principle 3: Context Drives Behavior · Principle 4: Learning and Improving Is Vital · Principle 5: Intentional Response Matters · HOP Meets Chaos Engineering · Chaos Engineering and HOP in Practice · Conclusion
19. Chaos Engineering on a Database
Why Do We Need Chaos Engineering? · Robustness and Stability · A Real-World Example · Applying Chaos Engineering · Our Way of Embracing Chaos · Fault Injection · Fault Injection in Applications · Fault Injection in CPU and Memory · Fault Injection in the Network · Fault Injection in the Filesystem · Detecting Failures · Automating Chaos · Automated Experimentation Platform: Schrodinger · Schrodinger Workflow · Conclusion
20. The Case for Security Chaos Engineering
A Modern Approach to Security · Human Factors and Failure · Remove the Low-Hanging Fruit · Feedback Loops · Security Chaos Engineering and Current Methods · Problems with Red Teaming · Problems with Purple Teaming · Benefits of Security Chaos Engineering · Security Game Days · Example Security Chaos Engineering Tool: ChaoSlingr · The Story of ChaoSlingr · Conclusion · Contributors/Reviewers
21. Conclusion
Index

Content preview from Chaos Engineering

Chapter 6. Microsoft Variation and Prioritization of Experiments

Oleg Surmachev

At Microsoft we build and operate our own Chaos Engineering program for cloud infrastructure at scale. We find that experiment selection in particular has an outsized impact on the way you apply Chaos Engineering to your system. Examples of different failure scenarios in real production systems illustrate how a variety of real-world events can affect your production system. I’ll propose a method for prioritizing experimentation of your services, and then a framework for considering the variation of different experiment types. My goal in this chapter is to offer strategies you can apply in your engineering process to improve the reliability of your products.

Why Is Everything So Complicated?

Modern software systems are complex. There are hundreds, often thousands, of engineers working to enable even the smallest software product. There are thousands, maybe millions, of pieces of hardware and software that make up a single system that becomes your service. Think of all those engineers working for hardware providers like Intel, Samsung, Western Digital, and other companies designing and building server hardware. Think of Cisco, Arista, Dell, APC, and all other providers of network and power equipment. Think of Microsoft and Amazon providing you with the cloud platform. All of these dependencies you accept into your system explicitly or implicitly have their own dependencies in turn, all the way down ...

Publisher Resources

ISBN: 9781492043850
