Chaos Engineering

Book description

As more companies move toward microservices and other distributed technologies, the complexity of these systems increases. You can't remove the complexity, but through Chaos Engineering you can discover vulnerabilities and prevent outages before they impact your customers. This practical guide shows engineers how to navigate complex systems while optimizing to meet business goals.

Two of the field's prominent figures, Casey Rosenthal and Nora Jones, pioneered the discipline while working together at Netflix. In this book, they expound on the what, how, and why of Chaos Engineering while facilitating a conversation among practitioners across industries. Many chapters are written by contributing authors to widen the perspective across verticals within (and beyond) the software industry.

  • Learn how Chaos Engineering enables your organization to navigate complexity
  • Explore a methodology to avoid failures within your application, network, and infrastructure
  • Move from theory to practice through real-world stories from industry experts at Google, Microsoft, Slack, and LinkedIn, among others
  • Establish a framework for thinking about complexity within software systems
  • Design a Chaos Engineering program around game days and move toward highly targeted, automated experiments
  • Learn how to design continuous, collaborative chaos experiments

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. O’Reilly Online Learning
    3. How to Contact Us
    4. Acknowledgments
  2. Introduction: Birth of Chaos
    1. Management Principles as Code
    2. Chaos Monkey Is Born
    3. Going Big
    4. Formalizing the Discipline
    5. Community Is Born
    6. Fast Evolution
  3. I. Setting the Stage
  4. 1. Encountering Complex Systems
    1. Contemplating Complexity
    2. Encountering Complexity
      1. Example 1: Mismatch Between Business Logic and Application Logic
      2. Example 2: Customer-Induced Retry Storm
      3. Example 3: Holiday Code Freeze
    3. Confronting Complexity
      1. Accidental Complexity
      2. Essential Complexity
    4. Embracing Complexity
  5. 2. Navigating Complex Systems
    1. Dynamic Safety Model
      1. Economics
      2. Workload
      3. Safety
    2. Economic Pillars of Complexity
      1. State
      2. Relationships
      3. Environment
      4. Reversibility
      5. Economic Pillars of Complexity Applied to Software
    3. The Systemic Perspective
  6. 3. Overview of Principles
    1. What Chaos Engineering Is
      1. Experimentation Versus Testing
      2. Verification Versus Validation
    2. What Chaos Engineering Is Not
      1. Breaking Stuff
      2. Antifragility
    3. Advanced Principles
      1. Build a Hypothesis Around Steady-State Behavior
      2. Vary Real-World Events
      3. Run Experiments in Production
      4. Automate Experiments to Run Continuously
      5. Minimize Blast Radius
    4. The Future of “The Principles”
  7. II. Principles in Action
  8. 4. Slack’s Disasterpiece Theater
    1. Retrofitting Chaos
      1. Design Patterns Common in Older Systems
      2. Design Patterns Common in Newer Systems
      3. Getting to Basic Fault Tolerance
    2. Disasterpiece Theater
      1. Goals
      2. Anti-Goals
    3. The Process
      1. Preparation
      2. The Exercise
      3. Debriefing
    4. How the Process Has Evolved
    5. Getting Management Buy-In
    6. Results
      1. Avoid Cache Inconsistency
      2. Try, Try Again (for Safety)
      3. Impossibility Result
    7. Conclusion
  9. 5. Google DiRT: Disaster Recovery Testing
    1. Life of a DiRT Test
      1. The Rules of Engagement
      2. What to Test
      3. How to Test
      4. Gathering Results
    2. Scope of Tests at Google
    3. Conclusion
  10. 6. Microsoft Variation and Prioritization of Experiments
    1. Why Is Everything So Complicated?
      1. An Example of Unexpected Complications
      2. A Simple System Is the Tip of the Iceberg
    2. Categories of Experiment Outcomes
      1. Known Events/Unexpected Consequences
      2. Unknown Events/Unexpected Consequences
    3. Prioritization of Failures
      1. Explore Dependencies
    4. Degree of Variation
      1. Varying Failures
      2. Combining Variation and Prioritization
      3. Expanding Variation to Dependencies
    5. Deploying Experiments at Scale
    6. Conclusion
  11. 7. LinkedIn Being Mindful of Members
    1. Learning from Disaster
    2. Granularly Targeting Experiments
    3. Experimenting at Scale, Safely
    4. In Practice: LinkedOut
      1. Failure Modes
      2. Using LiX to Target Experiments
      3. Browser Extension for Rapid Experimentation
      4. Automated Experimentation
    5. Conclusion
  12. 8. Capital One Adoption and Evolution of Chaos Engineering
    1. A Capital One Case Study
      1. Blind Resiliency Testing
      2. Transition to Chaos Engineering
      3. Chaos Experiments in CI/CD
    2. Things to Watch Out for While Designing the Experiment
    3. Tooling
    4. Team Structure
    5. Evangelism
    6. Conclusion
  13. III. Human Factors
  14. 9. Creating Foresight
    1. Chaos Engineering and Resilience
    2. Steps of the Chaos Engineering Cycle
      1. Designing the Experiment
    3. Tool Support for Chaos Experiment Design
    4. Effectively Partnering Internally
      1. Understand Operating Procedures
      2. Discuss Scope
      3. Hypothesize
    5. Conclusion
  15. 10. Humanistic Chaos
    1. Humans in the System
      1. Putting the “Socio” in Sociotechnical Systems
      2. Organizations Are a System of Systems
    2. Engineering Adaptive Capacity
      1. Spotting Weak Signals
      2. Failure and Success, Two Sides of the Same Coin
    3. Putting the Principles into Practice
      1. Build a Hypothesis
      2. Vary Real-World Events
      3. Minimize the Blast Radius
      4. Case Study 1: Gaming Your Game Days
      5. Communication: The Network Latency of Any Organization
      6. Case Study 2: Connecting the Dots
      7. Leadership Is an Emergent Property of the System
      8. Case Study 3: Changing a Basic Assumption
      9. Safely Organizing the Chaos
      10. All You Need Is Altitude and a Direction
      11. Close the Loops
      12. If You’re Not Failing, You’re Not Learning
  16. 11. People in the Loop
    1. The Why, How, and When of Experiments
      1. The Why
      2. The How
      3. The When
      4. Functional Allocation, or Humans-Are-Better-At/Machines-Are-Better-At
      5. The Substitution Myth
    2. Conclusion
  17. 12. The Experiment Selection Problem (and a Solution)
    1. Choosing Experiments
      1. Random Search
      2. The Age of the Experts
    2. Observability: The Opportunity
      1. Observability for Intuition Engineering
    3. Conclusion
  18. IV. Business Factors
  19. 13. ROI of Chaos Engineering
    1. Ephemeral Nature of Incident Reduction
    2. Kirkpatrick Model
      1. Level 1: Reaction
      2. Level 2: Learning
      3. Level 3: Transfer
      4. Level 4: Results
    3. Alternative ROI Example
    4. Collateral ROI
    5. Conclusion
  20. 14. Open Minds, Open Science, and Open Chaos
    1. Collaborative Mindsets
    2. Open Science; Open Source
      1. Open Chaos Experiments
      2. Experiment Findings, Shareable Results
    3. Conclusion
  21. 15. Chaos Maturity Model
    1. Adoption
      1. Who Bought into Chaos Engineering
      2. How Much of the Organization Participates in Chaos Engineering
      3. Prerequisites
      4. Obstacles to Adoption
      5. Sophistication
    2. Putting It All Together
  22. V. Evolution
  23. 16. Continuous Verification
    1. Where CV Comes From
    2. Types of CV Systems
    3. CV in the Wild: ChAP
      1. ChAP: Selecting Experiments
      2. ChAP: Running Experiments
      3. The Advanced Principles in ChAP
      4. ChAP as Continuous Verification
    4. CV Coming Soon to a System Near You
      1. Performance Testing
      2. Data Artifacts
      3. Correctness
  24. 17. Let’s Get Cyber-Physical
    1. The Rise of Cyber-Physical Systems
    2. Functional Safety Meets Chaos Engineering
      1. FMEA and Chaos Engineering
    3. Software in Cyber-Physical Systems
    4. Chaos Engineering as a Step Beyond FMEA
    5. Probe Effect
      1. Addressing the Probe Effect
    6. Conclusion
  25. 18. HOP Meets Chaos Engineering
    1. What Is Human and Organizational Performance (HOP)?
    2. Key Principles of HOP
      1. Principle 1: Error Is Normal
      2. Principle 2: Blame Fixes Nothing
      3. Principle 3: Context Drives Behavior
      4. Principle 4: Learning and Improving Is Vital
      5. Principle 5: Intentional Response Matters
    3. HOP Meets Chaos Engineering
      1. Chaos Engineering and HOP in Practice
    4. Conclusion
  26. 19. Chaos Engineering on a Database
    1. Why Do We Need Chaos Engineering?
      1. Robustness and Stability
      2. A Real-World Example
    2. Applying Chaos Engineering
      1. Our Way of Embracing Chaos
      2. Fault Injection
      3. Fault Injection in Applications
      4. Fault Injection in CPU and Memory
      5. Fault Injection in the Network
      6. Fault Injection in the Filesystem
    3. Detecting Failures
    4. Automating Chaos
      1. Automated Experimentation Platform: Schrodinger
      2. Schrodinger Workflow
    5. Conclusion
  27. 20. The Case for Security Chaos Engineering
    1. A Modern Approach to Security
      1. Human Factors and Failure
      2. Remove the Low-Hanging Fruit
      3. Feedback Loops
    2. Security Chaos Engineering and Current Methods
      1. Problems with Red Teaming
      2. Problems with Purple Teaming
      3. Benefits of Security Chaos Engineering
    3. Security Game Days
    4. Example Security Chaos Engineering Tool: ChaoSlingr
      1. The Story of ChaoSlingr
    5. Conclusion
    6. Contributors/Reviewers
  28. 21. Conclusion
  29. Index

Product information

  • Title: Chaos Engineering
  • Author(s): Casey Rosenthal, Nora Jones
  • Release date: April 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492043867