Learning Chaos Engineering

Book description

Most companies work hard to avoid costly failures, but in complex systems a better approach is to embrace and learn from them. Through chaos engineering, you can proactively hunt for evidence of system weaknesses before they trigger a crisis. This practical book shows software developers and system administrators how to plan and run successful chaos engineering experiments.

System weaknesses go beyond your infrastructure, platforms, and applications to include policies, practices, playbooks, and people. Author Russ Miles explains why, when, and how to test systems, processes, and team responses using simulated failures on Game Days. You’ll also learn how to work toward continuous chaos through automation with features you can share across your team and organization.

  • Learn to think like a chaos engineer
  • Build a hypothesis backlog to determine what could go wrong in your system
  • Develop your hypotheses into chaos engineering experiments that you run as Game Days
  • Write, run, and learn from automated chaos experiments using the open source Chaos Toolkit
  • Turn chaos experiments into tests to confirm that you’ve overcome the weaknesses you discovered
  • Observe and control your automated chaos experiments while they are running

Table of contents

  1. Preface
    1. Audience
    2. What This Book Is About
    3. What This Book Is Not About
    4. About the Samples
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  2. I. Chaos Engineering Fundamentals
  3. 1. Chaos Engineering Distilled
    1. Chaos Engineering Defined
      1. Chaos Engineering Addresses the Whole Sociotechnical System
      2. Locations of Dark Debt
    2. The Process of Chaos Engineering
    3. The Practices of Chaos Engineering
      1. Sandbox/Staging or Production?
    4. Chaos Engineering and Observability
    5. Is There a “Chaos Engineer”?
    6. Summary
  4. 2. Building a Hypothesis Backlog
    1. Start with Experiments?
    2. Gathering Hypotheses
      1. Incident Analysis
      2. Sketching Your System
      3. Capturing “What Could Possibly Go Wrong?”
    3. Introducing Likelihood and Impact
      1. Building a Likelihood-Impact Map
      2. Adding What You Care About
    4. Creating Your Hypothesis Backlog
    5. Summary
  5. 3. Planning and Running a Manual Game Day
    1. What Is a Game Day?
    2. Planning Your Game Day
      1. Pick a Hypothesis
      2. Pick a Style of Game Day
      3. Decide Who Participates and Who Observes
      4. Decide Where
      5. Decide When and For How Long
      6. Describe Your Game Day Experiment
      7. Get Approval!
    3. Running the Game Day
      1. Consider a “Safety Monitor”
    4. Summary
  6. II. Chaos Engineering Automation
  7. 4. Getting Tooled Up for Automated Chaos Engineering
    1. Installing Python 3
    2. Installing the Chaos Toolkit CLI
    3. Summary
  8. 5. Writing and Running Your First Automated Chaos Experiment
    1. Setting Up the Sample Target System
      1. A Quick Tour of the Sample System
    2. Exploring and Discovering Evidence of Weaknesses
      1. Running Your Experiment
      2. Under the Skin of chaos run
      3. Steady-State Deviation Might Indicate “Opportunity for Improvement”
    3. Improving the System
    4. Validating the Improvement
    5. Summary
  9. 6. Chaos Engineering from Beginning to End
    1. The Target System
      1. The Platform: A Three-Node Kubernetes Cluster
      2. The Application: A Single Service, Replicated Three Times
      3. The People: Application Team and Cluster Administrators
    2. Hunting for a Weakness
      1. Naming Your Experiment
      2. Defining Your Steady-State Hypothesis
      3. Injecting Turbulent Conditions in an Experiment’s Method
      4. Using the Kubernetes Driver from Your Method
    3. Being a Good Citizen with Rollbacks
    4. Bringing It All Together and Running Your Experiment
      1. Overcoming a Weakness: Applying a Disruption Budget
    5. Summary
  10. 7. Collaborative Chaos
    1. Sharing Experiment Definitions
      1. Moving Values into Configuration
      2. Specifying Configuration Properties as Environment Variables
      3. Externalizing Secrets
      4. Scoping Secrets
    2. Specifying a Contribution Model
    3. Creating and Sharing Human-Readable Chaos Experiment Reports
      1. Creating a Single-Experiment Execution Report
      2. Creating and Sharing a Multiple Experiment Execution Report
    4. Summary
  11. 8. Creating Custom Chaos Drivers
    1. Creating Your Own Custom Driver with No Custom Code
      1. Implementing Probes and Actions with HTTP Calls
      2. Implementing Probes and Actions Through Process Calls
    2. Creating Your Own Custom Chaos Driver in Python
      1. Creating a New Python Module for Your Chaos Toolkit Extension Project
      2. Adding the Probe
    3. Summary
  12. III. Chaos Engineering Operations
  13. 9. Chaos and Operations
    1. Experiment “Controls”
    2. Enabling Controls
      1. Enabling a Control Inline in an Experiment
      2. Enabling a Control Globally
    3. Summary
  14. 10. Implementing Chaos Engineering Observability
    1. Adding Logging to Your Chaos Experiments
      1. Centralized Chaos Logging in Action
    2. Tracing Your Chaos Experiments
      1. Introducing OpenTracing
      2. Applying the OpenTracing Control
    3. Summary
  15. 11. Human Intervention in Chaos Experiment Automation
    1. Creating a New Chaos Toolkit Extension for Your Controls
    2. Adding Your (Very) Simple Human Interaction Control
    3. Skipping or Executing an Experiment’s Activity
    4. Summary
  16. 12. Continuous Chaos
    1. What Is Continuous Chaos?
    2. Scheduling Continuous Chaos Using cron
      1. Creating a Script to Execute Your Chaos Tests
      2. Adding Your Chaos Tests Script to cron
    3. Scheduling Continuous Chaos with Jenkins
      1. Grabbing a Copy of Jenkins
      2. Adding Your Chaos Tests to a Jenkins Build
      3. Scheduling Your Chaos Tests in Jenkins with Build Triggers
    4. Summary
  17. A. Chaos Toolkit Reference
    1. The Default Chaos Commands
      1. Discovering What’s Possible with the chaos discover Command
      2. Authoring a New Experiment with the chaos init Command
      3. Checking Your Experiment with the chaos validate Command
    2. Extending the Chaos Commands with Plug-ins
  18. B. The Chaos Toolkit Community Playground
  19. Index

Product information

  • Title: Learning Chaos Engineering
  • Author(s): Russ Miles
  • Release date: July 2019
  • Publisher(s): O’Reilly Media, Inc.
  • ISBN: 9781492051008