Chaos Engineering

Book description

Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you’ll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You’ll get the most out of chaos engineering by learning to think like a chaos engineer and to design experiments that verify the reliability of your software. With examples that cover a whole spectrum of software, you’ll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

About the Technology
Can your network survive a devastating failure? Could an accident bring your day-to-day operations to a halt? Chaos engineering simulates infrastructure outages, component crashes, and other calamities to show how systems and staff respond. Testing systems in distress is the best way to ensure their future resilience, which is especially important for complex, large-scale applications with little room for downtime.

About the Book
Chaos Engineering teaches you to design and execute controlled experiments that uncover hidden problems. Learn to inject system-shaking failures that disrupt system calls, networking, APIs, and Kubernetes-based microservices infrastructures. To help you practice, the book includes a downloadable Linux VM image with a suite of preconfigured tools so you can experiment quickly—without risk.

What's Inside
  • Inject failure into processes, applications, and virtual machines
  • Test software running on Kubernetes
  • Work with both open source and legacy software
  • Simulate database connection latency
  • Test and improve your team’s failure response

About the Reader
Assumes Linux servers. Basic scripting skills required.

About the Author
Mikolaj Pawlikowski is a recognized authority on chaos engineering. He is the creator of PowerfulSeal, a chaos engineering tool for Kubernetes, and Goldpinger, a network visibility tool.

Quotes
The topics covered in this book are easy to follow and detailed. It provides a number of hands-on exercises to help the reader master chaos engineering.
- Kelum Prabath Senanayake, Echoworx

The book we needed to improve our system’s reliability and resilience.
- Hugo Cruz, People Driven Technology

An important topic if you want to find hidden problems in your large system. This book gives a really good foundation.
- Yuri Kushch, Amazon

One of the best books about in-depth infrastructure, troubleshooting complex systems, and chaos engineering that I’ve ever read.
- Lev Andelman, Terasky Cloud & Devops

Table of contents

  1. inside front cover
  2. Chaos Engineering
  3. Copyright
  4. dedication
  5. brief contents
  6. contents
  7. front matter
    1. foreword
    2. foreword
    3. preface
    4. acknowledgments
    5. about this book
    6. Who should read this book
    7. How this book is organized: a roadmap
    8. About the code
    9. liveBook discussion forum
    10. about the author
    11. about the cover illustration
  8. 1 Into the world of chaos engineering
    1. 1.1 What is chaos engineering?
    2. 1.2 Motivations for chaos engineering
      1. 1.2.1 Estimating risk and cost, and setting SLIs, SLOs, and SLAs
      2. 1.2.2 Testing a system as a whole
      3. 1.2.3 Finding emergent properties
    3. 1.3 Four steps to chaos engineering
      1. 1.3.1 Ensure observability
      2. 1.3.2 Define a steady state
      3. 1.3.3 Form a hypothesis
      4. 1.3.4 Run the experiment and prove (or refute) your hypothesis
    4. 1.4 What chaos engineering is not
    5. 1.5 A taste of chaos engineering
      1. 1.5.1 FizzBuzz as a service
      2. 1.5.2 A long, dark night
      3. 1.5.3 Postmortem
      4. 1.5.4 Chaos engineering in a nutshell
    6. Summary
  9. Part 1. Chaos engineering fundamentals
  10. 2 First cup of chaos and blast radius
    1. 2.1 Setup: Working with the code in this book
    2. 2.2 Scenario
    3. 2.3 Linux forensics 101
      1. 2.3.1 Exit codes
      2. 2.3.2 Killing processes
      3. 2.3.3 Out-Of-Memory Killer
    4. 2.4 The first chaos experiment
      1. 2.4.1 Ensure observability
      2. 2.4.2 Define a steady state
      3. 2.4.3 Form a hypothesis
      4. 2.4.4 Run the experiment
    5. 2.5 Blast radius
    6. 2.6 Digging deeper
      1. 2.6.1 Saving the world
    7. Summary
  11. 3 Observability
    1. 3.1 The app is slow
    2. 3.2 The USE method
    3. 3.3 Resources
      1. 3.3.1 System overview
      2. 3.3.2 Block I/O
      3. 3.3.3 Networking
      4. 3.3.4 RAM
      5. 3.3.5 CPU
      6. 3.3.6 OS
    4. 3.4 Application
      1. 3.4.1 cProfile
      2. 3.4.2 BCC and Python
    5. 3.5 Automation: Using time series
      1. 3.5.1 Prometheus and Grafana
    6. 3.6 Further reading
    7. Summary
  12. 4 Database trouble and testing in production
    1. 4.1 We’re doing WordPress
    2. 4.2 Weak links
      1. 4.2.1 Experiment 1: Slow disks
      2. 4.2.2 Experiment 2: Slow connection
    3. 4.3 Testing in production
    4. Summary
  13. Part 2. Chaos engineering in action
  14. 5 Poking Docker
    1. 5.1 My (Dockerized) app is slow!
      1. 5.1.1 Architecture
    2. 5.2 A brief history of Docker
      1. 5.2.1 Emulation, simulation, and virtualization
      2. 5.2.2 Virtual machines and containers
    3. 5.3 Linux containers and Docker
    4. 5.4 Peeking under Docker’s hood
      1. 5.4.1 Uprooting processes with chroot
      2. 5.4.2 Implementing a simple container(-ish) part 1: Using chroot
      3. 5.4.3 Experiment 1: Can one container prevent another one from writing to disk?
      4. 5.4.4 Isolating processes with Linux namespaces
      5. 5.4.5 Docker and namespaces
    5. 5.5 Experiment 2: Killing processes in a different PID namespace
      1. 5.5.1 Implementing a simple container(-ish) part 2: Namespaces
      2. 5.5.2 Limiting resource use of a process with cgroups
    6. 5.6 Experiment 3: Using all the CPU you can find!
    7. 5.7 Experiment 4: Using too much RAM
      1. 5.7.1 Implementing a simple container(-ish) part 3: Cgroups
    8. 5.8 Docker and networking
      1. 5.8.1 Capabilities and seccomp
    9. 5.9 Docker demystified
    10. 5.10 Fixing my (Dockerized) app that’s being slow
      1. 5.10.1 Booting up Meower
      2. 5.10.2 Why is the app slow?
    11. 5.11 Experiment 5: Network slowness for containers with Pumba
      1. 5.11.1 Pumba: Docker chaos engineering tool
      2. 5.11.2 Chaos experiment implementation
    12. 5.12 Other parts of the puzzle
      1. 5.12.1 Docker daemon restarts
      2. 5.12.2 Storage for image layers
      3. 5.12.3 Advanced networking
      4. 5.12.4 Security
    13. Summary
  15. 6 Who you gonna call? Syscall-busters!
    1. 6.1 Scenario: Congratulations on your promotion!
      1. 6.1.1 System X: If everyone is using it, but no one maintains it, is it abandonware?
    2. 6.2 A brief refresher on syscalls
      1. 6.2.1 Finding out about syscalls
      2. 6.2.2 Using the standard C library and glibc
    3. 6.3 How to observe a process’s syscalls
      1. 6.3.1 strace and sleep
      2. 6.3.2 strace and System X
      3. 6.3.3 strace’s problem: Overhead
      4. 6.3.4 BPF
      5. 6.3.5 Other options
    4. 6.4 Blocking syscalls for fun and profit part 1: strace
      1. 6.4.1 Experiment 1: Breaking the close syscall
      2. 6.4.2 Experiment 2: Breaking the write syscall
    5. 6.5 Blocking syscalls for fun and profit part 2: Seccomp
      1. 6.5.1 Seccomp the easy way with Docker
      2. 6.5.2 Seccomp the hard way with libseccomp
    6. Summary
  16. 7 Injecting failure into the JVM
    1. 7.1 Scenario
      1. 7.1.1 Introducing FizzBuzzEnterpriseEdition
      2. 7.1.2 Looking around FizzBuzzEnterpriseEdition
    2. 7.2 Chaos engineering and Java
      1. 7.2.1 Experiment idea
      2. 7.2.2 Experiment plan
      3. 7.2.3 Brief introduction to JVM bytecode
      4. 7.2.4 Experiment implementation
    3. 7.3 Existing tools
      1. 7.3.1 Byteman
      2. 7.3.2 Byte-Monkey
      3. 7.3.3 Chaos Monkey for Spring Boot
    4. 7.4 Further reading
    5. Summary
  17. 8 Application-level fault injection
    1. 8.1 Scenario
      1. 8.1.1 Implementation details: Before chaos
    2. 8.2 Experiment 1: Redis latency
      1. 8.2.1 Experiment 1 plan
      2. 8.2.2 Experiment 1 steady state
      3. 8.2.3 Experiment 1 implementation
      4. 8.2.4 Experiment 1 execution
      5. 8.2.5 Experiment 1 discussion
    3. 8.3 Experiment 2: Failing requests
      1. 8.3.1 Experiment 2 plan
      2. 8.3.2 Experiment 2 implementation
      3. 8.3.3 Experiment 2 execution
    4. 8.4 Application vs. infrastructure
    5. Summary
  18. 9 There’s a monkey in my browser!
    1. 9.1 Scenario
      1. 9.1.1 Pgweb
      2. 9.1.2 Pgweb implementation details
    2. 9.2 Experiment 1: Adding latency
      1. 9.2.1 Experiment 1 plan
      2. 9.2.2 Experiment 1 steady state
      3. 9.2.3 Experiment 1 implementation
      4. 9.2.4 Experiment 1 run
    3. 9.3 Experiment 2: Adding failure
      1. 9.3.1 Experiment 2 implementation
      2. 9.3.2 Experiment 2 run
    4. 9.4 Other good-to-know topics
      1. 9.4.1 Fetch API
      2. 9.4.2 Throttling
      3. 9.4.3 Tooling: Greasemonkey and Tampermonkey
    5. Summary
  19. Part 3. Chaos engineering in Kubernetes
  20. 10 Chaos in Kubernetes
    1. 10.1 Porting things onto Kubernetes
      1. 10.1.1 High-Profile Project documentation
      2. 10.1.2 What’s Goldpinger?
    2. 10.2 What’s Kubernetes (in 7 minutes)?
      1. 10.2.1 A very brief history of Kubernetes
      2. 10.2.2 What can Kubernetes do for you?
    3. 10.3 Setting up a Kubernetes cluster
      1. 10.3.1 Using Minikube
      2. 10.3.2 Starting a cluster
    4. 10.4 Testing out software running on Kubernetes
      1. 10.4.1 Running the ICANT Project
      2. 10.4.2 Experiment 1: Kill 50% of pods
      3. 10.4.3 Party trick: Kill pods in style
      4. 10.4.4 Experiment 2: Introduce network slowness
    5. Summary
  21. 11 Automating Kubernetes experiments
    1. 11.1 Automating chaos with PowerfulSeal
      1. 11.1.1 What’s PowerfulSeal?
      2. 11.1.2 PowerfulSeal installation
      3. 11.1.3 Experiment 1b: Killing 50% of pods
      4. 11.1.4 Experiment 2b: Introducing network slowness
    2. 11.2 Ongoing testing and service-level objectives
      1. 11.2.1 Experiment 3: Verifying pods are ready within (n) seconds of being created
    3. 11.3 Cloud layer
      1. 11.3.1 Cloud provider APIs, availability zones
      2. 11.3.2 Experiment 4: Taking VMs down
    4. Summary
  22. 12 Under the hood of Kubernetes
    1. 12.1 Anatomy of a Kubernetes cluster and how to break it
      1. 12.1.1 Control plane
      2. 12.1.2 Kubelet and pause container
      3. 12.1.3 Kubernetes, Docker, and container runtimes
      4. 12.1.4 Kubernetes networking
    2. 12.2 Summary of key components
    3. Summary
  23. 13 Chaos engineering (for) people
    1. 13.1 Chaos engineering mindset
      1. 13.1.1 Failure is not a maybe: It will happen
      2. 13.1.2 Failing early vs. failing late
    2. 13.2 Getting buy-in
      1. 13.2.1 Management
      2. 13.2.2 Team members
      3. 13.2.3 Game days
    3. 13.3 Teams as distributed systems
      1. 13.3.1 Finding knowledge single points of failure: Staycation
      2. 13.3.2 Misinformation and trust within the team: Liar, Liar
      3. 13.3.3 Bottlenecks in the team: Life in the Slow Lane
      4. 13.3.4 Testing your processes: Inside Job
    4. Summary
    5. 13.4 Where to go from here?
  24. Appendix A. Installing chaos engineering tools
    1. A.1 Prerequisites
    2. A.2 Installing the Linux tools
      1. A.2.1 Pumba
      2. A.2.2 Python 3.7 with DTrace option
      3. A.2.3 Pgweb
      4. A.2.4 Pip dependencies
      5. A.2.5 Example data to look at for pgweb
    3. A.3 Configuring WordPress
    4. A.4 Checking out the source code for this book
    5. A.5 Installing Minikube (Kubernetes)
      1. A.5.1 Linux
      2. A.5.2 macOS
      3. A.5.3 Windows
  25. Appendix B. Answers to the pop quizzes
    1. Chapter 2
    2. Chapter 3
    3. Chapter 4
    4. Chapter 5
    5. Chapter 6
    6. Chapter 7
    7. Chapter 8
    8. Chapter 9
    9. Chapter 10
    10. Chapter 11
    11. Chapter 12
  26. Appendix C. Director’s cut (aka the bloopers)
    1. C.1 Cloud
    2. C.2 Chaos engineering tools comparison
    3. C.3 Windows
    4. C.4 Runtimes
    5. C.5 Node.js
    6. C.6 Architecture problems
    7. C.7 The four steps to a chaos experiment
    8. C.8 You should have included <tool X>!
    9. C.9 Real-world failure examples!
    10. C.10 “Chaos engineering” is a terrible name!
    11. C.11 Wrap!
  27. Appendix D. Chaos-engineering recipes
    1. D.1 SRE (’ShRoomEee) burger
      1. D.1.1 Ingredients
      2. D.1.2 Hidden dependencies
      3. D.1.3 Making the patty
      4. D.1.4 Assembling the finished product
    2. D.2 Chaos pizza
      1. D.2.1 Ingredients
      2. D.2.2 Preparation
  28. index
  29. inside back cover

Product information

  • Title: Chaos Engineering
  • Author(s): Mikolaj Pawlikowski
  • Release date: March 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617297755