Chaos Engineering

Building Confidence in System Behavior through Experiments

With so many interacting components, the number of things that can go wrong in a distributed system is enormous. You’ll never be able to prevent all possible failure modes, but you can identify many of the weaknesses in your system before real failures trigger them. This report introduces you to Chaos Engineering, a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.

Members of the Netflix team that developed Chaos Engineering explain how to apply these principles to your own system. By introducing controlled experiments, you’ll learn how emergent behavior from component interactions can cause your system to drift into an unsafe, chaotic state.

  • Hypothesize about steady state by collecting data on the health of the system
  • Vary real-world events, from turning off a server to simulating a regional failure
  • Run your experiments as close to the production environment as possible
  • Ramp up your experiment by automating it to run continuously
  • Minimize the effects of your experiments to keep from blowing everything up
  • Learn the process for designing chaos engineering experiments
  • Use the Chaos Maturity Model to map the state of your chaos program, including realistic goals
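As a rough illustration of the steps listed above, the following Python sketch compares a small experimental group, where a dependency failure is injected, against a control group of the same size. It is a toy simulation under stated assumptions, not the authors' tooling: `handle_request`, `run_experiment`, the 1% baseline error rate, and the `blast_radius` and `tolerance` parameters are all hypothetical names and values chosen for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_rate: float      # success rate without the injected fault
    experiment_rate: float   # success rate with the injected fault
    hypothesis_holds: bool   # True if steady state survived the fault


def handle_request(rng, fault_injected, fallback_enabled):
    """Hypothetical service call; returns True if the user was served."""
    # Assumed baseline: about 1% of requests fail for unrelated reasons.
    if rng.random() < 0.01:
        return False
    # With the fault injected the dependency is down, so only a
    # fallback path can still serve the request.
    if fault_injected and not fallback_enabled:
        return False
    return True


def run_experiment(requests=100_000, blast_radius=0.01, tolerance=0.02,
                   fallback_enabled=True, seed=0):
    """Run one experiment on a small slice of traffic.

    blast_radius limits the fraction of requests exposed to the fault.
    tolerance is the largest drop in success rate that still counts
    as steady state.
    """
    rng = random.Random(seed)
    group_size = max(1, int(requests * blast_radius))

    control_ok = sum(
        handle_request(rng, False, fallback_enabled) for _ in range(group_size)
    )
    experiment_ok = sum(
        handle_request(rng, True, fallback_enabled) for _ in range(group_size)
    )

    control_rate = control_ok / group_size
    experiment_rate = experiment_ok / group_size
    holds = (control_rate - experiment_rate) <= tolerance
    return ExperimentResult(control_rate, experiment_rate, holds)


if __name__ == "__main__":
    print(run_experiment(fallback_enabled=True))
    print(run_experiment(fallback_enabled=False))
```

In this sketch the steady-state hypothesis is that the experimental group's success rate stays within `tolerance` of the control group's. Without a fallback, every request in the experimental group fails, so the hypothesis is rejected and the weakness surfaces while only a small slice of traffic is affected.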

Casey Rosenthal

Casey Rosenthal is an Engineering Manager for the Chaos, Traffic, and Intuition Teams at Netflix. He is a frequent speaker and philosopher of distributed system architectures and the interaction of technology and people.

Lorin Hochstein

Lorin Hochstein is a Senior Software Engineer at Netflix, where he works on the Traffic and Chaos team.

Before joining Netflix, Lorin was a Senior Software Engineer at SendGrid Labs, the Lead Architect for Cloud Services at Nimbis Services, a Computer Scientist at the University of Southern California's Information Sciences Institute, and an Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln. Once upon a time, he conducted human subject experiments with programmers, but he hardly ever does that anymore, and his books are almost certainly not part of some elaborate software engineering research study. Why would you even think such a thing?

Lorin has a B.Eng. in Computer Engineering from McGill University, an M.S. in Electrical Engineering from Boston University, and a PhD in Computer Science from the University of Maryland.

Aaron Blohowiak

Aaron Blohowiak is a senior software engineer on the Chaos and Traffic team at Netflix. Aaron has a decade of experience taking down production, learning from mistakes, and striving to build ever more resilient systems.

Nora Jones

Nora Jones is passionate about making systems run reliably and efficiently. She is a Senior Software Engineer at Netflix specializing in Chaos Engineering. She has spoken at several conferences and, prior to joining Netflix, led both software- and hardware-based Internal Tools and Chaos teams at startups.

Ali Basiri

Ali Basiri is a Sr. Software Engineer at Netflix specializing in distributed systems. As a founding member of the Chaos Team, Ali's focus is on ensuring Netflix remains highly available through the application of the Principles of Chaos.