Chapter 1. Why Do Chaos Engineering?

Chaos Engineering is an approach for learning about how your system behaves by applying a discipline of empirical exploration. Just as scientists conduct experiments to study physical and social phenomena, Chaos Engineering uses experiments to learn about a particular system.

Applying Chaos Engineering improves the resilience of a system. By designing and executing Chaos Engineering experiments, you will learn about weaknesses in your system that could potentially lead to outages that cause customer harm. You can then address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.

How Does Chaos Engineering Differ from Testing?

Chaos Engineering, fault injection, and failure testing have a large overlap in concerns and often in tooling as well; for example, many Chaos Engineering experiments at Netflix rely on fault injection to introduce the effect being studied. The primary difference between Chaos Engineering and these other approaches is that Chaos Engineering is a practice for generating new information, while fault injection is a specific approach to testing one condition.

When you want to explore the many ways a complex system can misbehave, injecting communication failures like latency and errors is one good approach. But we also want to explore things like a large increase in traffic, race conditions, Byzantine failures (poorly behaved nodes generating faulty responses, misrepresenting behavior, producing different data to different observers, etc.), and unplanned or uncommon combinations of messages. If a consumer-facing website suddenly gets a surge in traffic that leads to more revenue, we would be hard pressed to call that a fault or failure, but we are still very interested in exploring the effect that surge has on the system. Similarly, failure testing breaks a system in some preconceived way, but doesn’t explore the wide-open field of weird, unpredictable things that could happen.

An important distinction can be drawn between testing and experimentation. In testing, an assertion is made: given specific conditions, a system will emit a specific output. Tests are typically binary, and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it. Experimentation generates new knowledge, and often suggests new avenues of exploration. Throughout this book, we argue that Chaos Engineering is a form of experimentation that generates new knowledge about the system. It is not simply a means of testing known properties, which could more easily be verified with integration tests.

Examples of inputs for chaos experiments:

Simulating the failure of an entire region or datacenter.
Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
Injecting latency between services for a select percentage of traffic over a predetermined period of time.
Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
Code insertion: adding instructions to the target program so that fault injection occurs before certain instructions.
Time travel: forcing system clocks out of sync with each other.
Executing a routine in driver code emulating I/O errors.
Maxing out CPU cores on an Elasticsearch cluster.
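
As a rough illustration of two of the inputs listed above (injecting latency for a select percentage of traffic, and randomly causing functions to throw exceptions), here is a minimal Python sketch. The names `with_chaos`, `InjectedFault`, and `fetch_profile`, and all of the parameters, are hypothetical and chosen only for this example. They do not come from any particular chaos tool, and a real experiment would also need controls for scope, duration, and abort conditions.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Marks a failure that the experiment introduced on purpose."""


def with_chaos(latency_s=0.0, latency_rate=0.0, error_rate=0.0,
               rng=None, sleep=time.sleep):
    """Wrap a function so that a fraction of its calls is delayed or fails.

    Hypothetical helper for illustration only. `latency_rate` and
    `error_rate` are probabilities between 0.0 and 1.0.
    """
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Latency injection: delay a select percentage of calls.
            if rng.random() < latency_rate:
                sleep(latency_s)
            # Function-based chaos: make a select percentage of calls throw.
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Assumed rates: delay about 10% of calls by 200 ms and fail about 5% of them.
@with_chaos(latency_s=0.2, latency_rate=0.1, error_rate=0.05,
            rng=random.Random(42))
def fetch_profile(user_id):
    # Stand-in for a real call to a downstream service.
    return {"id": user_id}
```

Passing in the random number generator and the sleep function keeps the sketch repeatable and easy to exercise without real delays. The interesting part of an experiment is not the wrapper itself but what callers of `fetch_profile` do when the delay or the exception shows up.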

The opportunities for chaos experiments are boundless and may vary based on the architecture of your distributed system and your organization’s core business value.

It’s Not Just for Netflix

When we speak with professionals at other organizations about Chaos Engineering, one common refrain is, “Gee, that sounds really interesting, but our software and our organization are both completely different from Netflix, and so this stuff just wouldn’t apply to us.”

While we draw on our experiences at Netflix to provide specific examples, the principles outlined in this book are not specific to any one organization, and our guide for designing experiments does not assume the presence of any particular architecture or set of tooling. In Chapter 9, we discuss the Chaos Maturity Model for readers who want to assess whether, why, when, and how they should adopt Chaos Engineering practices.

Consider that at the most recent Chaos Community Day, an event that brings together Chaos Engineering practitioners from different organizations, there were participants from Google, Amazon, Microsoft, Dropbox, Yahoo!, Uber, cars.com, Gremlin Inc., University of California, Santa Cruz, SendGrid, North Carolina State University, Sendence, Visa, New Relic, Jet.com, Pivotal, ScyllaDB, GitHub, DevJam, HERE, Cake Solutions, Sandia National Labs, Cognitect, Thoughtworks, and O’Reilly Media. Throughout this book, you will find examples of Chaos Engineering, and tools for it, as practiced in industries from finance to e-commerce to aviation and beyond.

Chaos Engineering is also applied extensively in companies and industries that aren’t considered digital native, like large financial institutions, manufacturing, and healthcare. Do monetary transactions depend on your complex system? Large banks use Chaos Engineering to verify the redundancy of their transactional systems. Are lives on the line? Chaos Engineering is in many ways modeled on the system of clinical trials that constitute the gold standard for medical treatment verification in the United States. From financial, medical, and insurance institutions to rocket, farming equipment, and tool manufacturing, to digital giants and startups alike, Chaos Engineering is finding a foothold as a discipline that improves complex systems.

Prerequisites for Chaos Engineering

To determine whether your organization is ready to start adopting Chaos Engineering, you need to answer one question: Is your system resilient to real-world events such as service failures and network latency spikes?

If you know that the answer is “no,” then you have some work to do before applying the principles in this book. Chaos Engineering is great for exposing unknown weaknesses in your production system, but if you are certain that a Chaos Engineering experiment will lead to a significant problem with the system, there’s no sense in running that experiment. Fix that weakness first. Then come back to Chaos Engineering and it will either uncover other weaknesses that you didn’t know about, or it will give you more confidence that your system is in fact resilient.

Another essential element of Chaos Engineering is a monitoring system that you can use to determine the current state of your system. Without visibility into your system’s behavior, you won’t be able to draw conclusions from your experiments. Since every system is unique, we leave it as an exercise for the reader to determine how best to do root cause analysis when Chaos Engineering surfaces a systemic weakness.

Chaos Monkey

In late 2010, Netflix introduced Chaos Monkey to the world. The streaming service had started moving to the cloud a couple of years earlier. Vertically scaling in the datacenter had led to many single points of failure, some of which caused massive interruptions in DVD delivery. The cloud promised an opportunity to scale horizontally and move much of the undifferentiated heavy lifting of running infrastructure to a reliable third party.

The datacenter was no stranger to failures, but the horizontally scaled architecture in the cloud multiplied the number of instances that run a given service. With thousands of instances running, it was virtually guaranteed that one or more of these virtual machines would fail and blink out of existence on a regular basis. A new approach was needed to build services in a way that preserved the benefits of horizontal scaling while staying resilient to instances occasionally disappearing.
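
To see why failure becomes "virtually guaranteed" at scale, consider a back-of-the-envelope calculation. If each instance fails independently with probability p over some period, the chance that at least one of n instances fails is 1 - (1 - p)^n. The sketch below evaluates that formula. The failure probability of 0.1% per instance and the instance counts are assumptions picked for illustration, not measurements from Netflix, and real failures are not fully independent.

```python
def prob_at_least_one_failure(instances, p_fail):
    """Chance that at least one of `instances` independent instances fails.

    Assumes every instance fails independently with probability `p_fail`
    over the same period, which is a simplification.
    """
    return 1.0 - (1.0 - p_fail) ** instances


# Assumed per-instance failure probability for the period: 0.1%.
P_FAIL = 0.001

for n in (1, 100, 5000):
    print(n, round(prob_at_least_one_failure(n, P_FAIL), 4))
```

Under these assumed numbers, a single instance almost never fails, a fleet of 100 sees a failure a little under 10% of the time, and a fleet of 5,000 sees at least one failure more than 99% of the time.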

At Netflix, a mechanism doesn’t really exist to mandate that engineers build anything in any prescribed way. Instead, effective leaders create strong alignment among engineers and let them figure out the best way to tackle problems in their own domains. In this case of instances occasionally disappearing, we needed to create strong alignment to build services that are resilient to sudden instance termination and work coherently end-to-end.

Chaos Monkey pseudo-randomly selects a running instance in production and turns it off. It does this during business hours, and at a much more frequent rate than we typically see instances disappear. By taking a rare and potentially catastrophic event and making it frequent, we give engineers a strong incentive to build their service in such a way that this type of event doesn’t matter. Engineers are forced to handle this type of failure early and often. Through automation, redundancy, fallbacks, and other best practices of resilient design, engineers quickly make the failure scenario irrelevant to the operation of their service.
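
The core behavior described above can be sketched in a few lines of Python. This is not Netflix's implementation. The function names, the weekday 9:00 to 17:00 window, and the `terminate` callback are all assumptions made for illustration, and a real tool would call a cloud provider's API and respect opt-outs and termination groups.

```python
import datetime as dt
import random

# Assumed window; the text only says "business hours".
BUSINESS_START = dt.time(9, 0)
BUSINESS_END = dt.time(17, 0)


def in_business_hours(now):
    """True on Monday through Friday inside the assumed time window."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(instances, now, rng):
    """Pseudo-randomly choose one running instance, or None to do nothing."""
    if not instances or not in_business_hours(now):
        return None
    # Sort first so that a seeded rng gives a repeatable choice.
    return rng.choice(sorted(instances))


def run_once(instances, terminate, now=None, rng=None):
    """Terminate at most one instance through the supplied callback.

    `terminate` is a hypothetical callable that turns an instance off.
    Returns the chosen instance ID, or None if nothing was terminated.
    """
    if now is None:
        now = dt.datetime.now()
    if rng is None:
        rng = random.Random()
    victim = pick_victim(instances, now, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

For example, `run_once(["i-a", "i-b", "i-c"], print)` prints one of the three IDs when called during the assumed window and does nothing otherwise. Restricting the run to business hours reflects the point in the text: the failure should arrive while engineers are at work and able to respond.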

Over the years, Chaos Monkey has become more sophisticated in the way it specifies termination groups and integrates with Spinnaker, our continuous delivery platform, but fundamentally it provides the same features today that it did in 2010.

Chaos Monkey has been extremely successful in aligning our engineers to build resilient services. It is now an integral part of Netflix’s engineering culture. In the last five or so years, there was only one situation where an instance disappearing affected our service. In that situation, Chaos Monkey itself terminated the instance, which had mistakenly been deployed without redundancy. Fortunately, this happened during the day, not long after the service was initially deployed, and there was very little impact on our customers. Things could have been much worse if this service had been left on for months and then blinked out in the middle of the night on a weekend when the engineer who worked on it was not on call.

The beauty of Chaos Monkey is that it brings the pain of instances disappearing to the forefront, and aligns the goals of engineers across the organization to build resilient systems.

Chaos Engineering by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri
