Services go down and people have a bad time. Customers who rely on the service become frustrated, other systems that depend on it stop working, and the people responsible for the system are paged. History suggests¹ that even the most celebrated online services are vulnerable to outages, even with hundreds and sometimes thousands of people dedicated to their operation and uptime. As software inexorably increases in complexity,² old methods of preventing errors and outages prove insufficient.
In the not-so-distant past, best practices around testing, code style, and process gave us confidence that the code we wrote and deployed would do what we expected. We believe that practices like rigorous testing, Test-Driven Development (TDD), Agile feedback loops, pair programming, and many others can help reduce bugs in the long run. Practices like these are still very important, but they are not sufficient for engineering modern complex systems.
New best practices are needed to give us confidence again in the systems that we build. Such practices are emerging to meet this need, and chaos engineering is among them. Chaos engineering is a new discipline, pioneered at Netflix, that is designed specifically to optimize for availability in complex, distributed systems. We can have our confidence, and engineer it, too.
I ran the Chaos Team at Netflix for three years, during the period when ...