Chapter 5. Chaos Testing

When the team at Netflix unveiled “Chaos Monkey,” a now well-known open source project, it had a disruptive effect on the IT world. The idea that Netflix had built code to randomly kill services in its production environment blew people’s minds. With many teams struggling just to meet their uptime requirements, deliberately sabotaging your own systems seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.

According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

In complex systems (software systems or ecological systems), things can and will fail; the ultimate goal is to prevent catastrophic failure of the overall system. So how do you verify that your overall system, your network of microservices, is in fact resilient? You inject a little chaos. With Istio, this is relatively simple: because the istio-proxy intercepts all network traffic, it can alter responses, including the time they take to arrive. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.
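To make the two fault types concrete, here is a minimal sketch that builds an Istio VirtualService manifest with a fault section. The helper name `fault_virtual_service`, the 50% and 503 values, the 7s delay, and the `networking.istio.io/v1alpha3` API version are assumptions chosen for illustration, not settings taken from this chapter. The `fault.abort` and `fault.delay` fields are part of Istio’s HTTP fault injection schema.

```python
import json


def fault_virtual_service(host, abort_status=None, abort_percent=0.0,
                          delay=None, delay_percent=0.0):
    """Build a VirtualService manifest (as a dict) for `host`.

    Hypothetical helper for illustration: it only assembles the manifest,
    it does not talk to a cluster.
    """
    fault = {}
    if abort_status is not None:
        # Return this HTTP status for the given percentage of requests.
        fault["abort"] = {
            "httpStatus": abort_status,
            "percentage": {"value": abort_percent},
        }
    if delay is not None:
        # Hold the given percentage of requests for a fixed duration.
        fault["delay"] = {
            "fixedDelay": delay,
            "percentage": {"value": delay_percent},
        }

    http_rule = {"route": [{"destination": {"host": host}}]}
    if fault:
        http_rule["fault"] = fault

    return {
        "apiVersion": "networking.istio.io/v1alpha3",  # assumed API version
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {"hosts": [host], "http": [http_rule]},
    }


if __name__ == "__main__":
    # Assumed values: fail half of the calls to recommendation with a 503.
    errors = fault_virtual_service("recommendation",
                                   abort_status=503, abort_percent=50.0)
    print(json.dumps(errors, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` in a test namespace. Passing `delay="7s", delay_percent=50.0` instead would produce the network delay variant.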

HTTP Errors

Based on exercises earlier in this book, make sure that recommendation v1 and v2 are both deployed with no code-driven misbehavior or long waits/latency. Now, you ...
