Chapter 5. Chaos Testing

A relatively famous OSS project called Chaos Monkey came from the developer team at Netflix, and its unveiling to the IT world was quite disruptive. The idea that Netflix had built code that randomly kills various services in its production environment blew people’s minds. When many teams struggle to maintain their uptime requirements, promoting self-sabotage and attacking oneself seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.

According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (You can read more at http://principlesofchaos.org/.)

In complex systems (software systems or ecological systems), things do and will fail, but the ultimate goal is to stop catastrophic failure of the overall system. So how do you verify that your overall system, your network of microservices, is in fact resilient? You inject a little chaos. With Istio, this is a relatively simple matter: because the istio-proxy intercepts all network traffic, it can alter the responses, including the time it takes to respond. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.
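As a rough sketch of what such an injection might look like, the following VirtualService asks the istio-proxy to abort about half of the requests with an HTTP 503 and to delay about half of them by a fixed interval. The service name recommendation, the 50 percent figures, and the 7-second delay are assumptions chosen for illustration, and the apiVersion may differ depending on your Istio release.

```yaml
# Illustrative only: "recommendation" is a hypothetical service name.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation
  http:
  - fault:
      # Return an HTTP 503 instead of forwarding the request
      abort:
        httpStatus: 503
        percentage:
          value: 50        # assumed: affect roughly half of the requests
      # Hold the request before forwarding it to the service
      delay:
        fixedDelay: 7s     # assumed delay, chosen for illustration
        percentage:
          value: 50
    route:
    - destination:
        host: recommendation
```

In practice you would likely apply only one of the two faults at a time, so that the effect you observe downstream can be attributed to a single cause.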

HTTP Errors

This simple concept allows you to explore your overall system’s behavior when random faults pop up within ...
