Chapter 1. Observability and Chaos

“You see, but you do not observe.”

Sherlock Holmes, from “A Scandal in Bohemia” by Sir Arthur Conan Doyle

Observability and chaos engineering are two relatively new disciplines that, for good reason, the mainstream has begun to recognize. The principles of observability turn your systems into inspectable and debuggable crime scenes, and chaos engineering encourages and leverages observability as it seeks to help you pre-emptively discover and overcome system weaknesses.

In this chapter you’re going to learn how chaos engineering not only relies on observability but also, as a good citizen in your systems, needs to participate in your overall system observability picture.

The Value of Observability

Observability is a key characteristic of a successful system, particularly a production system. As systems evolve increasingly rapidly, they become more complex and more susceptible to failure. Observability is the key that helps you take on responsibility for systems where you need to be able to interrogate, inspect, and piece together what happened, when, and—most importantly—why. Observability brings the power of data to explore and fix issues and to improve products. “It’s not about logs, metrics, or traces, but about being data driven during debugging and using the feedback to iterate on and improve the product,” Cindy Sridharan writes.

Observability helps you effectively debug a running system without having to modify the system in any dramatic way. You can think of observability as a superset of system management and monitoring. Management and monitoring have traditionally been great at answering closed questions such as, “Is that server responding?” Observability extends this power to open questions such as, “Can I trace the latency of a user interaction in real time?” or “How successful was a user interaction that was submitted yesterday?”

Great observability signals help you become a “system detective”: someone who is able to shine a light on emergent system behavior and shape the mental models of the operators and engineers evolving the system. You are able to grasp, inspect, and diagnose the conditions of a rapidly changing, complex, and sometimes failing system. Observability helps you become the Sherlock Holmes of your own systems, able to ask and answer new questions as your system runs.

The Value of Chaos Engineering

Chaos engineering seeks to surface, explore, and test against system weaknesses through careful and controlled chaos experiments. In modern, rapidly evolving systems, parts fail all the time. Chaos engineering is key to discovering how those complex failures may affect the system and then validating over time that those weaknesses have been overcome.

You could learn from system outages when they occur and improve systems that way. It’s called incident-response learning, and the cycle is shown in Figure 1-1.

An image of the Learning Loop enabled through Incident Response Learning.
Figure 1-1. Incident-response learning is prompted by a system outage

Incident-response learning deserves a book of its own. Enabling it effectively by adopting approaches such as “Blameless Post-mortems” is a way to learn from and overcome system weaknesses. The challenge is that post-mortem learning alone is like learning to drive by jumping into a car for the first time and figuring out how to drive as the car speeds its way along the highway at 90 miles an hour! In other words, it is dangerous and potentially life-threatening, and you’d better learn quickly. Incident-response learning on its own is reactive, usually painful, and possibly very expensive.

Chaos engineering takes a different approach. Instead of waiting for a system weakness to cause a discernible outage, chaos engineering encourages you to deliberately cause a failure, in a controlled chaos experiment, so that you can explore it. This learning loop is depicted in Figure 1-2.

An image of the Learning Loop enabled through Chaos Engineering Pre-Mortem Learning.
Figure 1-2. Learning through chaos engineering starts with an automated experiment

Chaos Engineering Encourages and Contributes to Observability

Chaos engineering and observability are closely connected. To execute a chaos experiment with confidence, you need observability to detect when the system is in its normal, steady state and how it deviates from that steady state as the experiment’s method is executed.

When you detect a deviation from the steady state, the chaos experiment may have found an unsuspected and previously unobserved system behavior. At this point the team responsible for the system will look to the system’s observability signals to help them unpack the causes of, and contributing factors to, this deviation.

Chaos engineering often encourages, even demands, better system observability, but this is only part of the picture. A chaos experiment itself also needs to contribute to that picture, by sending signals like those shown in Figure 1-3, so that you can see which experiment was running when the system was exhibiting a set of observable characteristics.

An image of the steps involved in a chaos experiment's execution.
Figure 1-3. The flow of a chaos experiment’s execution
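
The relationship described above, where an experiment both depends on a steady-state check and reports on its own progress, can be sketched in a few lines of Python. This is a minimal illustration rather than the Chaos Toolkit's API: the `run_experiment` function, its parameters, the phase names, and the event fields are all hypothetical, and the "system" is simulated by a dictionary holding a latency value.

```python
import time
from typing import Callable, Dict, List

Event = Dict[str, object]


def run_experiment(
    title: str,
    probe: Callable[[], float],
    within_tolerance: Callable[[float], bool],
    method: Callable[[], None],
    emit: Callable[[Event], None],
) -> str:
    """Run one experiment and emit a signal at each step.

    Returns "aborted" if the system was not in its steady state to begin
    with, "held" if the steady state survived the method, and "deviated"
    if it did not. All names here are hypothetical.
    """

    def signal(phase: str, **details: object) -> None:
        # Every signal carries the experiment title, so an observer can tell
        # which experiment was running when the system behaved a certain way.
        emit({"experiment": title, "phase": phase,
              "timestamp": time.time(), **details})

    signal("started")

    # Check the steady state before doing anything to the system.
    before = probe()
    if not within_tolerance(before):
        signal("aborted", reason="system not in steady state", observed=before)
        return "aborted"
    signal("steady-state-verified", observed=before)

    # Apply the experiment's method (the controlled failure).
    signal("method-started")
    method()
    signal("method-finished")

    # Check the steady state again to detect any deviation.
    after = probe()
    outcome = "held" if within_tolerance(after) else "deviated"
    signal("finished", outcome=outcome, observed=after)
    return outcome


if __name__ == "__main__":
    # Simulated system: latency jumps when the "failure" is injected.
    state = {"latency_ms": 120.0}
    events: List[Event] = []

    outcome = run_experiment(
        title="kill-one-replica",
        probe=lambda: state["latency_ms"],
        within_tolerance=lambda value: value < 200.0,
        method=lambda: state.update(latency_ms=450.0),
        emit=events.append,
    )
    print(outcome)
    print([event["phase"] for event in events])
```

In this sketch the signals are simply appended to a list. In a real system the `emit` callback might forward them to whatever logging, tracing, or notification channel the team already uses, so that experiment activity shows up alongside the system's other observability signals.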

Summary

Chaos engineering experiments encourage and need to contribute to the observability of a system. In the next chapter you’ll see, with the help of the Chaos Toolkit, the types of observability signals a running chaos experiment can produce.
