Chapter 1. The Three Phases of Observability: An Outcomes-Focused Approach

The cloud native ecosystem has changed how people around the world work. It allows us to build scalable, resilient, and novel software architectures with idiomatic backend systems by using the power of the open source ecosystem and open governance.

How does it do that? Distributed architectures. The introduction of containers made the cloud flexible and empowered distributed systems. However, the ever-changing nature of these systems can cause them to fail in a multitude of ways. Distributed systems are inherently complex, and, as systems theorist Richard Cook notes, “Complex systems are intrinsically hazardous systems.”1

Think about how many different hazards a container faces: it can be terminated, it can run out of memory, it can fail its readiness probe, or its pod can be evicted from a restarting node, to name a few. These additional complexities are a trade-off for highly flexible, scalable, and resilient distributed architectures.

Distributed systems have many more moving parts. The constant struggle for high availability means that, more than ever, we need observability: the ability to understand changes within a system.

Thanks in large part to Cindy Sridharan’s concept of “three pillars of observability,” introduced in her groundbreaking work Distributed Systems Observability,2 many people think that if you have logs, traces, and metrics (Figure 1-1), you have observability. Let’s look quickly at each of these:

Logs

Logs describe discrete events and transactions within a system. They consist of timestamped messages generated by your application that, read together, tell a story about what’s happening.

Metrics

Metrics consist of time-series data that describes a measurement of resource utilization or behavior. They are useful because they provide insights into the behavior and health of a system, especially when aggregated.

Traces

Traces use unique IDs to track down individual requests as they hop from one service to another. They can show you how a request travels from one end to the other.

Indeed, as Sridharan makes clear, these are powerful tools that, if understood well, can unlock the ability to build better systems.

Figure 1-1. The three pillars of observability3
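To make these definitions concrete, here is a minimal, invented sketch of what each signal might look like for a hypothetical checkout service. Every field name and value is an illustrative assumption, not output from any particular tool:

    import time
    import uuid

    # A log: a discrete, timestamped message describing an event in the system.
    log_event = {
        "timestamp": time.time(),
        "level": "ERROR",
        "service": "checkout",                     # hypothetical service name
        "message": "payment authorization failed",
        "order_id": "order-1234",                  # invented example value
    }

    # A metric: one sample in a time series measuring behavior or utilization.
    metric_sample = {
        "name": "payment_requests_failed_total",   # hypothetical counter name
        "labels": {"service": "checkout", "region": "us-east-1"},
        "timestamp": time.time(),
        "value": 17,                               # cumulative failures so far
    }

    # A trace: spans sharing one trace ID, so a single request can be followed
    # as it hops from service to service.
    trace_id = uuid.uuid4().hex
    spans = [
        {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "service": "checkout",
         "operation": "POST /checkout", "duration_ms": 240},
        {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "service": "payments",
         "operation": "authorize_card", "duration_ms": 180},
    ]

In practice these shapes come from a logging library, a metrics client, and a tracing SDK rather than hand-built dictionaries; the point is only that logs are discrete events, metrics are numeric time series, and traces are correlated spans.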

However, as Rob Skillington pointed out at 2021’s SREcon,4 simply adding more data (and more types of data) won’t necessarily make observability more effective. After all, adding more data can easily create more noise and disorganization. Uber, he notes, initially used Graphite successfully with tens of microservices but found that it did not scale up to handle hundreds or thousands of microservices.

Martin Mao, along with Skillington, solved Uber’s scaling problem by building M3, Uber’s large-scale metrics platform. Mao points out that increasing your volume of logs, metrics, and traces does not guarantee a better outcome either. Metrics, like logs and traces, are simply inputs to observability; having all three does not necessarily lead to better observability, or even to proper observability at all. Thus, in our opinion, the data itself is the wrong thing to focus on.

If the three pillars of observability don’t in themselves constitute observability, then how do we measure observability? In our view, one of the most impactful ways is to see how well your observability system helps you remediate an issue within the system efficiently. Our approach shifts the focus from what kind of data you have to what kind of outcomes you want to strive for. This is an outcomes-focused approach.

But let’s take another step back and ask why we even want observability at all. What do we want to do with all this data we’re producing? It’s for a single, unchanging purpose: to remediate or prevent issues in the system.

As builders of that system, we tend to measure what we know best: we ask what kinds of metrics we should produce in order to tell whether something is wrong with the system and remediate it. Working backward from customer outcomes instead lets us focus on where the heart of observability should be: What is the best experience for the customer?

In most cases, the customer (whether they are external or internal) wants to be able to do what they came to do: for example, buy the products they are looking for. They cannot do that if the payment processor isn’t working. We can work backward from there: we don’t want our customers to be unable to buy products, so if the payment processor goes down or becomes degraded, we want to know as soon as possible so we can remediate that issue. To do that, we need to ensure that we can detect payment processor downtime quickly, then triage to make sure we know the impact and the root cause, all while looking for opportunities to rapidly remediate, stopping the customer’s pain.

Once you identify the outcomes you are looking for, the signals (metrics, logs, and traces) can play their role. If your customers need error-free payment processing, you can craft a way to measure and troubleshoot exactly that. When deciding on signals, then, we endorse starting from the outcomes you want.
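For instance, a sketch of such a measurement (the function names, counts, and 1% threshold are all illustrative assumptions) might compute the payment error rate over a window and alert only when the customer outcome is at risk:

    # Minimal sketch: turn raw payment request counts into an outcome-focused
    # signal (error rate) and an alert decision. Names and threshold are invented.

    ERROR_RATE_THRESHOLD = 0.01  # assumption: alert if more than 1% of payments fail

    def payment_error_rate(failed: int, total: int) -> float:
        """Fraction of payment requests that failed in the current window."""
        return 0.0 if total == 0 else failed / total

    def should_alert(failed: int, total: int) -> bool:
        """Fire only when the customer outcome (payments succeed) is threatened."""
        return payment_error_rate(failed, total) > ERROR_RATE_THRESHOLD

    # Example window: 120 failures out of 5,000 attempts -> 2.4% -> alert.
    print(should_alert(failed=120, total=5000))  # True

The same rate could be computed from metrics, derived from logs, or sampled from traces; the signal is chosen to match the outcome, not the other way around.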

In response to Sridharan, we call our approach the three phases of observability (Figure 1-2).

Figure 1-2. The three phases of observability5

As part of a remediation process, the three phases can be described in the following terms:

  1. Knowing quickly within the team if something is wrong

  2. Triaging the issue to understand the impact: identifying the urgency of the issues and deciding which ones to prioritize

  3. Understanding and fixing the underlying problem after performing a root cause analysis

Some systems are easier to observe than others. The key is understanding the system in question.

Let’s say you work for an ecommerce platform. It’s the annual Black Friday sale, and millions of people are logged in simultaneously. Here’s how the three phases of observability might play out for you:

Phase 1: Knowing

Suddenly, multiple alerts fire off to notify you of failures. You now know that requests are failing.

Phase 2: Triaging

Next, you triage the alerts to learn which failures are most urgent, identify which teams you need to coordinate with, and determine whether there is any customer impact. You scale up the infrastructure serving those requests and remediate the issue.

Phase 3: Understanding

Later on, you and your team perform a postmortem investigation of the issue. You learn that one of the components in the payments processor system is scanning multiple users and causing CPU cycles to increase tenfold—far more than necessary. You determine that this increase was the root cause of the incident. You and the team proceed to fix the component permanently.
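To make the triage in phase 2 concrete, here is a minimal sketch of ranking the firing alerts by customer impact so the most urgent failure is handled first; the alert fields, severity weights, and numbers are illustrative assumptions:

    # Sketch of phase 2 (triaging): rank firing alerts by customer impact.
    # The alert structure, severity weights, and counts are illustrative only.

    SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}

    firing_alerts = [
        {"name": "HighPaymentErrorRate",  "severity": "critical", "customers_affected": 40000},
        {"name": "ElevatedSearchLatency", "severity": "warning",  "customers_affected": 5000},
        {"name": "BatchReportJobDelayed", "severity": "info",     "customers_affected": 0},
    ]

    def impact_score(alert: dict) -> int:
        """Crude priority: severity weight scaled by how many customers are affected."""
        return SEVERITY_WEIGHT[alert["severity"]] * (1 + alert["customers_affected"])

    for alert in sorted(firing_alerts, key=impact_score, reverse=True):
        print(f"{alert['name']}: score {impact_score(alert)}")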

In this example, you resolved an issue using observability, even though you didn’t use all three signals. Looking only at the metrics dashboard, you determined which systems were causing the issue and guided the infrastructure team in fixing it.

Just like in mathematics, there are multiple ways to arrive at the correct answer; the important thing is to do so quickly and efficiently. If you can remediate a problem by relying only on your previous knowledge of the system, without using metrics, logs, and traces, that is still a good outcome. You remediated the problem, and that’s the real goal. And, of course, this is made easier with correct signals that are outcomes-based and can quickly validate any remediation assumptions!

Remediating at Any Phase

Although we posit three phases, at any phase your goal is always to remediate problems. If a single alert is firing and you can remediate the issue with that initial visibility alone (phase 1), you should do so. You don’t have to triage or perform a root cause analysis every time if they are unnecessary.

To illustrate this point, let’s say a scheduled deployment breaks your production environment. There is no need to triage or do root cause analysis here, since you already know that the deployment caused the breakage. Simply rolling back the deployment when errors become visible remediates the issue.

The Three Phases Illustrated

In real life, if your system is crashing, you don’t focus on the data. You focus on fixing the problem immediately. No one does a root cause analysis before fixing the current issue and mitigating customers’ pain.

Take, for example, a burning house (Figure 1-3).

Figure 1-3. Remediating a burning house: you should put out the fire before you start investigating the cause

If your house is on fire, how do you know? Most likely, your smoke alarm goes off, emitting a loud, unmistakable noise that notifies you of the problem. That smoke alarm is the alert, triggered by sensors detecting smoke in the room. Metrics can tell you what the issue is and give you enough information to address it. This is surface-level detection but enough to continue investigating. That’s phase 1. Metrics should give you a low mean time to detect (MTTD), so a sensitive fire alarm that goes off at the first sign of smoke or heat will be better than one that lets the fire spread for several minutes before notifying you—and better still than no alarm at all.

What now? You might jump out of bed and look around the house to see where the fire is, then get everyone out immediately and call emergency services. That’s a temporary remediation (phase 2): you’re all safe, but the house is still on fire. It’s also triaging: you are choosing to prioritize safety over other things, like saving your favorite electronics.

The sooner you can call emergency services, the faster they will arrive to put the fire out. In observability, we call this interval mean time to remediate (MTTR). This, too, should be as low as possible: if the firefighters arrive quickly and start hosing down the house right away, part of the house could be saved. If anyone is injured, you’ll want the paramedics to arrive quickly to help them: that is, you want a low MTTR.
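Because MTTD and MTTR both boil down to elapsed time, a minimal sketch of computing them across incidents might look like the following; the timestamps are invented, and some teams measure MTTR from when the issue began rather than from detection:

    from datetime import datetime, timedelta
    from statistics import mean

    # Invented incident timeline: (issue started, issue detected, issue remediated).
    incidents = [
        (datetime(2023, 11, 24, 9, 0),  datetime(2023, 11, 24, 9, 4),  datetime(2023, 11, 24, 9, 30)),
        (datetime(2023, 11, 24, 14, 0), datetime(2023, 11, 24, 14, 1), datetime(2023, 11, 24, 14, 12)),
    ]

    def minutes(delta: timedelta) -> float:
        return delta.total_seconds() / 60

    # MTTD: average time from the issue starting until an alert fires.
    mttd = mean(minutes(detected - started) for started, detected, _ in incidents)

    # MTTR (mean time to remediate, as this chapter uses it): average time from
    # detection until the customer-facing pain stops.
    mttr = mean(minutes(remediated - detected) for _, detected, remediated in incidents)

    print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")  # MTTD: 2.5 min, MTTR: 18.5 min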

The next morning, with everyone safe and the last embers extinguished, the fire marshal and insurance investigator examine the house to see what started the fire (a root cause analysis, phase 3). Perhaps they learn that a faulty cord on an appliance overheated. The appliance manufacturer might even recall the product to ensure that the faulty cords don’t start any more fires, a still more permanent remediation that keeps even more people safe.

No one does an investigation during an active fire, because there is still a threat of injury. Similarly, the worst time to do a deep dive on exactly how a system misbehaves is during an ongoing outage.

You do phases 1 and 2 immediately, before you try to figure out where the fire started or why. You focus on the outcome of keeping everyone safe. One way to fulfill that is to use metrics as your starting point: in this case, smoke in the room is the metric, and once you smell smoke, you automatically evacuate the house.

1 Richard Cook, “How Complex Systems Fail,” Cognitive Technologies Laboratory, 2000, https://oreil.ly/zw73j.

2 Cindy Sridharan, Distributed Systems Observability (O’Reilly Media, 2018), https://oreil.ly/v8PUu.

3 Sridharan, Distributed Systems Observability.

4 Rob Skillington, “SREcon21—Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability,” USENIX, October 14, 2021, YouTube video, 27:21, https://oreil.ly/gvAq7.

5 Adapted from an image in Rachel Dines, “Explain It Like I’m 5: The Three Phases of Observability,” Chronosphere, August 10, 2021, https://chronosphere.io/learn/explain-it-like-im-5-the-three-phases-of-observability.
