Chapter 7. A New Observability Scorecard

Engineers at organizations like Google and Twitter originally promoted observability as a way not just to monitor their production systems but to understand the behavior of those systems using a relatively small number of signals. Borrowed from control theory, the term observability formally means that the internal states of a system can be inferred from its external outputs. This approach became necessary within these organizations because their systems grew highly complex while the number of people responsible for managing them stayed relatively small, so they needed a way to simplify the problem space. In addition, as members of site reliability engineering (SRE) organizations, many of the engineers who were responsible for observability were not working on the software directly, but on the infrastructure that operated it and made it reliable. As such, a model for understanding software performance from a set of external signals was appealing and, ultimately, necessary.

Despite this formal definition, many practitioners still struggle to understand observability. Many equate the term with the tools used to observe software systems: metrics, logging, and (as will come as no surprise to the reader) distributed tracing. These three tools became known as the “three pillars of observability,” each a necessary part of understanding system behavior. Though often implemented as separate tools, ...

