Chapter 1. Filling the Observability Gap

For decades, IT operations engineers have relied on performance monitoring tools that process statistics from log files or other system output to produce views into the hosts, routers, virtual machines, and applications running in the environment. These tools also produce graphs and dashboards, sometimes by delivering the data to other tools. Monitoring can alert operators to trends, such as increasing traffic, that indicate when the system is at risk of a failure or breach.

But observability is a different kind of intelligence. Observability derives essential insights quickly from all the isolated data points that monitoring tools turn up about TCP/IP sessions, memory usage, and so forth. A consensus definition of observability is reflected in the Wikipedia entry (accessed March 13, 2021): “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

Observability uses sophisticated correlations between different types of data, plus advanced analytics that can verge on artificial intelligence, to tell you things such as “This is the application that’s slowing response time, and it’s slow because it’s starved of CPU.” Operations staff want a single screen that shows them why a problem occurred. Observability is also driven by the need to understand how IT system operations and performance contribute to (or ruin) the digital customer experience, a concern that has taken on much greater business ...
