Chapter 12. Using Service-Level Objectives for Reliability

While observability and traditional monitoring can coexist, observability unlocks the potential to use more sophisticated and complementary approaches to monitoring. The next two chapters will show you how practicing observability and service-level objectives (SLOs) together can improve the reliability of your systems.

In this chapter, you will learn about the common problems that traditional threshold-based monitoring approaches create for your team, how distributed systems exacerbate those problems, and how using an SLO-based approach to monitoring instead solves those problems. We’ll conclude with a real-world example of replacing traditional threshold-based alerting with SLOs. And in Chapter 13, we’ll examine how observability makes your SLO-based alerts actionable and debuggable.

Let’s begin by examining the role of monitoring and alerting, and the approaches traditionally taken to them.

Traditional Monitoring Approaches Create Dangerous Alert Fatigue

In monitoring-based approaches, alerts often measure the things that are easiest to measure. Metrics track simplistic system states that might indicate that a service’s underlying process(es) are running poorly, or that might be a leading indicator of troubles ahead. These states might, for example, trigger an alert if CPU is above 80%, or if available memory is below 10%, or if disk space is nearly full, or if more than x threads are running, or any set of other simplistic ...
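To make the kind of check described above concrete, here is a minimal sketch of threshold-based alerting in Python. The 80% CPU and 10% available-memory limits come from the examples in the text. The `HostMetrics` type, the alert names, the 95% figure standing in for "nearly full," and the 500-thread figure standing in for x are all hypothetical values chosen for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostMetrics:
    """Hypothetical snapshot of one host's metrics (illustrative only)."""
    cpu_percent: float
    memory_available_percent: float
    disk_used_percent: float
    thread_count: int


# Limits taken from the examples in the text.
CPU_LIMIT_PERCENT = 80.0
MEMORY_AVAILABLE_FLOOR_PERCENT = 10.0

# Assumptions: the text says only "nearly full" and "x many threads".
DISK_USED_LIMIT_PERCENT = 95.0
THREAD_LIMIT = 500


def threshold_alerts(metrics: HostMetrics) -> List[str]:
    """Return the name of every static threshold the snapshot crosses."""
    alerts = []
    if metrics.cpu_percent > CPU_LIMIT_PERCENT:
        alerts.append("cpu_high")
    if metrics.memory_available_percent < MEMORY_AVAILABLE_FLOOR_PERCENT:
        alerts.append("memory_low")
    if metrics.disk_used_percent >= DISK_USED_LIMIT_PERCENT:
        alerts.append("disk_nearly_full")
    if metrics.thread_count > THREAD_LIMIT:
        alerts.append("too_many_threads")
    return alerts
```

Each check looks only at a system state that is easy to measure. A call such as `threshold_alerts(HostMetrics(85.0, 40.0, 50.0, 120))` reports `cpu_high` whether or not the service itself is running poorly.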
