Chapter 12. Acting on and Debugging SLO-Based Alerts
In the preceding chapter, we introduced SLOs and an SLO-based approach to monitoring for more effective alerting. This chapter closely examines how observability data is used to make those alerts both actionable and debuggable. SLOs that use threshold-based monitoring data—or metrics—create alerts that are not actionable because they don’t provide guidance on fixing the underlying issue. However, using wide, structured event data for SLOs makes them both more precise and more debuggable.
Regardless of the degree of observability in your systems, using SLOs to drive alerting can be a productive way to make alerting less noisy and more actionable. SLIs can be defined to measure customer experience of a service in ways that directly align with business objectives. Error budgets set clear expectations between business stakeholders and engineering teams. Error budget burn alerts enable teams to ensure a high degree of customer satisfaction, align with business goals, and initiate an appropriate response to production issues, all without the cacophony common in symptom-based alerting, where alert storms are the norm.
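To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic, assuming an event-based SLI in which every eligible event is classified as either good or failed. The class name `ErrorBudget`, its fields, and the sample numbers are hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Illustrative error budget for an event-based SLO (hypothetical names)."""

    slo_target: float   # target fraction of good events, e.g. 0.999
    total_events: int   # eligible events observed in the SLO window
    failed_events: int  # events in the window that failed the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining(self) -> float:
        # Negative once the budget has been overspent.
        return self.allowed_failures - self.failed_events

    @property
    def burned_fraction(self) -> float:
        # Fraction of the budget consumed so far (can exceed 1.0).
        if self.allowed_failures == 0:
            return float("inf") if self.failed_events else 0.0
        return self.failed_events / self.allowed_failures


# Assumed sample data: a 99.9% target over one million events, 250 failures.
budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, failed_events=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget burned:    {budget.burned_fraction:.0%}")
```

Under these assumptions, a burn alert would be driven by how quickly `burned_fraction` is growing rather than by any single failed event, which is what keeps this style of alerting quieter than per-symptom thresholds.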
In this chapter, we dive deep into SLOs to examine error budgets and the mechanisms available to trigger SLO-based alerts. We’ll break down what an SLO error budget is and how it works; the forecasting calculations available for predicting SLO error budget exhaustion; and why it’s necessary to use wide, ...