The Site Reliability Workbook

by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
July 2018
Content level: Intermediate to advanced
506 pages
13h 58m
English
O'Reilly Media, Inc.
Content preview from The Site Reliability Workbook

Chapter 5. Alerting on SLOs

This chapter explains how to turn your SLOs into actionable alerts on significant events. Both our first SRE book and this book talk about implementing SLOs. We believe that good SLOs, which measure the reliability of your platform as experienced by your customers, provide the highest-quality indication of when an on-call engineer should respond. Here we give specific guidance on how to turn those SLOs into alerting rules so that you can respond to problems before you consume too much of your error budget.

Our examples present a series of increasingly complex implementations for alerting metrics and logic; we discuss the utility and shortcomings of each. While our examples use a simple request-driven service and Prometheus syntax, you can apply this approach in any alerting framework.

Alerting Considerations

To generate alerts from service level indicators (SLIs) and an error budget, you need a way to combine these two elements into a specific rule. Your goal is to be notified of a significant event: an event that consumes a large fraction of the error budget.
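As a rough illustration of combining an SLI with an error budget, the following Python sketch computes what fraction of the budget a batch of failed requests consumes and flags the event as significant past a cutoff. The function names, the request-count formulation, and the 2% cutoff are assumptions made for this example; the text does not prescribe them.

```python
def error_budget_consumed(slo_target: float, period_requests: int, bad_requests: int) -> float:
    """Return the fraction of the SLO period's error budget used by bad_requests.

    slo_target is a ratio such as 0.999; period_requests is the total number of
    requests expected over the SLO period (an assumed, request-based budget).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if period_requests <= 0:
        raise ValueError("period_requests must be positive")
    # The error budget is the number of requests allowed to fail in the period.
    allowed_bad = (1 - slo_target) * period_requests
    return bad_requests / allowed_bad


def is_significant(consumed_fraction: float, threshold: float = 0.02) -> bool:
    """Treat an event as significant if it consumes at least `threshold` of the budget.

    The 2% default is a hypothetical cutoff chosen only for illustration.
    """
    return consumed_fraction >= threshold


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
consumed = error_budget_consumed(0.999, 1_000_000, 50)
print(f"budget consumed: {consumed:.1%}, significant: {is_significant(consumed)}")
```

In a real alerting framework the same comparison would be evaluated continuously over a time window rather than on a single batch of counts.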

Consider the following attributes when evaluating an alerting strategy:

Precision

The proportion of events detected that were significant. Precision is 100% if every alert corresponds to a significant event. Note that alerting can become particularly sensitive to nonsignificant ...
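The definition of precision above can be sketched in Python as a simple ratio. The function name and the input shape (one boolean per fired alert, recording whether it matched a significant event) are assumptions for this example.

```python
from typing import Iterable, Optional


def alert_precision(alert_was_significant: Iterable[bool]) -> Optional[float]:
    """Proportion of fired alerts that corresponded to a significant event.

    Returns None when no alerts fired, since precision is undefined in that case.
    """
    flags = list(alert_was_significant)
    if not flags:
        return None
    significant = sum(1 for flag in flags if flag)
    return significant / len(flags)


# Four alerts fired; three corresponded to significant events.
print(alert_precision([True, True, False, True]))
```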


ISBN: 9781492029496