Chapter 8. SLO Monitoring and Alerting
Previous chapters have discussed how SLOs can help support thinking about the user, determine what work various teams do, and aid balancing between reliability and feature velocity. We’ve talked about setting SLOs and how to choose what to do with them, we’ve talked about error budgets and how your organization can react when the budget is burned, and we’ve talked about what you can do when there’s budget to play with.
This chapter takes on the topic at the heart of implementing SLOs practically: monitoring, and especially alerting. It’s a complicated topic, so I’ll start off by explaining a few things that some advanced readers may feel they understand already. If this applies to you, it’s perfectly okay to skip ahead to the “how to” section. There may, however, be some useful material in the motivational section that is relevant when you have to convince other people about SLO monitoring/alerting, so you may want to take a look at it anyway.
Despite it being a complicated topic, the good news is that SLO alerting really is one of the most promising developments in the management of production systems today. It promises to get rid of a lot of the chaos, the noise, and the sheer uselessness of conventional alerting that teams experience, and replace them with something significantly more maintainable. However, for this to be possible, you need to substantially change how you think about alerting.
Motivation: ...