Chapter 5. Using Error Budgets to Manage a Service

Error budgets establish a framework within which product owners can manage innovation and product reliability. They provide an objective metric that tells you how unreliable your service can be in a given time period. Because the target is not perfection or 100% reliability, some failure is expected, and that expectation gives product teams the room, and SRE teams the ability, to embrace risk.
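To make "how unreliable your service can be in a given time period" concrete, the following minimal sketch converts an availability SLO into an error budget expressed as minutes of downtime. It assumes a time-based availability SLO and a 30-day window; the function name and its parameters are illustrative and do not come from the report.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of full unavailability an error budget permits.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    window_days is the assumed SLO window; 30 days is an illustrative default.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is the complement of the SLO target.
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(allowed_downtime_minutes(0.999), 1))
```

Under these assumptions, tightening the target from 99.9% to 99.99% shrinks the budget by a factor of ten, which is why the choice of target is itself a product decision.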

Error budgets operationalize this concept. They give both teams an incentive to find the right balance between innovation and reliability, goals that seem to be at odds but are not in the highest-performing organizations.1 When product and SRE teams agree on the error budget, conversations about velocity versus reliability become strategic, collaborative, and data driven. The less ambiguous and more data-based decisions can be, the better.
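One way such a data-driven conversation might look in practice is sketched below. It assumes a request-based SLO and a hypothetical policy in which releases pause once the budget for the window is fully consumed; the names, the threshold, and the policy itself are illustrative assumptions rather than anything prescribed by the report.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failed requests the budget permits in the window
    consumed_fraction: float  # share of the budget already spent (1.0 = all of it)
    can_release: bool         # hypothetical policy outcome


def budget_status(slo_target: float, total_requests: int, failed_requests: int,
                  freeze_threshold: float = 1.0) -> BudgetStatus:
    """Summarize error-budget consumption for a request-based SLO.

    freeze_threshold is an assumed policy knob: releases pause once the
    consumed fraction reaches it.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed
    return BudgetStatus(allowed, consumed, consumed < freeze_threshold)


if __name__ == "__main__":
    status = budget_status(slo_target=0.99, total_requests=1_000_000,
                           failed_requests=2_500)
    print(f"budget consumed: {status.consumed_fraction:.0%}, "
          f"release allowed: {status.can_release}")
```

The value of a sketch like this is less in the arithmetic than in the agreement behind it: both teams commit in advance to what happens when the number crosses the threshold.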

The SLO Adoption and Usage Survey did not explore error-budget usage or practices because not all teams use them. In our experience, many organizations implement error budgets after setting and monitoring SLOs. Our survey data seems to support this: nearly 40% of the respondents who use SLOs implemented them just in the past year, so it’s likely that many organizations are only now beginning to explore how to use error budgets to manage their services. At this level, the benefits of instituting the SRE framework become even more obvious ...
