Appendix B. Example Error Budget Policy

Status: Published
Author: Steven Thurgood
Date: 2018-02-19
Reviewers: David Ferguson
Approvers: Betsy Beyer
Approval date: 2018-02-20
Revisit date: 2019-02-01

Service Overview

The Example Game Service allows Android and iPhone users to play a game with each other. New releases of the backend code are pushed daily. New releases of clients are pushed weekly. This policy applies to both backend and client releases.

Goals

The goals of this policy are to:

  • Protect customers from repeated SLO misses

  • Provide an incentive to balance reliability with other features

Non-Goals

This policy is not intended to serve as a punishment for missing SLOs. Halting change is undesirable; this policy gives teams permission to focus exclusively on reliability when data indicates that reliability is more important than other product features.

SLO Miss Policy

If the service is performing at or above its SLO, then releases (including data changes) will proceed according to the release policy.

If the service has exceeded its error budget for the preceding four-week window, we will halt all changes and releases other than P0 issues or security fixes until the service is back within its SLO.
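The halt rule above can be expressed as a simple release gate. The following is a minimal sketch, not part of the policy itself: it assumes a request-based SLI recorded as daily good/total counts, a hypothetical SLO target of 99.9% (the policy does not name one), and hypothetical change labels "p0" and "security" for the exempt categories. The names DailyCounts, budget_consumed, and release_allowed are illustrative only.

```python
from dataclasses import dataclass
from typing import Sequence

WINDOW_DAYS = 28  # the policy's "preceding four-week window"
SLO_TARGET = 0.999  # hypothetical target; the policy does not state one
EXEMPT_KINDS = frozenset({"p0", "security"})  # hypothetical labels for exempt changes


@dataclass(frozen=True)
class DailyCounts:
    good: int   # requests that met the SLI (assumed request-based SLI)
    total: int  # all eligible requests that day


def budget_consumed(days: Sequence[DailyCounts], slo_target: float = SLO_TARGET) -> float:
    """Fraction of the error budget spent over the last WINDOW_DAYS days (1.0 == fully spent)."""
    recent = days[-WINDOW_DAYS:]
    good = sum(d.good for d in recent)
    total = sum(d.total for d in recent)
    failures = total - good
    allowed = (1.0 - slo_target) * total  # failures the SLO tolerates in this window
    if allowed == 0:
        # No traffic (or a 100% target): any failure means the budget is exhausted.
        return 0.0 if failures == 0 else float("inf")
    return failures / allowed


def release_allowed(days: Sequence[DailyCounts], change_kind: str,
                    slo_target: float = SLO_TARGET) -> bool:
    """Exempt changes always ship; everything else waits until the service is within budget."""
    if change_kind in EXEMPT_KINDS:
        return True
    return budget_consumed(days, slo_target) <= 1.0
```

For example, 28 days at 99.95% success spend about half of a 99.9% budget, so a feature release proceeds; 28 days at 99.8% spend roughly twice the budget, so only the exempt change types would be allowed through. Real error budget tooling would normally read these counts from a monitoring system rather than a list.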

Depending upon the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work.

The team must work on reliability if:

  • A code bug or procedural error caused the service itself to exceed the error budget. ...
