Executive Summary

Site Reliability Engineering (SRE) is an emerging IT service management (ITSM) framework that is essential to defending the reliability of an organization’s service by balancing two often competing demands: change/feature release velocity and site reliability. SRE methodology aligns teams on a common strategy for change management. Executives, product owners, developers, and Site Reliability Engineers agree upon a standard definition and acceptable level of reliability and decide what will happen if the organization fails to meet those standards. Organizations use these well-defined, concrete goals to set internal and external expectations for stability and to manage system changes against these specific performance metrics. The result is lower operational costs, enhanced development productivity, and faster feature releases.

SRE and DevOps methodologies emphasize different metrics for measuring IT infrastructure and improving software delivery performance. The Accelerate State of DevOps Report published by DevOps Research & Assessment (DORA) identifies four key metrics for measuring software development and delivery (which some refer to as “DevOps performance”): deployment frequency, lead time for changes, time to restore service, and change failure rate. Although these metrics are important, focusing on speed and stability alone is not sufficient for organizations that deliver services and applications online.

DORA’s report also finds that elite DevOps performers prioritize availability. Once you deliver a service to customers, they are unforgiving. Customer happiness and the value of the product diminish if users cannot access your service. SRE organizations deliver value by identifying and monitoring service reliability behaviors that matter most to end users.

To realize the full benefits of SRE, organizations need well-thought-out reliability targets known as service level objectives (SLOs). SLOs are measured by service level indicators (SLIs), each of which is a quantitative measure of one aspect of the service. As detailed in the following section, the measurable goals set forth in an organization’s SLOs eliminate the conflicts inherent in change management and event handling that cause the pace of innovation to slow and business to suffer.

Understanding how well your service meets expectations also gives managers valuable business perspectives. SLO compliance can inform whether you invest in making your system faster, more available, or more resilient. Or, if your system consistently meets SLOs, you may decide to invest staff time on other priorities, such as new products or features.

Managing Change with SLOs and Error Budgets

In many organizations, the rift between development and operations teams runs deep. The teams use different vocabularies and hold different assumptions about risk and system stability. Their goals also oppose one another. Development teams want to pursue maximum feature release velocity, while operations teams must protect service stability, which they often achieve by rejecting changes. This tension results in considerable indirect costs because each team creates hurdles (e.g., launch and change gates or fewer releases) to prevent the other from advancing its interests. These defensive mechanisms slow feature releases and put stability at risk, two results no organization wants.

By design, the SRE framework mitigates this structural conflict by aligning teams on customer-centered SLOs and requiring these teams to comanage an error budget, which dictates when to roll out new releases.

SLOs Solve the Dev/Ops Split

An SLO is a precise numerical target for a service level. A core tenet of SRE is that 100% is the wrong reliability target for your system, in part because 100% reliability isn’t possible. Instead, product teams, with the guidance of SREs, should define SLOs that are less than 100%. If the business meets its well-crafted SLOs, customers should be happy. If the business misses its SLO targets, customers will likely complain and may stop using the service.

Product and SRE teams determine appropriate SLOs by evaluating SLIs. SLIs are server- and client-side metrics that measure a level of service, such as request latency and availability. When you understand current performance as measured by SLIs, you can set an appropriate goal, or SLO, for the service. SLOs become the shared reliability goal for development, operations, and product teams.
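As a rough illustration of how an SLI relates to an SLO, the following Python sketch computes an availability-style SLI as the fraction of "good" requests and compares it with a candidate target. The `RequestCounts` type, the function names, and the sample numbers are all hypothetical and chosen only for illustration. How a team defines a "good" request is a product decision that this sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical tally of requests seen in one measurement window."""
    total: int
    good: int  # requests judged successful (and fast enough) by the team


def availability_sli(counts: RequestCounts) -> float:
    """Return the SLI as the fraction of good requests, from 0.0 to 1.0."""
    if counts.total == 0:
        # Assumption for this sketch: a window with no traffic has no failures.
        return 1.0
    return counts.good / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Illustrative numbers only: 350 bad requests out of one million.
window = RequestCounts(total=1_000_000, good=999_650)
sli = availability_sli(window)

print(f"Measured SLI: {sli:.4%}")
print("Meets a 99.9% SLO: ", meets_slo(sli, 0.999))
print("Meets a 99.99% SLO:", meets_slo(sli, 0.9999))
```

In this made-up window the service would satisfy a 99.9% target but not a 99.99% target, which is the kind of evidence a team might weigh when choosing a realistic SLO.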

SRE and product teams manage the service to the SLO using an error budget. An error budget is the difference between 100% reliability and the SLO; that is, one minus the SLO. A service with a 99.99% SLO has an error budget of 0.01%, meaning that up to 0.01% of its service, as measured by the associated SLIs, may be degraded. Development and operations teams comanage the error budget, together determining the best use of the permitted unavailability.
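The arithmetic behind an error budget can be sketched in a few lines of Python. The function names, the 30-day window, and the sample event counts below are assumptions made for illustration. The release rule shown (release only while some budget remains) is one simple, hypothetical policy and not a prescribed one.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: one minus the SLO."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate a time-based budget into minutes over an assumed window."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; negative means it is overspent."""
    allowed_bad = error_budget(slo_target) * total_events
    if allowed_bad == 0:
        return 0.0  # a 100% target, or no events, leaves nothing to spend
    return 1.0 - bad_events / allowed_bad


def can_release(slo_target: float, total_events: int, bad_events: int) -> bool:
    """Hypothetical policy: allow new releases only while budget remains."""
    return budget_remaining(slo_target, total_events, bad_events) > 0


slo = 0.9999  # the 99.99% SLO used in the text
print(f"Error budget: {error_budget(slo):.2%}")
print(f"Allowed downtime in 30 days: {allowed_downtime_minutes(slo):.2f} minutes")

# Illustrative numbers only: 40 bad requests out of one million.
remaining = budget_remaining(slo, total_events=1_000_000, bad_events=40)
print(f"Budget remaining: {remaining:.0%}")
print("Release allowed:", can_release(slo, 1_000_000, 40))
```

Under these assumptions, a 99.99% SLO permits roughly four minutes of full unavailability in a 30-day window, or about 100 bad requests per million. Teams that comanage the budget decide together how to spend that allowance.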

The error-budget concept facilitates innovation and fast release of new products and features because it allows for an acceptable failure rate. By removing the internal and external goal of zero outages, downtime becomes an accepted and expected part of innovation. Product and SRE teams can spend the error budget getting maximum feature velocity without the fear associated with failures.

SRE concepts are novel, and they represent a paradigm shift for product managers, developers, and operators. It may seem counterintuitive to allow and plan for failure, but maintaining 100% reliability is prohibitively expensive and ultimately slows progress toward business objectives. Google is not the only company to experience optimal reliability and release velocity using SLOs. This report includes case studies that detail how developing SLOs and using error budgets to manage their systems helped two companies, Schlumberger Limited and Evernote, drive better business performance and outcomes.

Key Findings from the SLO Adoption and Usage Survey

As SRE continues to gain popularity as a framework for managing enterprise software systems, we want to support organizations as they explore, consider, and adopt SRE principles. We surveyed industry professionals across a variety of industries, geographical regions, and company sizes to understand how they currently use SRE principles, especially SLOs.

Highlights from the survey include the following:

  • Nearly 54% of the respondents do not currently use SLOs, but half of those respondents plan to do so at some point.
  • Of the 46% of companies that use SLOs, 40% have had them in place for one year or less, and nearly two-thirds have used them for less than three years.
  • We found that 43% of respondents have SRE teams. Of those, 57% implemented their teams within the last three years (31% in the last year and 26% one to three years ago).

We are encouraged to see organizations embracing SRE teams and practices in recent years. However, many organizations may not realize the full advantages of SRE because they do not have SLOs in place. Without SLOs, it is impossible to uncover business insights and drive business outcomes. It is also better to implement SLOs early on and allow them to scale with your business. Establishing even a few SLOs supports reliability, making it less costly to introduce or expand the structure later.

For those using SLOs, we hope that you are seeing the benefits of having an agreed-upon measurement for service reliability. If SLO and SLI measurements do not facilitate decisions about feature release versus stability work, take your SLO practices to the next level by revisiting and revising your SLOs. The survey reveals that 50% of those with SLOs rarely or never engage in this best practice. Continuous improvement is a core tenet of SRE. Regularly evaluating and refining SLOs ensures that they remain relevant and measure the features most important to user happiness.
