Chapter 1. SLOs: The Magic Behind SRE
As one might gather from the name, Site Reliability Engineering (SRE) prioritizes system reliability.1 Ben Treynor Sloss, Google’s vice president of 24/7 operations, who coined the term SRE, says reliability is the most vital feature of any service. If a system is not available and reliable, customers cannot use it, and the value the product provides diminishes.
No matter where your organization is in adopting SRE methodology, establishing service level objectives (SLOs) is a core best practice that organizations cannot overlook or put off until the future. SLOs safeguard the reliability of your service, and your customers likely will not be happy unless your service is reliable. SLOs set an explicit threshold for how reliable your service should be, which provides all stakeholders within an organization—executives, Site Reliability Engineers, developers, product managers, and operations teams—a common language for measuring service reliability and quality goals.
Thoughtfully constructed, customer-centric, and continuously improved-upon SLOs benefit the entire business. Think of SLOs as goals used to measure success, and service level indicators (SLIs) as the direct metrics of the service’s performance that inform whether the service is meeting the SLO. In other words, SLIs distinguish good events from bad events, and SLOs tell you what proportion of good events to bad events is acceptable, all in support of organizational goals.
Organizations must define objectives (SLOs) and indicators (SLIs) for measuring progress toward the goal. Teams then strive to reach targets by aligning their priorities and decision making with the goals.2 Organizations using SLOs and SLIs can make data-informed decisions about where to invest resources. If your service performs better than its SLOs, then you can invest in greater development velocity (for example, new features). If the system experiences issues and violates its SLOs, stakeholders can slow development and prioritize tasks to make the system more reliable.
In this chapter we define common service-level terminology and detail how organizations can leverage SLOs as powerful business tools.
Defining SRE Terms for Measuring and Managing Your System
Before exploring how SLOs drive business outcomes, we will review and differentiate between SLOs, SLIs, and service level agreements (SLAs). We’ll also explain how, together, they provide a framework for defining and measuring the level of service provided to customers.
As we define these terms, readers may find it helpful to frame the relationship between these three concepts in the following way: SLIs drive SLOs, which inform SLAs.3
SLIs: How Do We Measure Performance Against Our Goals?
You cannot implement SLOs without first identifying and defining SLIs for your service. Google’s Site Reliability Workbook defines SLIs as metrics over time that inform the health of a service. SLIs reflect business objectives and customer expectations; they specify which aspects of the service you are measuring and how you measure them.
Many organizations use SLIs to measure common system characteristics, such as the following:
- Availability: The fraction of the time that a service is usable, as expressed by a fraction of well-formed requests that succeed.
- Latency: How long it takes to return a response to a request.
- Error rates: Expressed as a fraction of all requests.
- System throughput: Often measured in requests per second.
- Durability: The likelihood that the system will retain the data over a long period of time.
SLIs are quantitative measures often aggregated over a measurement window and expressed as a rate, average, or percentile. We typically prefer to work with SLIs stated as a ratio of two numbers: the number of good events divided by the total number of events. For example, you may look at the number of successful HTTP requests / total HTTP requests. The SLI value would range from 0% (nothing works) to 100% (nothing is broken).
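As a minimal sketch of the ratio form described above, the following Python function computes an SLI from two event counts. The function name `sli_percent`, the sample counts, and the choice to report 100% for a window with no traffic are illustrative assumptions rather than anything the text prescribes.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return good events as a percentage of total events (0.0 to 100.0)."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: a window with no traffic is treated as fully healthy.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts: successful HTTP requests out of all HTTP requests.
availability = sli_percent(good_events=999_532, total_events=1_000_000)
print(f"availability SLI: {availability:.4f}%")
```

How to treat an empty window is a policy choice; some teams may prefer to report no value at all rather than 100%.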
Once you know how you will measure your service using SLIs, you can determine the targets you want to achieve and state them as an SLO. In other words, SLIs are the base metric used to compose an SLO.
SLOs: What Are Our Goals?
SLOs set a precise target level of reliability for customers, as measured by an SLI. They are defined thresholds for how often SLIs must be met. SLOs allow organizations to judge whether the value of an SLI measurement is good or bad. SLOs align all stakeholders, from individual contributors to vice presidents, on a common definition and a measured standard for reliability, focused on the customer. This common understanding promotes a shared sense of responsibility for the service.
At Google, SLOs frame all conversations about whether the system is running reliably and whether it is necessary to make design or architectural changes in order to meet the SLO.4 Because SLOs have product and business implications, product owners should choose SLOs, with the input of Site Reliability Engineers.
Determine SLO thresholds by looking at your SLIs. SLOs are binding targets for SLIs and may take the form of a target value or range of values. Common structures for SLOs include SLI ≤ target and lower-bound ≤ SLI ≤ upper-bound. SLOs should specify how they are measured and the conditions under which they are valid. For example, we might say that 99.9% of GET RPC calls will complete in less than 10 ms over 28 days, as measured across all backend server logs.
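The latency example above can be checked mechanically once the measurements are available. The sketch below is one possible way to do that in Python; the function names, the in-memory list of latencies, and the sample data are all hypothetical, and a real system would read the measurements from its own logs or monitoring backend.

```python
from typing import Sequence


def fraction_under(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests whose latency is strictly below the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: Sequence[float],
    threshold_ms: float = 10.0,  # "< 10 ms" from the example SLO
    target: float = 0.999,       # "99.9% of GET RPC calls"
) -> bool:
    """True if the share of fast requests is at or above the SLO target."""
    return fraction_under(latencies_ms, threshold_ms) >= target


# Made-up samples standing in for a 28-day window of GET RPC latencies.
healthy = [2.0] * 9_995 + [25.0] * 5     # 99.95% of calls under 10 ms
degraded = [2.0] * 9_980 + [25.0] * 20   # 99.80% of calls under 10 ms
print(meets_latency_slo(healthy), meets_latency_slo(degraded))
```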
SLAs: What Level of Service Are We Promising Our Customers?
SLAs are business-level agreements (often in the form of legal agreements between two organizations) that state the SLOs your service will meet over a given time frame. SLAs also detail the remediation you will provide, such as refunds or free credits, if the service misses the SLOs. The business typically sets the terms of the SLAs with customers, but Site Reliability Engineers focus on defending the SLOs included in SLAs.
We recommend that SLAs contain a slightly less restrictive SLO than your internal target, such as an availability target of 99.95% internally versus 99.9% shared with customers. A more restrictive internal SLO will provide you with an early warning before you violate the SLA.
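To illustrate how the gap between the two targets acts as an early warning, here is a small Python sketch that classifies a measured availability against both thresholds. The function name, the status labels, and the measured values are assumptions made for illustration; only the 99.95% and 99.9% figures come from the example above.

```python
def classify_availability(
    measured: float,
    internal_slo: float = 99.95,  # stricter internal target
    sla_slo: float = 99.9,        # looser target shared with customers
) -> str:
    """Return 'ok', 'warning', or 'sla_violation' for a measured percentage."""
    if internal_slo < sla_slo:
        raise ValueError("the internal SLO should be at least as strict as the SLA")
    if measured >= internal_slo:
        return "ok"
    if measured >= sla_slo:
        # Internal SLO missed, SLA still intact: time to act.
        return "warning"
    return "sla_violation"


for value in (99.97, 99.93, 99.80):  # hypothetical measurements
    print(value, classify_availability(value))
```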
SLOs Are the Driving Force Behind SRE Teams
As a part of committing to SRE, organizations must believe their success is predicated on service reliability. Site Reliability Engineers cannot manage their services correctly if they have not identified the behaviors most important to the service and to customers or how to measure them. Carefully considered SLOs prioritize the work of SRE teams by providing data points that allow leaders to weigh the opportunity cost of performing reliability work versus investing in functionality that will gain or retain customers.
SLOs establish thresholds for acceptable levels of reliability that SREs must protect and maintain. This is their primary responsibility, and it drives their priorities and daily tasks. Defending SLOs is also a core competency of Site Reliability Engineers, whose skill set goes far beyond automating processes or troubleshooting outages. Organizations should therefore align SREs’ tasks and priorities with the most important aspects of the most important services as defined by SLOs.
SLOs help SREs defend user happiness by clearly defining a target for service performance. SRE teams can easily see when quality of service declines below the SLO threshold and know that action must be taken.
SLOs are also critical elements in the control loops that SREs use to manage systems:
- Monitor and measure the system’s SLIs.
- Compare the SLIs to the SLOs and determine whether teams must take action.
- If action is needed, determine what needs to happen in order to return service to the target level.
- Take action.
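One pass through the control loop above could be sketched in Python as follows. The three callables (`measure_sli`, `plan_action`, `take_action`) are hypothetical hooks that stand in for whatever monitoring and remediation tooling a team actually uses; nothing here is a prescribed interface.

```python
from typing import Callable


def control_loop_step(
    measure_sli: Callable[[], float],
    slo_target: float,
    plan_action: Callable[[float, float], str],
    take_action: Callable[[str], None],
) -> bool:
    """Run one iteration of the loop; return True if action was taken."""
    sli = measure_sli()                     # Monitor and measure the SLI.
    if sli >= slo_target:                   # Compare the SLI to the SLO.
        return False                        # Within target: nothing to do.
    action = plan_action(sli, slo_target)   # Decide what needs to happen.
    take_action(action)                     # Take action.
    return True


# Hypothetical stubs: a fixed measurement and a log of actions taken.
actions_taken = []
acted = control_loop_step(
    measure_sli=lambda: 99.5,
    slo_target=99.9,
    plan_action=lambda sli, slo: f"roll back last release (SLI {sli} < SLO {slo})",
    take_action=actions_taken.append,
)
print(acted, actions_taken)
```

In practice this step would run repeatedly on a schedule, with the measurement drawn from the same window the SLO is defined over.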
Without SLOs, organizations will fail to realize the full value of their SRE teams. SLOs allow these specialized engineers to systematically communicate reliability while enhancing reliability by improving the product’s codebase.5
SLOs Are Powerful Business Tools That Drive Financial and Operational Performance
SLOs not only guide SRE teams but also provide insightful perspectives that drive financial and operational performance. How is that possible? SLOs and SLIs replace often fuzzy definitions of reliability with hard data by which you can measure whether your service is meeting its reliability targets.
SLOs and Error Budgets Allow Maximum Change Velocity While Protecting Stability
Part of what makes SLOs and SLIs powerful business tools is that they leave room for failure. We recommend that organizations set SLO thresholds below 100%. Each application has a unique set of requirements that dictate how reliable it must be before customers no longer notice a difference. Most customers will not and cannot distinguish between 100% reliability and 99.9% or, in some cases, 99.0% reliability. However, the costs and technical complexities associated with maintaining 100% reliability are immense. If the target is less than 100%, the organization can allocate resources that would otherwise be spent maintaining 100% reliability to other strategic initiatives.
Setting SLOs below 100% also allows organizations to leverage another important SRE service management tool: error budgets. The difference between 100% reliability and the SLO is your error budget, or the specified amount of downtime the service can experience during a set period. You can have an error budget only if your SLO is less than 100%.
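As a worked example of this definition, the Python sketch below converts an SLO percentage and a measurement window into an allowed amount of downtime. The function name and the 28-day window are illustrative assumptions, and the calculation treats the budget purely as time, which is only one of several ways to account for it.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window: (100% - SLO) of the window length."""
    if not 0.0 < slo_percent < 100.0:
        # A 100% SLO leaves no budget at all, as the text notes.
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return window * ((100.0 - slo_percent) / 100.0)


# Hypothetical window: a 99.9% SLO over 28 days allows roughly 40 minutes.
budget = error_budget(99.9, timedelta(days=28))
print(f"{budget.total_seconds() / 60:.2f} minutes of downtime allowed")
```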
By using an error budget to manage progress toward SLOs, organizations can confidently manage risk and make decisions about when to release features without sacrificing user happiness. All stakeholders must agree that if you exhaust or come close to exhausting the error budget, teams should stop other development work and focus on restoring stability. If you have a sufficient error budget, you can take actions, such as binary releases or configuration pushes, that may result in outages. Although users may experience unhappiness during your allotted downtime, the alternative of not releasing new features will likely cost the business more.
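A team might encode such an agreement as a simple release gate. The following Python sketch is one hypothetical version: it measures the budget in bad events rather than minutes, and the 10% caution threshold and the three decision labels are assumptions chosen for illustration, not a policy taken from the text.

```python
def budget_remaining(total_events: int, bad_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


def release_decision(remaining: float, caution_below: float = 0.10) -> str:
    """Map the remaining budget to a decision; thresholds are assumptions."""
    if remaining <= 0.0:
        return "freeze"    # Budget exhausted: stop feature work, restore stability.
    if remaining < caution_below:
        return "caution"   # Close to exhausting the budget: low-risk changes only.
    return "release"       # Enough budget left to absorb a risky push.


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
remaining = budget_remaining(1_000_000, 400, slo_target=0.999)
print(f"{remaining:.0%} of budget left -> {release_decision(remaining)}")
```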
SLOs Keep Business Decisions Focused on Customer Happiness
It is tempting to react to metrics that your team really cares about—for example, CPU or memory utilization. These metrics possibly indicate reliability issues and are easy to graph and understand, but a central tenet of SRE is that organizations should establish SLOs that measure the service attributes that matter most to their end users. Your users don’t care how many cores you’re using, or how much memory is available, as long as your service works for them.
As discussed earlier, SRE methodology is built on the principle that reliability is the most important attribute of your system. But SRE also says that’s true only to a point. Because SLOs define the point at which customers will become unhappy with the reliability of your service, they also ensure that SREs, product teams, developer teams, operations teams, and executives understand and measure reliability as it matters to the customer.
SLOs Set Customers’ Expectations
Organizations must set appropriate expectations for the reliability of their service. Including SLOs in SLAs or publishing SLOs for internal customers explicitly communicates how available your service will be. If your service is used as a backend to other services (for example, a database accessed by a web frontend), your SLO needs to be better than the SLO the frontend desires to meet. Having a defined metric prevents users from under-relying or over-relying on your system.
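A rough way to see why the backend target must be the stricter one is to multiply availabilities along the request path. The Python sketch below does this under two simplifying assumptions that are not from the text: every frontend request needs every listed dependency, and failures are independent. The function name and the figures are hypothetical.

```python
from typing import Iterable


def best_case_availability(own: float, dependencies: Iterable[float]) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes independent failures and hard dependencies, so the
    availabilities (as fractions between 0 and 1) simply multiply.
    """
    result = own
    for dependency in dependencies:
        if not 0.0 <= dependency <= 1.0:
            raise ValueError("availabilities must be fractions between 0 and 1")
        result *= dependency
    return result


# Hypothetical: a frontend that is 99.95% available on its own, backed by a
# database that only meets 99.9%, cannot itself reach a 99.9% target.
combined = best_case_availability(0.9995, [0.999])
print(f"{combined:.5%}")
```

Real systems often soften this bound with caching, retries, or graceful degradation, so treat the product as a starting point for the conversation rather than a prediction.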
When organizations count on excessive availability, they may create unreasonable dependencies on your system. Google experienced this with Chubby, our lock service for loosely coupled distributed systems. Because Chubby rarely went down, service owners added dependencies to the system. When Chubby did go down, the services could not function properly, causing the outages to be visible to end users.
To solve the problem of over-reliance, SRE ensures Chubby meets but does not exceed its SLOs. If, in any quarter, a true failure does not drop availability below the target, we will deploy a controlled outage.
Organizations also do not want to beat their SLOs by too much. If you start running your service at a performance level that’s consistently better than your actual SLO, users will come to expect that level. Users who build other services on top of yours will then be unhappily surprised when your service drops back to the level its SLO actually promises.
Summary
SLOs and error budgets are powerful data points that stakeholders can use to manage their services. SLOs can—and should—be a part of nearly all service-related business discussions. These numerical thresholds give decision makers data-driven insights that allow them to better balance development velocity and operational work.
Adopting an SLO and error budget-based approach helps inform development work and associated risk discussions about change management. Teams can spend the error budget as they wish, as long as they do not exhaust it and violate the SLO. Because the goal is no longer zero outages, SREs and product developers can manage the budget to attain maximum feature velocity.
In the next chapter, we will share the findings of the SLO Adoption and Usage Survey. We will highlight key SLO practices that the data indicates organizations are not implementing. Failure to implement these practices may be keeping organizations from fully attaining the valuable business insights we described throughout this chapter.
1 For our purposes, reliability is defined as “the probability that a [system] will perform a required function without failure under stated conditions for a stated period of time.” See P. O’Connor and A. Kleyner, Practical Reliability Engineering, 5th ed. (Hoboken, NJ: Wiley, 2012), 2.
2 See Fred Moyer, “A Guide to Service Level Objectives, Part 1: SLOs & You,” Circonus, July 11, 2018, https://www.circonus.com/2018/07/a-guide-to-service-level-objectives.
3 Google Cloud Platform, “SLIs, SLOs, SLAs, oh my!” video, 8:04, March 8, 2018, https://www.youtube.com/watch?v=tEylFyxbDLE&t=6m08s.
4 See Jay Judkowitz and Mark Carter, “SRE Fundamentals: SLIs, SLAs and SLOs,” Google Cloud Platform, July 19, 2018, https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos.
5 See Kurt Andersen and Craig Sebenik, What Is SRE? An Introduction to Site Reliability Engineering (O’Reilly, 2019).