Chapter 4. Constructing SLIs to Inform SLOs

Once you choose the service(s) you want to measure, you can then think about the SLIs you will use to measure users’ common tasks and critical activities. In our experience, choosing SLIs that represent the customer’s experience and obtaining accurate SLI measurements are two of the most difficult tasks that organizations undertake on their SRE journeys.

The SLO Adoption and Usage Survey found that most organizations (87%) use availability as an SLI. Although availability is an important SLI, it should not be the only SLI you use to measure the reliability of your service. Request latency and error rate are also important metrics indicative of system health. Depending on your service, durability and system throughput should also be considered as metrics.

We are encouraged to see that organizations that adopted SLOs within the last year were the most likely to implement all types of SLI metrics. The industry has long relied on “uptime” as the measure of reliability when, in fact, measuring only time “up” or “down” obscures many important details. Relying on uptime as the measure of reliability is particularly problematic in distributed and cloud computing environments, where systems are usually not binary “up” or “down,” and outages are the result of partial degradation with symptoms that are more diverse than “not working.”

We hope that the information provided in this chapter will help organizations select the best SLIs for their services. We will discuss what SLIs are, how they measure user happiness, how to build SLIs, considerations for choosing measurement strategies, and, finally, how to use SLIs to set SLO targets.

Although the process for identifying, measuring, and monitoring SLIs may seem daunting, keep in mind that having an imperfect SLI is better than no SLI. As your SLI and SLO practices mature, you can build more sophisticated SLIs that more closely correlate with end-user problems.

Defining SLIs

SLIs are quantifiable metrics that measure an approximation of a customer’s experience using your service. Common SLIs include the following:

Availability

For what fraction of events is the service usable by end users?

Request latency

How long does it take to respond to a request?

Error rate

How often does an error occur (typically displayed as a fraction of all requests received)?

Throughput

How much information can the system process (often measured in requests or bytes per second)?

Durability

How likely is it that your system will retain data over a period of time?

SLIs tell you whether you are in or out of compliance with your SLO targets and are therefore in danger of making users unhappy. For example, an SLO target may be that 99.95% of requests will be served within 400 ms in the previous four weeks. The SLI measures your performance against the SLO. If your SLI shows that only 95% of requests were served within 400 ms in the past four weeks, then you missed your SLO. If you continue to miss your SLO, your user experience suffers, and you know that you must take action to bring the SLI back into compliance with the SLO.
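The compliance check in the example above can be sketched in a few lines; all counts are illustrative, chosen to match the numbers in the text:

```python
# Illustrative counts matching the example: a 99.95% target for requests
# served within 400 ms over the previous four weeks.
served_within_400ms = 95_000
valid_requests = 100_000

sli = 100.0 * served_within_400ms / valid_requests
slo_target = 99.95

print(f"SLI = {sli:.2f}%")
print("SLO met" if sli >= slo_target else "SLO missed")
```

Here the SLI (95.00%) falls well short of the 99.95% target, signaling that action is needed.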

SLIs Are Metrics to Deliver User Happiness

How do you quantify user happiness? It’s not easy to measure directly in our systems, but we can look for signals in the user journey. You may experience an outage or other problem that internally seems relatively small, but your users take to Twitter in droves and express their displeasure. Or, you may have a catastrophic event but receive few or no complaints from end users. It is impossible to get inside your users’ heads and see whether they are happy or not while using your service. SLIs specify, measure, and track user journey success.

The key to selecting meaningful SLIs is to measure reliability from the user’s perspective, not your perspective. For example, if your website loads slowly for users, they do not care whether your database went down or your load balancer sent requests to bad backends. All the user thinks is, “I am not happy because the website loads too slowly.” SLIs quantify the user’s complaint that the website is slow. When you understand at what point “the website loads too slowly” impacts user happiness, you can use the data to enhance the customer’s experience.

Specific SLIs that closely represent end-user issues will identify where you should improve the user experience. An ideal SLI is a close-to-real-time metric expressed as a percentage from 0% to 100%.

With this SLI framework in place, a well-designed SLI should do the following:

  • Increase when customers become happier

  • Decrease when your customers become displeased

  • Show low variance during normal operations

  • Demonstrate very different measurements during outages versus normal operations1

This predictable and linear relationship between your SLIs and user happiness is critical because you will use these indicators to make engineering decisions. If your SLI value falls below your target level for a specified period of time, you know user happiness is suffering, and you should likely dedicate resources to restoring reliability to your service.

Common SLI Types

Choosing SLIs that successfully quantify aspects of the user journey can seem complex, but we have found that most users’ interactions can be collapsed down and mapped to recommended SLI types. Refer to these high-level guidelines as you start thinking about how to measure different aspects of the user’s journey.

Request and Response

If your service responds to a user’s request, then you should measure how quickly the response occurs and how many responses are successful. If your service relieves excess load by downgrading the quality of the response, you should also measure how often that occurs.

Data Processing

If your service processes data, then users probably have expectations regarding the time it takes to crunch the data. They probably also count on the accuracy of the data returned. SLIs that quantify these interactions include freshness and correctness of the processed data, and the coverage and throughput of the pipeline performing the processing.

Storage

If your service stores data for users, then they expect that they can access the data. To quantify this action, measure the durability of your storage layer.

Table 4-1 outlines the SLIs that you will likely want to measure for various user journeys.

Table 4-1. The SLI menu
Service            SLI type
Request/response   Availability
                   Latency
                   Quality
Data processing    Freshness
                   Coverage
                   Correctness
                   Throughput
Storage            Durability

SLI Structure

Although many numbers can function as an SLI, we like to structure SLIs as a percentage of good events versus bad events. The following is the SLI equation:

SLI = (good events / valid events) × 100%

The SLI equation requires that you use only valid events (not all events). When developing SLIs and determining how to collect and measure them, you may want to identify and exclude certain events so that they do not consume your error budget. There should be only a few exclusions. For example, if your system serves requests over HTTPS, you may determine validity by request parameters (e.g., hostname or request path) to scope the SLI to a set of response handlers that exercise a specific user-critical code path.
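The validity scoping described above can be sketched as follows; the request records, the `/checkout` path, and the non-5xx definition of "good" are all invented for illustration:

```python
# Invented request records; only the user-critical /checkout path counts
# as valid, and "good" means a non-5xx response.
requests = [
    {"path": "/checkout", "status": 200},
    {"path": "/checkout", "status": 503},
    {"path": "/healthz",  "status": 200},  # excluded: health check, not a user
    {"path": "/checkout", "status": 200},
]

valid = [r for r in requests if r["path"] == "/checkout"]
good = [r for r in valid if r["status"] < 500]

sli = 100.0 * len(good) / len(valid)
print(f"SLI = {sli:.1f}%")
```

Note that the health-check request never enters the denominator, so it cannot consume error budget.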

Standardize SLIs

We have also found it helpful to standardize the format of indicators using this structure. A consistent SLI format eliminates the process of reasoning out the structure of SLIs each time you create a new one. All stakeholders will also have an easier time understanding SLIs if they follow a consistent format within or across services.

This ratio allows you to express SLIs as a percentage on a scale of 0%–100%. The structure is intuitive—0% means everything is broken, and 100% means everything works. This format is also easy to apply to SLO targets and error budgets.

From a practical standpoint, a uniform style simplifies writing alerting logic because you can use the same inputs: numerator, denominator, and threshold. Apply the logic to tooling for SLO analysis, error-budget calculations, and reporting.

In addition to standardizing the structure of your SLIs, you can build a set of reusable SLI templates for common metrics. Features that fit into the standard definition templates can be omitted from the specification of an SLI. For example:

  • Aggregation intervals: “Averaged over 1 minute”

  • Data-access latency: “Time to last byte”

  • How frequently measurements are made: “Every 10 seconds”

Aggregate Measurements

Once you have your monitoring strategy set, you must consider how to view the data. We recommend that you aggregate raw measurements. Most SLIs are best expressed as a distribution rather than an average. Consider request latencies. It is possible for most requests to be fast but for a long tail of requests to be very slow. Averaging all requests obscures the tail latencies and the changes in those latencies.

Use percentiles for SLIs so that you can see the distribution and its varying characteristics. For example, a high-order percentile (e.g., 99th or 99.9th) shows you a worst-case value. Using the 50th percentile (the median) shows the typical case. Consider our latency example: the higher the variance in response time, the more the typical user experience is affected by long-tail behavior.
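A small sketch of why percentiles beat averages here, using Python's standard `statistics` module on synthetic latencies:

```python
import statistics

# Synthetic latencies: 95 fast requests and a long tail of 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]                     # 50th and 99th percentiles

print(f"mean = {mean} ms")   # looks tolerable, hides the tail
print(f"p50  = {p50:g} ms")  # the typical case
print(f"p99  = {p99:g} ms")  # the worst-case tail
```

The average suggests service is acceptable, while the 99th percentile reveals that one in a hundred users waits two full seconds.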

Developing SLIs

Now that you understand the purpose and properties of SLIs, you can formulate indicators. The following steps guide the development of your SLIs.

First, select an application for which you want to establish an SLO. You’re not setting the SLO target yet. Simply choose an application that should have an SLO.

Clearly define the users of this service and identify the common tasks and critical activities they perform when interacting with it. These are the people and the interactions whose happiness you want to maximize. Refer to the user journeys defined during product or feature creation when taking this step toward defining SLIs.

Draw the architecture of your system, showing key components, request flows, data flows, and critical dependencies. As you abstract your system, consider grouping your components into the following categories:

Request-driven

The user performs an action and expects a response (e.g., a user interacts with an API for a mobile application).

Pipeline

A process in which the system takes an input, alters it in some way, and puts the output elsewhere. Pipelines can range from single instances that process in real time to multistage batch processes that take several hours.

Storage

Systems that receive and store data that users can access again in the future.

With your system mapped out and your components identified and grouped together, the next step is to choose SLIs that will measure aspects of the user’s experience. If this is your first time selecting SLIs, pick an SLI that is most relevant to the user experience and is easy to measure. Expect some SLIs to overlap. Choose five or fewer SLI types that measure the most important functions for customers.

Finally, review the diagram and determine the SLIs that would measure the user’s experience. As you formulate SLIs, it helps to think of them as having two parts: SLI specification and SLI implementation.

SLI Specifications and SLI Implementations

Breaking SLIs into SLI specifications and SLI implementations is a great way to approach SLI development.

First, articulate the SLI specification. This is the assessment of service outcome that you believe users care about. At this point, do not consider how you will measure it. Focus on what users care about. Refer back to Table 4-1, “The SLI Menu,” to figure out what types of SLIs you want to use to measure your journey.

SLI specifications will be fairly high-level and general, for example, the ratio of home page requests that load in < 100 ms.

Then consider the SLI implementation—how you will measure the SLI specification. Make SLI implementations very detailed. They should be specific enough that someone can write monitoring configurations or software to measure the SLIs. Well-defined SLIs describe the following in detail:

  • What events you are measuring, including any units

  • Where you are measuring the SLI specification

  • What attributes of the monitoring metrics are included and excluded or any validity restrictions that scope the SLI to a subset of events

  • What makes an event good
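As a sketch, an SLI implementation that covers these four details might be written down like this before anyone touches monitoring config; every field value is an invented example, not a prescription:

```python
# Every field value below is an invented example of the level of detail an
# SLI implementation should reach before someone writes monitoring config.
sli_implementation = {
    "specification": "proportion of home page requests served in < 100 ms",
    "events": "HTTP GET requests, measured in individual requests",
    "measured_at": "load balancer access logs",
    "validity": "path == '/' and method == 'GET'",  # scopes valid events
    "good": "total response time recorded in the log entry < 100 ms",
}

for field, value in sli_implementation.items():
    print(f"{field}: {value}")
```

Writing the implementation this explicitly makes it straightforward for someone else to translate it into monitoring configuration.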

When considering how to implement your SLI measurements, choose data that is easy to gather. For example, rather than taking weeks to set up probes, use your web server logs if they are readily available.

You may have several possible SLI implementations for an SLI specification. You will have to weigh how well they reflect the customer’s experience (quality) versus how many customers’ experiences they encompass (coverage) versus cost.

Infrastructure considerations for SLI implementations

After deciding on your SLI implementations, assess how your infrastructure serves users’ interactions with your service. SLIs should have a close, predictable relationship with your user’s experience, so choose SLIs that directly measure the performance of your service against the user’s expectations, or in as close proximity to these expectations as possible. Often, you will have to measure a proxy because the most direct or relevant measure is hard to gather or interpret. For example, it is often difficult to measure client-side latency even though it is the most direct SLI. You may be able to measure only latency from the server side, but, if possible, measure related latencies in many locations to help surface problems in the request chain.

Also consider how that infrastructure could fail and how those failures will affect your implementations. Identify failure modes that your SLIs will not capture, and document them. Revise your SLI implementation if it will not capture high-probability or high-risk failures. You may have to change your measurement strategy or supplement it with a second one. For more details on measurement strategies, see "Ways to Measure SLIs".

When determining SLI implementations, you will have to weigh the pros and cons of the many options you will have to choose from. The following section details what you may want to consider when selecting SLI types for tracking reliability.

Tracking Reliability with SLIs

Let’s look at how we determine SLIs for availability, latency, and quality to track the reliability of a request response interaction in a user journey.

Availability

Availability is a critical SLI for systems that serve interactive requests from users. If your system does not successfully respond to requests, it is likely that users are not happy with the level of service. To measure reliability in this case, your SLI specification should be the proportion of valid requests served successfully.

Creating the SLI implementation is more difficult. You must decide the following:

  • Which of the requests the system serves are valid for the SLI?

  • What makes a response successful?

The role of the system and the method you choose to measure availability will inform your definition of success. As you map availability for an entire user journey, identify and measure the ways in which users may voluntarily end the journey before completion.

Measuring availability applies in many other circumstances. For example, the SLI specification to measure availability of a virtual machine may be the proportion of minutes that it was booted and accessible via SSH. In this case, creating the SLI implementation will require you to write complex logic as code and export a Boolean available/unavailable measure to your SLO monitoring system. You can define similar time-based SLIs for other systems based on the proportion of minutes they were available.
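A minimal sketch of such a time-based availability SLI; the window length and downtime figure are invented:

```python
# Invented figures: a one-day window in which probes observed 12 minutes
# of downtime (e.g., the VM was unreachable over SSH).
minutes_in_window = 60 * 24
minutes_up = minutes_in_window - 12

availability = 100.0 * minutes_up / minutes_in_window
print(f"VM availability = {availability:.2f}%")
```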

Latency

Users will not be happy if a system serving interactive requests does not send timely responses. The SLI specification for a request response latency is the proportion of valid requests served faster than a threshold.

To develop the SLI implementation, you must decide the following:

  • Which of the requests this system serves are valid for the SLI

  • When the timer for measuring latency starts and stops

Selecting a target for what constitutes “fast enough” depends on how well your measured latency captures the user experience. Many organizations use an SLO that measures the long tail. For example, 90% of requests < 450 ms and 99% of requests < 900 ms. The relationship between latency and user happiness tends to be an S curve, so you can better quantify user happiness by setting other thresholds that target latency for 75–90%.
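A sketch of checking latency against the two example thresholds above (90% of requests < 450 ms, 99% < 900 ms); the latency sample is synthetic:

```python
def pct_under(latencies_ms, threshold_ms):
    """Percentage of requests faster than the threshold."""
    return 100.0 * sum(l < threshold_ms for l in latencies_ms) / len(latencies_ms)

# Synthetic sample: 920 fast, 60 medium, 20 slow requests.
latencies_ms = [100] * 920 + [600] * 60 + [1500] * 20

for threshold, target in [(450, 90.0), (900, 99.0)]:
    measured = pct_under(latencies_ms, threshold)
    status = "met" if measured >= target else "missed"
    print(f"{measured:.1f}% of requests < {threshold} ms (target {target}%): {status}")
```

In this sample, the 450 ms target is met but the long-tail 900 ms target is missed, which is exactly the kind of tail problem a single threshold would hide.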

Latency is also an important reliability measure for tracking data processing. For example, if your batch processing pipeline runs daily, it probably should take less than a day to complete. The SLI implementation should reflect the time it takes to complete a task that a user queued because that directly affects their experience.

Quality

Creating quality SLIs is important if your system trades off quality of responses returned to users with another aspect of the service, such as memory utilization. The SLI specification for request response quality is the proportion of valid requests served without degrading quality.

To build the SLI implementation, you must decide the following:

  • Which requests served by the system are valid for the SLI

  • How you determine whether the response was served with degraded quality

Most systems have the ability to mark responses as degraded or to count such instances. This ability makes it easy to represent the SLI in terms of bad events rather than good events. If quality degradation is on a spectrum, set SLO targets at more than one point on the spectrum.
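A minimal sketch of a quality SLI computed from bad events; the counts and the "degraded" flag are hypothetical:

```python
# Hypothetical counts; "degraded" marks responses the server served at
# reduced quality while shedding load.
valid_requests = 10_000
degraded_responses = 25

quality_sli = 100.0 * (valid_requests - degraded_responses) / valid_requests
print(f"quality SLI = {quality_sli:.2f}%")
```

Counting bad events is often easier than proving a response was fully good, which is why this inverted form is convenient for quality SLIs.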

Ways to Measure SLIs

In addition to considering how well a potential SLI measures user happiness, you need to determine how and where you will measure it. We cover five strategies here for measuring SLIs. Each strategy has its own pros and cons that you must weigh when deciding how or whether to implement it. Because you want to choose SLIs that measure the user experience as closely as possible, we will cover the methods in order of proximity to the user.

Gather SLI metrics from processing server-side request logs. Using server-side request logs and data has several advantages. They can monitor the reliability of complicated user journeys that entail many request-response interactions during long-running sessions. Request logs are also well suited for organizations that are establishing SLOs for the first time. You can often process request logs retroactively to backfill SLI data. Use the historical performance data to determine a baseline level of performance from which you can derive an SLI.

If SLIs require complicated logic to discern between good and bad events, you can write the logic into the code of your logs and processing jobs and export the number of good and bad events as a simple good-events counter. Counters provide the most accurate telemetry, but the downside of this approach is that building something to process logs reliably will necessitate engineering effort.
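A toy sketch of such log-processing logic; the log format, the 400 ms "good" threshold, and the classification rule are all invented for illustration:

```python
# Invented log format: "METHOD PATH STATUS LATENCY_MS". The 400 ms
# threshold and non-2xx rule are illustrative classification logic.
log_lines = [
    "GET /checkout 200 120",
    "GET /checkout 200 95",
    "GET /checkout 500 30",
    "GET /checkout 200 480",
]

good = bad = 0
for line in log_lines:
    _, _, status, latency_ms = line.split()
    if status.startswith("2") and int(latency_ms) < 400:
        good += 1
    else:
        bad += 1

print(f"good={good} bad={bad} sli={100.0 * good / (good + bad):.1f}%")
```

A production version of this would run as a processing job and export only the two counters to the monitoring system.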

The downside to request logs is that processing will result in significant latency between an event occurring and the SLI observing it. This latency can make a log-based SLI a poor fit for triggering emergency responses. Log-based SLIs also will not observe requests that do not make it to your application servers. The same observability issue exists when you export metrics from your application services.

Although exporting metrics from stateless servers does not allow you to measure complicated, multirequest user journeys, it is easy to add application-level metrics (also known as whitebox metrics) that capture the performance of individual requests. These metrics do not result in measurement latency.

Also consider your or your cloud provider's frontend load balancing infrastructure. This measurement takes you up a level in the stack to measure the interactions that involve users making requests that your service responds to. For most services, this is the closest you can get to the user experience while remaining within infrastructure you control.

The upside of using frontend infrastructure metrics is that implementation should require little engineering work because your cloud provider should already have metrics and historical data readily available. Unfortunately, because load balancers are stateless, there is no way to track sessions, and cloud providers typically don't provide response data. In this case, you must ensure that metadata in the response envelope is set accurately to determine whether responses were good.

Another factor to consider is the inherent conflict of interest present—your application server exports metrics for response content, and it is responsible for generating those responses. It may not know that its responses are not good.

That’s when you may choose synthetic clients to measure SLIs. They mimic users’ interactions with your system, possibly from a point outside of your infrastructure. You can verify whether a user’s journey can be completed in its entirety and whether the responses received were good. The downside is that synthetic clients are not an exact replication of user behavior. Users are often unpredictable, and accounting for outlier cases in complicated user journeys can require substantial engineering work. We recommend that you do not use synthetic clients as your only measurement.

Another drawback to this approach is that, in our experience, synthetic probes tend to be finicky, flaky beasts. Synthetic probes sometimes end up sending invalid requests due to neglect or drift from real user behavior, or, without careful tuning, they can trigger content delivery network (CDN) rate limits. One workaround is to remove synthetic outlier requests from error-budget data.

The last measurement strategy to consider is client-side instrumentation. Because the data comes directly from the client, it is the most accurate measure of their experience. It also allows you to gain insights into the reliability of third parties (e.g., payment providers) in users' interactions. Although client data is the most accurate, it can create the same latency issues as log processing, which makes it unsuitable for triggering a short-term operational response. You may be able to collect client-side metrics in real time via tooling such as StatsD (a popular client metrics collection tool). Client-side measurement may also lower the signal-to-noise ratio of a prospective SLI because it captures many factors outside your control, such as browser or public network variations.

Use SLIs to Define SLOs

So far, we have discussed SLO fundamentals and have covered several topics related to choosing SLIs, including what makes a good SLI metric, common SLIs, and how to develop SLIs. Now, we can finally talk about how to use SLIs to set SLOs.

An SLO is a target value or range of values for a service level, as measured by an SLI over a specific period of time. As discussed in the Introduction, we typically structure SLOs in the following way:

SLI ≤ target

or

Lower bound ≤ SLI ≤ Upper bound

Your SLOs can fall into two broad categories based on how you determine them: achievable SLOs and aspirational SLOs.

Achievable SLOs

Achievable SLOs are determined by historical data. They are considered achievable because you have enough data to inform a target that you will likely meet most of the time. If you used existing metrics to build SLIs, use the historical data to select a target you will likely meet in the medium and long term.

Underlying the development of achievable SLOs is the idea that your service’s past performance creates your user’s current expectations. You cannot directly measure user happiness, but if your users are not complaining on social media or to customer support, chances are that your reliability target is correct. If performance levels decline, you will miss your SLOs and will have to dedicate engineers to fix the problem.

Because achievable SLOs assume that users are happy with current performance, organizations that implement achievable SLOs must be vigilant in revisiting and reevaluating their targets in the future.2

Aspirational SLOs

Not all organizations have historical data on which they can base reliability targets. Other organizations may know that users are not happy with their current or past performance. Or, the opposite may be true. Your service is more reliable than users expect, affording you the opportunity to establish a less strict target and increase development velocity without impacting user happiness. In these cases, you can establish an aspirational SLO. Business requirements and goals drive the creation of aspirational SLOs.

If you have no historical data, collaborate with your product team to develop a best guess about what will maintain user happiness. Begin measuring your SLIs and gather performance data over a few measurement windows before setting your initial targets. You can also estimate SLOs based on your business needs and existing indicators of user happiness. Taking an educated guess at a reasonable target, measuring it, and refining it over time is better than waiting until you can get it exactly right the first time.

Because aspirational SLOs are based on something you are trying to achieve, expect to miss them at first and to redefine them as you gather data.

Determine a Time Window for Measuring SLOs

You must apply a time interval to your SLOs. There are two types of time windows: rolling windows (e.g., the previous 30 days) and calendar windows (e.g., January 1–31).

Choosing a time window for measuring your SLOs can be difficult because there are many factors to consider. Let’s look at each time interval in more detail.

Rolling windows better align with the experience of your users. If implementing a rolling window, define the period as an integral number of weeks so that it always contains the same number of weekends. Otherwise, your SLIs may vary for unimportant reasons if traffic on the weekend differs greatly from traffic during the week.
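A small sketch of defining a rolling window as an integral number of weeks using Python's standard library; the dates are arbitrary:

```python
from datetime import date, timedelta

# Define the rolling window as an integral number of weeks (four), so every
# window contains exactly the same number of weekends.
WINDOW = timedelta(weeks=4)

today = date(2024, 3, 29)        # arbitrary evaluation date
window_start = today - WINDOW

print(window_start)              # start of the current rolling window
print(WINDOW.days % 7 == 0)      # whole weeks: weekend count is stable
```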

Calendar windows align better with business planning. To choose the correct measurement interval, you must consider whether you want to use the data to make decisions more quickly (shorter time frames) or to make strategic decisions that benefit from data collected over a longer period of time.

In our experience, four-week rolling windows work well as a general purpose window. To accommodate quick decision making, we send out weekly summaries that help to prioritize tasks. We then roll up the reports into quarterly summaries, which management uses for strategic project planning.

We recommend defining an SLO with only one time window, rather than giving each audience its own view; this encourages harmony between your development and operations teams. But you might recalculate the same data over different time horizons to derive additional metrics useful for certain stakeholders, such as on-call engineers (five minutes for on-call response) or executives (quarterly during a business review).

SLO Examples for Availability and Latency

Most organizations set an SLO for both availability and latency, so let’s look at examples of SLOs you may set for these SLIs.

Availability SLOs answer the question, Was the service available to our user? To formulate the SLO, tally the failures and known missed requests and record errors from the first point in your control. Report the measurements as a percentage. The following is an example of an availability SLO:

Availability: Node.js will respond with a non-500 response code for browser pageviews for at least 99.95% of requests in the month.

or

Availability: Node.js will respond with a non-503 for mobile API calls for at least 99.9% of requests in the month.

Requests that take longer than 30 seconds (or 60 seconds for mobile) count against your availability SLO because the service may have been down.
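A sketch of evaluating such an availability SLO; the request records are fabricated, and reading "non-500" as "non-5xx" is an assumption:

```python
# Fabricated request records; "good" = non-5xx status served within 30 s.
requests = [
    {"status": 200, "latency_s": 0.3},
    {"status": 503, "latency_s": 0.1},   # error response
    {"status": 200, "latency_s": 45.0},  # too slow: treated as unavailable
    {"status": 200, "latency_s": 1.2},
]

good = sum(1 for r in requests if r["status"] < 500 and r["latency_s"] <= 30)
availability = 100.0 * good / len(requests)
print(f"availability = {availability:.1f}%")
```

Note that the over-30-second request counts against availability even though it eventually returned a 200.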

A latency SLO measures how quickly a service performed for users. To calculate a latency SLO, count the number of queries slower than a threshold and report them as a percentage of total queries. The following is an example of a latency SLO:

Latency: Node.js will respond within 250 ms for at least 50% of requests in the month and within 3000 ms for at least 99% of requests in the month.

Iterating and Improving SLOs

SLOs should evolve as your system or user journeys evolve. Over time, your system changes, and your current SLOs may not cover new features or new user expectations. Plan to review your SLO and SLI definitions after a few months and modify them to reflect the current state of your system and user experience.

The SLO Adoption and Usage Survey finds that the majority (54%) of respondents with SLOs do not regularly reevaluate them and that 9% never review them. This is a major oversight by organizations. Frequent reviews are especially important when starting your SLO journey. As we have mentioned several times, the best way to implement SLOs is to start using them and iterate rather than getting caught up in being perfect. You will find it easier to discover areas you are not covering and gaps between your SLOs and user expectations if you have something to review.

At the start of your SLO journey, review SLOs as often as once every 1–3 months. Once you establish that the SLO is appropriate, you can reduce reviews to once every 6–12 months. Another option is to align SLO reviews with objectives and key results (OKR) goal setting. Consider all customer groups (i.e., mobile, desktop, or different geographies). Reviews should include an assessment of the SLI and all of the details of the customer groups.

Because SLOs help you maintain reliability while pursuing maximum change velocity, improving the quality of your SLOs lets you judge this balance better, which will help you guide the strategy of your application development and operation.

Summary

SLIs are metrics that measure service performance and help you detect when your users’ experience starts to suffer. Tracking SLIs gives you measurable insights that ultimately help you improve the customer’s experience.

When selecting SLIs, engineers must consider when and how to measure aspects of their service that are critical to user journeys. Enhance accuracy of SLI measurements by selecting implementations that are as close to the customer as possible. You may consider the following five common strategies for measuring SLIs:

  • Gathering SLI metrics from processing server-side request logs

  • Using application-level metrics that capture the performance of individual requests

  • Looking at your frontend load balancing infrastructure to measure interactions involving user requests to which your server responds

  • Using synthetic clients to measure SLIs

  • Implementing client-side instrumentation

With your SLIs identified, you can set achievable and aspirational SLOs, a target level of performance for an aspect of your service. When the SLI is above the SLO threshold, you know customers are happy. If it falls below the target, your customers are typically unhappy.

Now that you have well-defined SLOs and SLIs, you are ready for the next phase of the SRE journey—applying SLOs to error budgets. Using an SLO and error-budget approach to managing your service will unlock the full benefits of SRE methodology.

1 Adrian Hilton and Yaniv Aknin, “Tune Up Your SLI Metrics: CRE Life Lessons,” Google Cloud Platform, January 30, 2019, https://cloud.google.com/blog/products/management-tools/tune-up-your-sli-metrics-cre-life-lessons.

2 For more information on this topic, refer to Theo Schlossnagle (@postwait), “If I could share just one thing about SLOs to frame your appreciation for them as well as your discipline around them, it would be the content on this slide,” Twitter, January 9, 2020, 11:50 a.m., https://twitter.com/postwait/status/1215345069668085761.
