Chapter 4. Choosing Good Service Level Objectives

Every system will fail at some point. Sometimes systems will fail in catastrophic ways, and other times they can fail in ways that are barely noticeable. System failures can be lengthy or last just fractions of a second. Some system failures require human intervention in order to get things back into a good state, while in other cases the system may start operating correctly again all by itself.

Chapter 3 discussed how to think about measuring if a service is doing what it is supposed to be doing, and from that framing we can define failure as when a service is not doing what it is supposed to be doing. A service failure does not have to mean there is an emergency. Things as simple as an API not sending a response quickly enough or the click of a button on a web form not registering are both failures. Failure happens all the time, because complex systems are fallible. This is all totally fine and expected and shouldn’t stress you out.

Problems only arise when failures occur too often or last too long, and that’s what service level objectives are all about. If a service level indicator gives you a good way to think about whether your service is performing in the manner it should be, a service level objective gives you a good way to think about whether your service is doing so often enough.

We’ve established that you can’t be perfect, but how good should you try to be instead? This chapter looks to help you figure that out. First, we’ll talk about what SLO targets really are and why it’s important to choose them to the best of your ability. Second, we’ll spend a little bit of time talking about service components and dependencies and how to take these into consideration when setting SLOs. After that, we’ll get into some of the ways you can use data to help you pick these targets, including an introduction to the basic statistics you’ll need in order to do so.

Reliability Targets

Fundamentally, SLOs are targets: they’re a defined set of criteria that represent an objective that you’re trying to reach. Good SLOs generally have two traits in common:

  1. If you are exceeding your SLO target, your users are happy with the state of your service.

  2. If you are missing your SLO target, your users are unhappy with the state of your service.

But what exactly do we mean by user happiness?

User Happiness

When we talk about user happiness in terms of service reliability, we’re mostly appealing to the contentment aspect of happiness. It’s not necessarily the case that the users of your service have to be actively and consciously overjoyed with their experience in order for them to be happy. For some people, it might be easier to think about it in terms of your users not being unhappy.

At some level, these ideas come from the concept that you need satisfied users in order to have a growing business. Reliability is often the feature that determines whether people choose your service over another one. Chances are that one of the goals of your service is to attract more users, even if you aren’t strictly a business. Being reliable, and thinking about the happiness of your users, is a major component of this.

This is also applicable to services that do not strictly serve customers. For example, if you’re in charge of the database offering for your organization, and your offering is seen as too unreliable by other engineers, they’re going to find ways to work around this. They might spin up their own database instances when they really shouldn’t, or they might try to solve data storage problems in a suboptimal manner that doesn’t involve a database at all.

We could also imagine an internal service that users can’t find a workaround for. Perhaps you maintain the Kubernetes layer at your organization. If users of this service (your fellow engineers) are too unhappy about its reliability, they’ll eventually get fed up and find some way to move to a different service—even if that means actually leaving the company.

You want to make sure that you’re reliable, and you want to make sure that your users are happy. Whatever targets you choose, they have to be ones that keep this in mind.

The Problem of Being Too Reliable

That all being said, you also don’t want to be too reliable. There are a few reasons for this.

Imagine, for example, that you’ve chosen an SLO target percentage of 99.9%. You’ve done a lot of due diligence and followed the advice in this book in order to determine that this is the right objective for you. As long as you’re exceeding this 99.9%, users aren’t complaining, they aren’t moving elsewhere, and your business is growing and doing well.

Additionally, if you miss this target by just a little bit, you likely won’t immediately hemorrhage users. This is ideal, since it gives you time to say, “We’ve missed our target, so now we need to refocus our efforts to ensure we stay above it more often.” You can use the data that your SLO provides you in order to make decisions about the service and the work you’re performing.

However, let’s now imagine that you’re routinely being 99.99% reliable instead of just hitting your 99.9% target. Even if your SLO is published and discoverable, people are going to end up expecting that things will continue to be 99.99% reliable, because humans generally expect the future to look like the past.1 Even if it was true that in the past everyone was actually happy with 99.9%, their expectations have now grown. Sometimes this is absolutely fine. Services and products can mature over time, and providing your users with a good experience is never a bad idea.

So maybe you make your official target more stringent, and now you aim for 99.99%. By doing so you’re giving yourself fewer opportunities to fail but also fewer opportunities to learn. If you’re being too reliable all the time, you’re also missing out on one of the fundamental features that SLO-based approaches give you: the freedom to do what you want. Being too reliable means missing out on opportunities to experiment, perform chaos engineering, ship features more quickly than you have before, or even just induce structured downtime to see how your dependencies react—in other words, a lot of ways to learn about your systems.

Additionally, you need to think about the concept of operational underload. People learn how to fix things by doing so. Especially in complex systems, you can learn so much from failures. There is almost no better way to learn about how systems work than to respond to them when they aren’t performing how they’re supposed to. If things never fail, you’ll be missing out on all of that.

Tip

Chapter 5 goes into much more detail about how to use error budgets, but ensuring you don’t lose insight into how your services work by inducing failure or allowing it to occur is one of the main components at play. If your users and your business only need you to be 99.9% reliable, it is often a good idea to make sure you’re not far beyond that. You’ll still want to make sure that you’re able to handle unforeseen issues, but you can set appropriate expectations as well as provide useful learning opportunities if you make sure you’re not too reliable all the time. Pick SLO target percentages that allow for all of this to be true when you can.

The Problem with the Number Nine

In addition to the desire to be too reliable, there is another problem you can run into when picking the correct SLO for your service. When people talk about SLOs and SLAs, they most often think about things in terms of “nines.”

Even if you don’t want to aim for 100% reliability, you do almost always want to be fairly reliable, so it’s not surprising that many common reliability targets are very close to 100%. The most common numbers you might run into are things like 99%, 99.9%, 99.99%, or even the generally unattainable 99.999%.2 These targets are so common, people often even refer to them as just “two nines,” “three nines,” “four nines,” and “five nines.”

Table 4-1 shows what these targets actually look like in terms of acceptable bad time.3

Table 4-1. SLO targets composed of nines translated to time
Target Per day Per month Per year
99.999% 0.9 s 26.3 s 5 m 15.6 s
99.99% 8.6 s 4 m 23 s 52 m 35.7 s
99.9% 1 m 26.4 s 43 m 49.7 s 8 h 45 m 57 s
99% 14 m 24 s 7 h 18 m 17.5 s 3 d 15 h 39 m
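
If you want to double-check these numbers yourself, the arithmetic is just the allowed failure fraction multiplied by the length of the window in question. Here is a minimal Python sketch of that conversion, assuming a 365.25-day year (and a month of one-twelfth of that) to stay close to the calculations behind the table:

```python
# Convert an SLO target into allowed "bad time" per day, per month, and per
# year. Assumes a 365.25-day year (to account for leap years) and a month of
# 365.25 / 12 days, roughly matching the assumptions behind Table 4-1.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_MONTH = SECONDS_PER_DAY * 365.25 / 12
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25

def allowed_bad_seconds(target, window_seconds):
    """Seconds of unreliability permitted by `target` over the window."""
    return (1 - target) * window_seconds

for label, target in [("99.999%", 0.99999), ("99.99%", 0.9999),
                      ("99.9%", 0.999), ("99%", 0.99)]:
    print(f"{label}: {allowed_bad_seconds(target, SECONDS_PER_DAY):.1f} s/day, "
          f"{allowed_bad_seconds(target, SECONDS_PER_MONTH):.1f} s/month, "
          f"{allowed_bad_seconds(target, SECONDS_PER_YEAR):.1f} s/year")
```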

Not only can hitting these long strings of nines be much more difficult and expensive than people realize, but there is also a general problem where people only think about SLO targets as comprising series of the number nine, when in reality this doesn’t make any sense at all. Picking the right target for your service involves thinking about your users, your engineers, and your resources—it shouldn’t be arbitrarily constrained in this way.

You might also see targets such as 99.95% or 99.98%, and including these is certainly an improvement over using only the number nine, but even here you’re not always allowing yourself enough nuance to describe the requirements of your exact service.

There is absolutely nothing wrong with having an SLO defined as having a target of something like 99.97%, 98.62%, or even 87%. You can address having low target percentages by using percentiles in your definition—we’ll talk more about that later in this chapter—but you should also make sure you aren’t tied to thinking about these targets just in terms of the number nine.

Table 4-2 shows some other options and what amounts of bad time those translate into.

Table 4-2. SLO targets composed of not-just-nines translated to time
Target Per day Per month Per year
99.95% 43.2 s 21 m 54.9 s 4 h 22 m 58.5 s
99.7% 4 m 19.2 s 2 h 11 m 29.2 s 1 d 2 h 17 m 50.9 s
99.3% 10 m 4.8 s 5 h 6 m 48.2 s 2 d 13 h 21 m 38.7 s
98% 28 m 48 s 14 h 36 m 34.9 s 7 d 7 h 18 m 59 s

That’s not to say you should be aiming at a lower target if you don’t have a reason to do so, but the difference between 99.9% and 99.99% (or something similar) is often much greater than people realize at first. You should be looking at the numbers in between as well.

Sometimes it’s helpful to start with a time rather than a percentage. For example, it might be reasonable (or even required due to the daily downtime of your dependencies, locking backups taking place, and so on) to want to account for about two hours of unreliability per month. In that case 99.7% would be the correct starting point, and you could move on from there after seeing how you perform at that target for some time. Some of the most useful SLOs I have personally worked with have been set at carefully measured numbers like 97.2%, and there is nothing wrong with that. Later in this chapter we’ll discuss in more depth how to do this math and make these measurements.
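
Going in that direction is just as simple: start from the amount of unreliability you know you need to tolerate and derive a target from it. A minimal sketch of that math, using the roughly-two-hours-per-month example above and the same month length as before:

```python
# Derive a starting SLO target from an amount of unreliability you need to
# tolerate. Uses the roughly two-hours-per-month example from the text and
# assumes a month of 365.25 / 12 days.

SECONDS_PER_MONTH = 24 * 60 * 60 * 365.25 / 12

def target_for_bad_time(bad_seconds, window_seconds):
    """The highest target you could meet while tolerating `bad_seconds`."""
    return 1 - (bad_seconds / window_seconds)

tolerated = 2 * 60 * 60  # about two hours of unreliability per month
print(f"{target_for_bad_time(tolerated, SECONDS_PER_MONTH):.4%}")
# 99.7262% -- so 99.7% is a sensible starting point
```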

The Problem with Too Many SLOs

As you start on your journey toward an SLO-based approach to reliability, it might be tempting to set a lot of SLOs for your services. There is no correct number of SLOs to establish, and the number that will be correct for you will heavily depend on both how complex your services are and how mature the SLO culture in your organization is.

While you do want to capture the most important features of your system, you can often accomplish this by measuring only a subset of these features. Always ask yourself what your users need, and start by observing the most important and common of these needs. SLOs are a process, and you can always add (or remove!) them at any point that makes sense.

When the number of SLOs you have grows to be too large, you’ll run into a few particular problems. First, it will be more difficult to make decisions using your data. If you view SLOs as providing you with data you can use to make meaningful choices about how to improve your service, having too many divergent data points can result in these decisions being harder to make. Imagine for example a storage service with a simple caching layer. It might not be necessary to have separate SLOs for both cache miss latency and cache hit latency for reads. You’ll certainly still want to be collecting metrics on both, but you might just be adding noise if you have entirely independent SLOs for each. In this situation you could just have an SLO for general read latency, and if you start performing badly against your target, you can use your separate metrics to determine where the problem lies—hits, misses, or both—and what you need to address to make things better.

The second problem you can run into is that it becomes more complicated to report to others what the reliability status of your service has been. If you can provide someone outside of your team with the status of three to five SLOs over time, they can probably infer from that data both how your service has been running and how they could set their own targets if they depend on it. If they have to sort through dozens of SLOs, perhaps all with different target percentages and histories, you’re not benefiting from one of the elements that this whole process is about: communicating your reliability to others in an easy-to-understand way.

More generally, there are statistical issues that arise with too many measurements. The multiple comparisons problem, at its most basic, tells us that the more things you measure and compare, the greater the chance that some of them will look significant purely by chance. And even when the measurements are actually correct, if you’re looking at too many things you’ll always find something that looks just slightly off, which can waste your time by sending you down endless rabbit holes.

Note

Every system is unique, and there is no perfect answer to the question of how many SLOs you should define for any particular service. As with everything, try to be reasonable. SLOs are about providing you data to have discussions about, and you can’t do that if you have too many data points to discuss.

Service Dependencies and Components

No service stands alone; everything depends on something else. Microservices often have downstream dependencies such as other microservices and databases. Services that appear to be mostly standalone will always have upstream dependencies, such as load balancers, routers, and the network in general. In both of these situations, these services will be dependent upon a compute layer of some sort, be that a container orchestration layer, a virtual machine infrastructure, or an operating system running on a bare-metal physical machine.

And we can go much deeper than that. An operating system running on a physical machine is dependent on that physical machine, which is dependent on things like power circuits, which are dependent on delivery from an electrical substation, and so forth. We could continue down this path virtually infinitely.

Because everything has many dependencies, it also turns out that services often have many components. Complex computer systems are made up of deep interwoven layers of service dependencies, and before you can set appropriate SLO targets, you need to understand how the various components of your services interact with each other.

Service Dependencies

When thinking about what kind of objective you can set for your service, you have to think about the dependencies your service has. There are two primary types of service dependencies. First are the hard dependencies. A hard dependency is one that has to be reliable for your service to be reliable. For example, if your service needs to read from a database in order to do what it is supposed to do, it cannot be reliable if that database isn’t. Second are soft dependencies. A soft dependency is something that your service needs in order to operate optimally but that it can still be reliable without. Converting your hard dependencies into soft ones is one of the best steps you can take to make your service more reliable.

To choose a good service level objective, you have to start by examining how reliable your dependencies are. There’s some simple math you can do to calculate the effect they have on the reliability your service can offer; I’ll show you that after we dig a little more deeply into the issues of hard and soft dependencies.

Hard dependencies

Understanding the effect your known hard dependencies have on your service is not an overly complicated ordeal.4 If the reliability of your service directly depends on the reliability of another service, your service cannot be any more reliable than that one is. There are two primary ways you can determine the reliability of a hard dependency.

The first is just to measure it. To continue with our database example, you can measure how many requests to this database complete without a failure—whether that be without an error or timeout, or quickly enough—directly from your own service. You don’t have to have any kind of administrative access to the database to understand how it works from your perspective. In this situation, you are the user, and you get to determine what reliable means. Measure things for a while, and use the result to determine what kind of reliability you might be able to expect moving into the future.
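
Measuring a dependency this way doesn’t require anything sophisticated: count the calls your service makes to it, and count how many of them come back both successfully and quickly enough. The sketch below shows one way you might wrap an existing client call to do that; the 0.5-second threshold is just an illustrative stand-in for whatever “quickly enough” means for your service:

```python
import time

# Measure a hard dependency's reliability as observed from your own service:
# count every call you make to it, and count how many come back both
# successfully and quickly enough.

LATENCY_THRESHOLD_S = 0.5   # illustrative threshold for "quickly enough"

total_calls = 0
good_calls = 0

def call_dependency(make_call, *args, **kwargs):
    """Wrap an existing dependency call and record whether it was a good event."""
    global total_calls, good_calls
    total_calls += 1
    start = time.monotonic()
    try:
        result = make_call(*args, **kwargs)
    except Exception:
        return None                              # errors count as bad events
    if time.monotonic() - start <= LATENCY_THRESHOLD_S:
        good_calls += 1                          # fast *and* successful
    return result

def observed_reliability():
    return good_calls / total_calls if total_calls else 1.0
```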

The second, and more meaningful, way is to look at the published SLOs and reliability history of your dependencies, if they have them and they’re shared with users. If the team responsible for the database you depend upon has internalized the lessons of an SLO-based approach, you can trust them to publish reasonable SLO targets. You can trust that team to take action if their service starts to exceed its error budget, so you can safely set your target a little lower than theirs.

Soft dependencies

Soft dependencies are a little more difficult to define than hard dependencies, and they also vary much more wildly in how they impact the reliability of your service. Hard dependencies are pretty simple to define and locate, and if a hard dependency isn’t being reliable—whether it’s entirely unavailable or just responding slowly—your service isn’t being reliable during that time frame, either.

Soft dependencies, however, don’t have this same one-to-one mapping. When they’re unreliable the reliability of your service may be merely impacted, not nullified. A good example is services that provide additional data to make the user experience more robust, but aren’t strictly required for it to function.

For example, imagine a maps application on your phone. The primary purpose of such an application could be to display maps of your immediate surroundings, show what businesses or addresses are located where, and help you orient yourself. The application might also allow you to overlay additional data such as traffic congestion, user reviews of restaurants, or a satellite view. If the services that provide this traffic, user review, or satellite maps data aren’t operating reliably, it certainly impacts the reliability of the maps application, but it doesn’t make it wholly unreliable since the application can still perform its primary functions.

Turning hard dependencies into soft dependencies

One of the best things you can do in terms of making your service more reliable is to remove the hard dependencies it might have. Removing hard dependencies is not often a viable option, however, so in those situations you should think about how you might at least be able to turn them into soft dependencies instead.

For instance, going back to our database example, you might be able to introduce a caching layer. If much of the data is similar—or it doesn’t necessarily have to be up-to-date to the second—using a cache could allow you to continue to operate reliably from the perspective of your users even if there are failures happening on the backend.

The topic of turning hard dependencies into soft ones is way too large for this book, but remember to think about this as you determine your SLO targets and use this as an example of the kind of work you could perform to increase the reliability of your service.

Dependency math

Perhaps the most important part of thinking about your service dependencies is understanding how to perform the math you need in order to take their reliability into account. You cannot promise a better reliability target than the things you are dependent on.

Most services aren’t just individual pieces that float around in an empty sea. In a world of microservices where each might have a single team assigned to it, these services work together as a collective to comprise an entirely different service, which may not have a dedicated team assigned to it. Services are generally made up of many components, and when each of those components has its reliability measured—or its own SLO defined—you can use that data to figure out mathematically what the reliability of a multicomponent service might look like.

An important takeaway for now is how quickly a reasonable reliability target can erode in situations such as this. For example, let’s say your service is a customer-facing API or website of some sort. A reasonably modern version of a service such as this could have dozens and dozens of internal components, from container-based microservices and larger monoliths running on virtual machines, to databases and caching layers.

Imagine you have 40 total components, each of which promises a 99.9% reliability target and has equal weight in terms of how it can impact the reliability of the collective service. In such situations, the service as a whole can only promise much less than 99.9% reliability. Performing this math is pretty simple—you just multiply 99.9% by itself 40 times:

0.999^40 = 0.96077021

So, 40 service components each running at 99.9% reliability can only ever promise about 96% reliability for the service they make up. This math is, of course, overly simplistic compared to what you might actually see in terms of service composition in the real world, and Chapter 9 covers more complicated and practical ways to perform these kinds of calculations. The point for now is to remember that you need to be reasonable when deciding how stringent you are with your SLOs—you often cannot actually promise the reliability that you think or wish you could. Remember to stay realistic.
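
If you want to experiment with this math yourself, the calculation is just a product of the individual targets. A minimal sketch:

```python
# How much reliability can a service promise if it is built from N hard
# dependencies, each with its own target? With equal weighting, the answer
# is simply the product of the individual targets.

def composite_reliability(component_targets):
    result = 1.0
    for target in component_targets:
        result *= target
    return result

# 40 components, each promising 99.9%
print(f"{composite_reliability([0.999] * 40):.8f}")   # 0.96077021
```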

Service Components

As we’ve seen (for example, in the case of the retail website discussed in the previous chapter), a service can be composed of multiple components, some of which may themselves be services of different types. Such services generally fall into two categories: those whose components are owned by multiple teams, and those whose components are all owned by the same team. Naturally, this has implications when it comes to establishing SLIs and SLOs.

Multiple-team component services

When a service consists of many components that are owned by multiple teams, there are two primary things to keep in mind when choosing SLIs and SLOs.

The first is that even if SLOs are set for the entire service, or a subset of multiple components, each team should probably have SLIs and SLOs for its own components as well. The primary goal of SLO-based approaches to reliability is to provide you with the data you need to make decisions about your service. You can use this data to ask questions like: Is the service reliable enough? Should we be spending more time on reliability work as opposed to shipping new features? Are our users happy with the current state of the world? Each team responsible for a service needs to have the data to consider these questions and make the right decisions—therefore, all the components of a service owned by multiple teams should have SLOs defined.

The second consideration about services owned by many teams is determining who owns the SLOs that are set for the overarching service, or even just subsets of that service. Chapter 15 addresses the issue of ownership in detail.

Single-team component services

For services that consist of multiple components that are all owned by a single team, things can get a bit more variable. On the one hand, you could just apply the lessons of the multiple-team component services and set SLOs for every part of your stack. This is not necessarily a bad idea, but depending on the size or complexity of the service, you could also end up in a situation where a single team is responsible for and inundated by SLO definitions and statuses for what realistically is just a single service to the outside world.

When a single team owns all the components of what users think of as a service, it can often be sufficient to just define meaningful SLIs that describe enough of the user journeys and set SLOs for those measurements. For example, if you’re responsible for a logging pipeline that includes a message queue, an indexing system, and data storage nodes, you probably don’t need SLOs for each of those components. An SLI that measures the latency between when a message is inserted and when it is indexed and available for querying is likely enough to capture most of what your users need from you. Add in another SLI that ensures data integrity, and you’ve probably got most of your users’ desires covered. Use those kinds of SLIs to set the minimum number of SLO targets you actually need, but remember to also use telemetry that tells you how each component is operating to help you figure out where to apply your reliability efforts when your data tells you to do so.

Reliability for Things You Don’t Own

The classic example of how SLOs work involves a dichotomy between a development team and an operational team, both responsible for the same service in different ways. In this prototypical example, the development team wants to move fast and ship features, and the operations team wants to move slowly to ensure stability and reliability. SLOs are a way to ease this tension, as they give you data specifically aimed at determining when to move fast and when to move slow. This is the basic foundation of Site Reliability Engineering.

If your team or service doesn’t fit into this model, however, that doesn’t mean you can’t adopt an SLO-based approach. If the service your team supports is open source, proprietary from a vendor, or hardware, you can’t really use the primary example of “stop shipping features and focus on reliability code improvements instead”—but that doesn’t mean you can’t shift your focus to reliability. You just have to do it in a slightly different manner.

Open Source or Hosted Services

If you’re relying on open source software for your infrastructure, as many companies do, you can still make changes to improve reliability—it’s just that the changes you make are not always directly applicable to the code at the base of things. Instead, they’re likely things like configuration changes, architecture changes, or changes to in-house code that complements the service in some way. This isn’t to say that these sorts of changes don’t also apply to services for which you own the entire codebase—just that the classic examples of how SLO-based approaches work often overlook them.

Additionally, you might be reliant on software that is entirely hosted and managed. This can make reliability approaches even more difficult, because in these situations there may not be many configuration or architecture changes you can make. Instead, when thinking about SLOs for these sorts of services, you might start with a baseline that represents the amount of failure a user can tolerate and use this data to justify either renewing a contract or finding a new vendor that can meet your needs.

Measuring Hardware

Chances are there are many different hardware components you might like to observe and measure, but it’s not often worth your time unless you’re operating at a certain scale. Commercial and enterprise-grade computer hardware is generally already heavily vetted and measured by the manufacturers, and you often cannot develop a system of measurement with enough meaningful data points unless you are either a hardware development company, a telco/internet service provider, or one of the largest web service providers. Remember that unless you introduce complicated math to normalize your data, you generally need quite a few data points in order to ensure that your SLIs aren’t triggered only by outliers.

That all being said, you don’t have to operate at the scale of a telco or one of the largest tech companies to meaningfully measure the failure rates or performance of your hardware. For example, imagine you’re responsible for 2,000 servers in various data centers across the planet. Though the number 2,000 isn’t necessarily very large when it comes to statistical analysis, the numbers derived from it could be. You might have 8 hard drives or DIMMs per server, which gives you 16,000 data points to work with. That might be enough for you to develop meaningful metrics about how long your hardware operates without a fault.

Another option is to get aggregated data from other sources, and then apply those same metrics to your own hardware. It can be difficult to get failure rate data from vendors directly, but many resellers collect this data and make it available to their paying customers. You can use this sort of information to help you anticipate the potential failure rates of your own hardware, allowing you to set SLOs that can inform you when you should be ordering replacements or when you should be retiring old systems.

In addition to reseller vendors, there are other aggregated sources of data about hardware failure. For example, Backblaze, a major player in the cloud storage space, releases reports every year about the failure rates of the various hard drive makes and models it employs.

The point is that if you don’t have a large enough footprint to use your own measurements to develop statistically sound numbers, you can rely on those who have done this aggregation for you. We’ll also be discussing statistical models you can use to meaningfully predict things you only have sporadic data for in Chapter 9.

But I am big enough!

Of course, you might work for a company that operates at such a scale that you can measure your own hardware failure rates easily. Perhaps you’re even a hardware manufacturer looking to learn about how you can translate failure data into more meaningful data for your customers!

If you have a lot of data, developing SLO targets for your hardware performance doesn’t really deviate from anything discussed elsewhere in this book. You need to figure out how often you fail, determine if that level is okay with the users of your hardware, and use that data to set an SLO target that allows you to figure out whether you’re going to lose users/customers due to your unreliability or not.

If your SLO targets tell you that you’re going to lose users, you need to immediately pivot your business to figuring out how to make your components more reliable or you are necessarily going to make less money.

The point is that even as the provider of the bottom layer of everything computers rely upon, you’re likely aware that you can’t be perfect. You cannot deliver hardware components to all of your customers that will function properly all of the time. Some of these components will eventually fail. Some will even be shipped in a bad state. Know this and use this knowledge to make sure you’re only aiming to prevent the correct amount of failures. You’ll never prevent 100% of them, so pick a target that doesn’t upset your users and that you won’t have to spend infinite resources attempting to attain.

Beyond just hardware

In an absolutely perfect world, all SLOs would be built from the ground up. Since anything that is dependent on another system cannot strictly be more reliable than the one it depends on, it would be ideal if each dependency in the chain had a documented reliability target.

For example, power delivery is required for all computer systems to operate. So, perhaps the first step in your own reliability story is knowing how consistently reliable levels of electricity are being delivered to the racks that your servers reside in. Then you have to consider whether those racks have redundant circuits providing power. Then you have to consider the failure rates of the power supply units that deliver power to the other components of your servers. This goes on and on.

Tip

Don’t be afraid of applying the concepts outlined in this book to things that aren’t strictly software-based services. In fact, remember from the Preface that this same approach can likely be applied to just about any business. Chapter 5 covers some of the ways in which you can use SLOs and error budgets to address human factors.

Choosing Targets

Now that we’ve established that you shouldn’t try to make your target too high, and that your target doesn’t have to be composed of just the number nine many times in a row, we need to talk about how you can pick the correct target.

The first thing that needs to be repeated here is that SLOs aren’t SLAs—they aren’t agreements. When you’re working through this process, you should absolutely keep in mind that your SLO should encompass things like ensuring your users are happy and that you can make actionable decisions based upon your measured performance against this SLO; however, you also need to remember that you can change your SLO if the situation warrants it. There is no shame in picking a target and then changing it in short order if it turns out that you were wrong. All systems fail, and that includes humans trying to pick magic numbers.5

Past Performance

The best way to figure out how your service might operate in the future is studying how it has operated in the past. Things about the world and about your service will absolutely change—no one is trying to deny that. But if you need a starting point in order to think about the reliability of your service, the best starting point you’ll likely have is looking at its history. No one can predict the future, and the best alternative we have is extrapolating from the past.

Note

You may or may not want to discount previous catastrophes here. Severe incidents are often outlier events that you can learn important lessons from, but are not always meaningful indicators in terms of future performance. As always, use your own best judgment, and don’t forget to account for the changes in the robustness of your service or the resilience of your organization that may have come from these lessons learned.

All SLOs are informed by SLIs, and when developing your SLIs, it will often be the case that you’ll have to come up with new metrics. It will sometimes be the case that you might need to collect or export data in entirely new ways, but other times you might determine that your SLI is a metric you’ve already been collecting for some amount of time.

No matter which of these is true, you can use this SLI to help you pick your SLO. If it’s a new metric, you might have to collect it for a while first—a full calendar month is often a good length of time for this. Once you’ve done that, or if you already have a sufficient amount of data available, you can use that data about your past performance to set your first SLO. Some basic statistics will help you do the math.

Tip

Even if you have a solid grasp of basic statistics, you might still find value in reading about how to use these techniques within an SLO-specific context. Chapter 9 covers more advanced statistical techniques.

Basic Statistics

Statistical approaches can help you think about your data and your systems in incredibly useful ways, especially when you already have data you can analyze. We’ll go into much more depth on the math for picking SLO targets in various chapters in Part II, but this section presents some basic and approachable techniques you can use to analyze the data you have available to you for this purpose. For some services, you might not even need the more advanced techniques described in future chapters, and you might be able to rely mostly on the ones outlined here.

That being said, while we’ll tie basic statistical concepts to how they relate to SLOs in the next few pages, those who feel comfortable with the building blocks of statistical analysis can skip ahead to “Metric Attributes”.

The five Ms

Statistics is a centuries-old discipline with many different uses, and you can leverage the models and formulae developed by statisticians in the past to help you figure out what your data is telling you. While some advanced techniques will require a decent amount of effort to apply correctly, you can get pretty good insight into how an SLI is performing, and therefore what your SLO should look like, with basic math.

The building blocks of statistical analysis are five concepts that all begin with the letter M: min, max, mean, median, and mode. These are not complicated concepts, but they can give you excellent insight into time series–based data. In Table 4-3 you can see an example of a small time series dataset (known as a sample, indicating that it doesn’t represent all data available but only some portion of it).

Table 4-3. Time series sample
Time 16:00 16:01 16:02 16:03 16:04 16:05 16:06 16:07 16:08 16:09
Value 1.5 6 2.4 3.1 21 9.1 2.4 1 0.7 5

When dealing with statistics it’s often useful to have things sorted in ascending order, as shown in Table 4-4, so while the time window from which you’ve derived your sample is important for later context, you can throw it out when performing the statistics we’re talking about here.

Table 4-4. Time series sample in ascending order
Value 0.7 1 1.5 2.4 2.4 3.1 5 6 9.1 21

The min value of a time series is the minimum value observed, or the lowest value. The max value of a time series is the maximum value observed, or the highest value. These are pretty easily understood ideas, but it’s important that you use them when looking at SLI data in order to pick proper SLOs. If you don’t have a good understanding of the total scope of possibilities of the measurements you’re making, you’ll have a hard time picking the right target for what these measurements should be. Looking at Table 4-4, we can see that the min value of our dataset is 0.7 and the max value is 21.

The third M word you need to know is mean. The mean of a dataset is its average value, and the words mean and average are generally interchangeable. A mean, or average, is the value that occurs when you take the sum of all values in your dataset and divide it by the total number of values (known as the cardinality of the set). We can compute the mean for our time series via the following equation:

(0.7 + 1 + 1.5 + 2.4 + 2.4 + 3.1 + 5 + 6 + 9.1 + 21) / 10 = 5.22

Note

There is nothing terribly complicated about computing a mean, but it provides an incredibly useful and simple insight into the performance of your SLI. In this example, we now know that even though we had a min value of 0.7 and a max value of 21, the average during the 10 minutes of data that we’re analyzing was 5.22. This kind of data can help you pick better thresholds for your SLOs. Calculating the mean value for a measurement is more reliable than looking at a graph and trying to eyeball what things are “normally” like.

The fourth M word is median. The median is the value that occurs right in the middle. In our case, we are looking at a dataset that contains an even number of values, so there is no exact middle value. The median of the data in situations like this is the mean of the two middle values. In our case this would be the 5th and 6th values, or 2.4 and 3.1, which have a mean of 2.75.

The median gives you a good way to split your data into sections. It’ll become clearer why that is useful when we introduce percentiles momentarily, but what should hopefully be immediately clear is that the mean for this data is higher than the median value. This tells you that you have more values below your average than you have above it—in this case, 7 values compared to 3—which lets you know that the higher-value observations happen less frequently, and that they contain outliers. Knowing about outliers can help you think about where to set thresholds in terms of what might constitute a good observation versus a bad one for your service. Sometimes these outliers are perfectly fine in the sense that they don’t cause unhappy users, and other times they can be indicative of severe problems, but at all times outliers are worth investigating more to know which category they fit into.

The fifth and final M word is mode. The mode of a dataset is the value that occurs most frequently. In our example dataset the mode is 2.4, because it occurs twice and all the other values occur only once. When no value occurs more than once, there is no mode. When multiple values are tied for the most occurrences, the dataset is said to be multimodal. The concept of counting the occurrences of values in a sample is very important, but is much better handled via things like frequency distributions and histograms, which are introduced in Chapter 9. The mode is only included here for the sake of completeness in our introduction to statistical terminology.
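
Python’s standard library can compute all five of these values directly, which is a convenient way to check the arithmetic above. A minimal sketch using the sample from Table 4-3:

```python
import statistics

# The sample from Table 4-3 (the timestamps aren't needed for this math).
sample = [1.5, 6, 2.4, 3.1, 21, 9.1, 2.4, 1, 0.7, 5]

print("min:   ", min(sample))                  # 0.7
print("max:   ", max(sample))                  # 21
print("mean:  ", statistics.mean(sample))      # 5.22
print("median:", statistics.median(sample))    # 2.75
print("mode:  ", statistics.mode(sample))      # 2.4
```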

Ranges

Another important basic statistical concept is that of a range, which is simply the difference between your max value and your min value; it lets you know how widely distributed your values are. In our sample (Table 4-4), the min value is 0.7 and the max value is 21. The math to compute a range is just simple arithmetic:

21 - 0.7 = 20.3

Ranges give you a great starting point in thinking about how varied your data might be. A large range means you have a wide distribution of values; a small range means you have a slim distribution of values. Of course, what wide or slim could mean in your situation will also be entirely dependent on the magnitude of the values you’re working with.

While ranges give you a great starting point in thinking about how varied your dataset is, you’ll probably be better served by using the concept of deviations. Deviations are more advanced ways of thinking about how your data is distributed; Chapter 9 talks about how to better think about the distribution, or variance, of your data.

Percentiles

A percentile is a simple but powerful concept when it comes to developing an understanding of SLOs and what values they should be set at. When you have a group of observed values, a percentile is a measure that allows you to think about a certain percentage of them. In the simplest terms, it gives you a way of referring to all the values that fall at or below a certain percentage value in a set.

For example, for a given dataset, the 90th percentile will be the threshold at which you know that all values below the percentile are the bottom 90% of your observations, and all values above the percentile are the highest 10% of your observations.

Using our example data from earlier, values falling within the 90th percentile would include every value except the 10th one. When working with percentiles you’ll often see abbreviations in the form PX, where X is the percentile in question. Therefore, the 90th percentile will often be referred to as the P90. If you wanted to isolate the bottom 50% of your values, you would be talking about values below the 50th percentile, or the P50 (which also happens to be the median, as discussed previously). While percentiles can be useful at almost any value, depending on your exact data, there are also some common levels at which they are inspected. You will commonly see people analyzing data at levels such as the P50 (the median), P90, P95, P98, and P99, and even down to the P99.9, P99.99, and P99.999.

When developing SLOs, percentiles serve a few important purposes. The first is that they give you a more meaningful way of isolating outliers than the simpler concept of medians can. While both percentiles and medians split your data into two sets—below and above—percentiles let you set this division at any level. This allows you to split your data in many different ways. You can use the same dataset and analyze the P90, P95, and P99 independently. This kind of thinking allows you to address the concept of a long tail, which is where a small fraction of your observations skews far away in magnitude from the rest.

The second way that percentiles are useful in analyzing your data for SLOs is that they can help you pick targets in a very direct manner. For example, let’s say that you calculate the P99 value for a month of data about successful database transaction times. Once you know this threshold, you now also know that if you had used it as your SLO target, you would have been reliable 99% of the time over your analyzed time frame. Assuming performance will be similar in the future, you could aim for a 99% target moving forward as well.
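
If your metrics system doesn’t compute percentiles for you, a library such as NumPy can. A small sketch using the same sample as before; note that with a dataset this small, different interpolation methods will give slightly different answers:

```python
import numpy as np

# The same sample used in the five Ms discussion.
sample = [1.5, 6, 2.4, 3.1, 21, 9.1, 2.4, 1, 0.7, 5]

for p in (50, 90, 95, 99):
    print(f"P{p}: {np.percentile(sample, p):.2f}")

# P50 is 2.75 (the median). Everything at or below the P90 already excludes
# the single 21-value outlier, which is exactly the kind of separation you
# want when deciding where a "good" observation ends.
```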

Note

Calculating common percentiles based upon your current data is a great starting point in choosing an initial target percentage. Not only do they help you identify outliers that don’t reflect the general user experience, but your metrics data will sometimes simply be more useful to analyze with those outliers removed. Another common way to achieve these results and better understand your data involves histograms, which we’ll discuss in Chapter 9.

Metric Attributes

A lot of what will go into picking a good SLO will depend on exactly what your metrics themselves actually look like. Just like everything else, your metrics—and therefore your SLIs—will never be perfect. They could be lacking in resolution, because you cannot collect them often enough or because you need to aggregate them across many sources; they could be lacking in quantity, since not every service will have meaningful data to report at all times; or they could be lacking in quality, perhaps because you cannot actually expose the data in the exact way that you wish you could.

Chapter 1 talked about how an SLI that allows you to report good events and total events can inform a percentage. Though this is certainly not an incorrect statement, it doesn’t always work that way in the real world—at least not directly. While it might be simplest to think about things in terms of good events over total events, it’s not often the case that your metrics actually correlate to this directly. Your “events” in how this math works might not really be “events” at all.

Resolution

One of the most common problems you’ll run into when figuring out SLO target percentages revolves around the resolution of your metrics. If you want a percentage informed by good events over total events, what do you do if you only have data about your service that is able to be reported—or collected—every 10, 30, or 60 seconds?

If you’re dealing with high-resolution data, this is probably a moot point. Even if the collection period is slow or sporadic, you can likely just count aggregate good and total events and be done with it.

But not all metrics are high resolution, so you might have to think about things in terms of windows of time. For example, let’s say you want a target percentage of 99.95%. This means that you’re only allowed about 43 seconds of bad time per day:

(1 - 0.9995) × 24 × 60 × 60 = 43.2

To work this out, you first subtract your target percentage from 1 to get the acceptable percentage of bad observations. Then, to convert this into seconds per day, you multiply that value by 24 (hours per day), then by 60 (minutes per hour), then by 60 again (seconds per minute).

In this case, you would exceed your error budget with just a single bad observation if your data exists at a resolution of 60 seconds, since you would immediately incur 60 seconds of bad time. There are ways around this if these bad observations turn out to be false positives of some sort. For example, maybe you need for your metric to be below (or above) a certain threshold for two consecutive observations before you count even just one of them against your percentage. However, this also might just mean that 99.95% is the wrong target for a system with metrics at that resolution. Changing your SLO to 99.9% would give you approximately 86 seconds of error budget per day, meaning you’d need two bad observations to exceed your budget.
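
Here is a quick sketch of that arithmetic, assuming one observation every 60 seconds and that each bad observation costs a full collection interval of bad time:

```python
# How many bad observations can a day absorb if each observation represents
# a full collection interval of "bad time"?

RESOLUTION_S = 60          # one observation per 60 seconds
SECONDS_PER_DAY = 86400

for target in (0.9995, 0.999):
    budget_s = (1 - target) * SECONDS_PER_DAY
    bad_observations_allowed = int(budget_s // RESOLUTION_S)
    print(f"{target:.2%}: {budget_s:.1f} s/day of budget, "
          f"{bad_observations_allowed} bad observation(s) tolerated")

# 99.95%: 43.2 s/day -> 0 bad observations tolerated
# 99.90%: 86.4 s/day -> 1 bad observation tolerated
```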

Exactly what effect your resolution will have on your choice of reliability target is heavily dependent on both the resolution and the target at play. Additionally, the actual needs of your users will need to be accounted for. While we will not enumerate more examples, because they’re potentially endless, make sure you take metric resolution into consideration as you choose a target.

Quantity

Another common problem you could run into revolves around the quantity of your metrics. Even if you are able to collect data about your service at a one-second interval, your service might only have an event to report far less frequently than that. Examples of this include batch scheduler jobs, data pipelines with lengthy processing periods, or even request and response APIs that are infrequently talked to or have strong diurnal patterns.

When you don’t have a large number of observations, target percentages can be thrown for a loop very quickly. For example, if your data processing pipeline only completes once per hour, a single failure in a 24-hour period results in a reliability of 95.83% over the course of that single day. This might be totally fine—one failure every day could actually be a perfectly acceptable state for your service to be in—and maybe you could just set something like a 95% SLO target to account for this. In situations like this, you’ll need to make sure that the time window you care about in terms of alerting, reporting, or even what you display on dashboards is large enough to encompass what your target is. You can no longer be 95% reliable at any given moment in time; you have to think about your service as being 95% reliable over a 24-hour period.

Even then, however, you could run into a problem where two failures over the course of two entire days fall within 24 hours of each other. To allow for this, you either have to make the window you care about very large or set your reliability target down to 90%, or even lower. Any of these options might be suitable. The important thing to remember is that your target needs to result in happy users over time when you’re meeting it, and mark a reasonable point to pivot to discussing making things more reliable when you’re not.
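
The arithmetic behind these low-quantity examples is simple, but it’s worth sketching out because it shows how much a single failure moves the needle when there are only a handful of events in the window:

```python
# Reliability of a once-per-hour pipeline over windows of different sizes,
# given a fixed number of failed runs in the window.

def reliability(total_runs, failed_runs):
    return (total_runs - failed_runs) / total_runs

print(f"{reliability(24, 1):.2%}")    # 1 failure in a day   -> 95.83%
print(f"{reliability(24, 2):.2%}")    # 2 failures in a day  -> 91.67%
print(f"{reliability(168, 2):.2%}")   # 2 failures in a week -> 98.81%
```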

For something like a request and response API that either has low traffic at all times or a diurnal (or other) pattern that causes it to have low traffic at certain times, you have a few other options in addition to using a large time window.

The first is to only calculate your SLO during certain hours. There is nothing wrong with saying that all time periods outside of working hours (or all time periods outside of 23:00 to 06:00, or whatever makes sense for your situation) simply don’t count. You could opt to consider all observations during those times as successes, no matter what the metrics actually tell you, or you could just ignore observations during those times. Either of these approaches will make your SLO target a more reasonable one, but could make an error budget calculation more complicated. Chapter 5 covers how to better handle error budget outliers and calculations in more detail.

The other option available to you is using some advanced probability math like confidence intervals. Yes, that is a complicated-sounding phrase. Yes, they are not always easy to implement. But don’t worry, Chapter 9 has a great primer on how this works.

Services with low-frequency or low-quantity metrics can be more difficult to measure, especially in terms of calculating the percentages you use to inform an SLO target, but you can use some of these techniques to help you do so.

Quality

The third attribute that you need to keep in mind for the metrics informing your SLIs and SLOs is quality. You don’t have quality data if it’s inaccurate, noisy, ill-timed, or badly distributed. You could have large quantities of data available to you at a high resolution, but if this data is frequently known to be of a low quality, you cannot use it to inform a strict target percentage. That doesn’t mean you can’t use this data; it just means that you might have to take a slightly more relaxed stance.

The first way you can do this is to evaluate your observations over a longer time window before classifying them as good or bad. Perhaps you measure things in a way that requires your data to remain in violation of your threshold in a sustained manner for five or more minutes before you consider it a bad event. This lowers your potential resolution, but does help protect against noisy metrics or those prone to delivering false positives. Additionally, you can use percentages to inform other percentages. For example, perhaps you require 50% of all the metrics to be in a bad state over the five-minute time window before you consider that you have had a single bad event that then counts toward your actual target percentage.
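
As a rough sketch, a rule like that might look like the following, where the five-minute evaluation window and the 50% cutoff are simply the example values from the text rather than recommendations:

```python
# Collapse a window of noisy per-sample measurements into a single good/bad
# event. The window only counts as one bad event if at least half of its
# samples violate the threshold.

def window_is_bad(samples, threshold, bad_fraction_required=0.5):
    """Return True if this evaluation window should count as one bad event.

    `samples` is whatever the SLI measures over one window (five minutes in
    the text's example); a sample is in violation when it exceeds `threshold`.
    """
    if not samples:
        return False                 # no data: don't count the window as bad
    violations = sum(1 for s in samples if s > threshold)
    return violations / len(samples) >= bad_fraction_required
```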

Note

Using the techniques we’ve described here, as well as those covered in Chapters 7 and 9, you can even make low-quality data work for you—though you just might need to set targets lower than what your service actually performs like for your users. As long as your targets still ensure that users are happy when you exceed them and aren’t upset until you miss them by too much for too long, they’re totally reasonable. It doesn’t matter what the percentage actually is. Don’t let anyone trick you into thinking that you have to be close to 100% in your measurements. It’s simply not always the case.

Percentile Thresholds

When choosing a good SLO, it’s also important to think about applying percentile thresholds to your data. It’s rarely true that the metrics that you’re able to collect can tell you enough of the story directly. You’ll often have to perform some math on them at some point.

The most common example of this is using percentiles to deal with value distributions that might skew in one direction or the other in comparison with the mean. This kind of thing is very common with latency metrics (although not exclusive to them!), when you’re looking at API responses, database queries, web page load times, and so on. You’ll often find that the P95 of your latency measurements has a small range while everything above the P95 has a very large range. When this happens, you may not be able to rely on some of the techniques we’ve outlined to pick a reliable SLO target.

Let’s consider the web page load time example again, since it’s a very easy thought experiment everyone should be familiar with. You’re in charge of a website, and you’ve done the research and determined that a 2,000 ms load time keeps your users happy, so that’s what you want to measure. But once you have metrics to measure this, you notice that your web pages take longer than 2,000 ms to load a full 5% of the time, even though you haven’t been hearing any complaints. You could just set your target at 95%—and this might not be the wrong move—but you gain a few advantages by using percentiles instead. That is, you could say that you want web pages to load within 2,000 ms at the 95th percentile, 99.9% of the time.

The primary advantage of this approach is that you can continue to monitor and care about what your long tail actually looks like. For example, let’s say that your P95 observed values generally fall below 2,000 ms, your P98 values fall below 2,500 ms, and your P99 values fall below 4,000 ms. When 95% of page loads complete within 2 seconds, your users may not care if an additional 4% of them take 4 seconds; they may not even care if 1% of them take 10 seconds or time out entirely. But what users will care about is if suddenly a full 5% of your responses start taking 10 seconds or more.

By setting an objective on the P95 itself, requiring it to stay fast a high percentage of the time, rather than simply setting a 95% target and ignoring everything above it, you free yourself up to look at your other percentiles. Based on the preceding examples, you could now set additional SLOs that make sure your P98 remains below 2,500 ms and your P99 remains below 4,000 ms. Now you have three targets that help you tell a more complete story, while also allowing you to notice problems within any of those ranges independently, instead of just discarding some percentage of the data:

  1. The P95 of all requests will successfully complete within 2,000 ms 99.9% of the time.

  2. The P98 of all requests will successfully complete within 2,500 ms 99.9% of the time.

  3. The P99 of all requests will successfully complete within 4,000 ms 99.9% of the time.

With this approach, you’ll be able to monitor if things above the 95th percentile start to change in ways that make your users unhappy. If you try to address your long tail by setting a 95% target, you’re discarding the top 5% of your observations and won’t be able to discover new problems there.
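
One way to operationalize those three targets is to compute the relevant percentile for each evaluation window and record a good or bad observation per objective; the 99.9% then applies to how often each window is good. A rough sketch, using the example thresholds above:

```python
import numpy as np

# Check one evaluation window of page load times (in ms) against the three
# percentile thresholds from the example. Each window produces one good/bad
# observation per objective; meeting the 99.9% target then means that at
# least 99.9% of windows over your SLO period were good.

SLO_THRESHOLDS_MS = {95: 2000, 98: 2500, 99: 4000}

def check_window(load_times_ms):
    results = {}
    for percentile, threshold in SLO_THRESHOLDS_MS.items():
        observed = np.percentile(load_times_ms, percentile)
        results[percentile] = observed <= threshold
    return results   # e.g. {95: True, 98: True, 99: False}
```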

Another advantage of using percentiles is a reporting one. This book preaches an idea that may have been best summarized by Charity Majors: “Nines don’t matter if users aren’t happy.” While this is true, a long tail accommodated by a lower target percentage can be misleading to those newer to SLO-based approaches. Instead, you can use percentiles to make your written SLO read more intuitively as a “good” one. You shouldn’t set out to purposely mislead anyone, of course, but you can always choose your language carefully so as to not alarm people.

What to Do Without a History

What should you do if you’re trying to set SLOs for a service that doesn’t yet exist, or otherwise doesn’t have historical metrics for you to analyze? How can you set a reasonable target if you don’t yet have any users, and therefore may not know what an acceptable failure percentage might look like for them? The honest answer is: just take an educated guess!

SLOs are objectives, not formal agreements, and that means you can change them as needed. While the best way to pick an SLO that might hold true into the future is to base it on data you’ve gathered, not every SLO has to hold true into the future. In fact, SLO targets should change and evolve over time. Chapter 14 covers this in depth.

There may be other sources of data you can draw upon when making your educated guess—for example, the established and trusted SLOs of services that yours might depend on, or ones that will end up depending on yours. As you’ve seen, your service can’t be more reliable than something it has a hard dependency on, so you need to take that into account when picking an initial target.

It’s also true that not every service has to have an SLO at launch—or even at all. An SLO-based approach to reliability is a way of thinking: are you doing what you need to do, and are you doing so often enough? It’s about generating data to help you ask those questions in a better way. If your service doesn’t yet have the metrics or data you need to be able to ask those questions in a mathematical way, you can always make sure you’re thinking about these things even without that data.

However, there are ways to make sure you’re thinking about SLIs and SLOs as you architect a service from the ground up. Chapter 10 discusses some of these techniques.

Summary

Every system will fail at some point. That’s just how complex systems work. People are actually aware of this, and often okay with it, even if it doesn’t always seem obvious at first. Embrace this. You can take actions to ensure you don’t fail too often, but there isn’t anything you can do to prevent every failure for all time—especially when it comes to computer systems.

But if failures don’t always matter, how do you determine when they do matter? How do you know if you’re being reliable enough? This is where choosing good SLOs comes in. If you have a meaningful SLI, you can use it to power a good SLO. This can give you valuable insight into how reliable your users think you are, as well as provide better alignment about what “reliable” actually means across departments and stakeholders. You can use the techniques in this chapter to help you start on your journey to doing exactly that.

1 Think, for example, about Hyrum’s law, discussed in Chapter 2. There’s also a great story about Chubby, a dependency for many services at Google, in Chapter 5.

2 Trying to be 99.999% reliable over time means you can be operating unreliably for less than one second per day and only about 5 minutes and 15 seconds over the course of an entire year (Chapter 5 discusses how to do this math). This is an incredibly difficult target to reach. Even if your services are rock solid, everything depends on something else, and it is often unlikely that all of those dependencies will also consistently operate at 99.999% for extended periods of time.

3 These numbers were calculated via an excellent tool written by Kirill Miazine and are based upon calculations that assume a year has 365.25 days in order to account for leap years.

4 Actually identifying all of your dependencies, however, is a complicated ordeal. This is why you need to measure the actual performance of your service, and not just set your targets based upon what your known dependencies have published. You almost certainly have dependencies you don’t know about, and your known dependencies aren’t all going to have perfectly defined SLOs.

5 Chapter 14 discusses how to evolve your SLO targets in great detail.
