Chapter 1. Define Your Objectives

Defining your service objectives will help ensure your system satisfies users while minimizing costs. Skip this step and you might design a system that is more complex than it needs to be, takes more engineers to implement, costs more to operate, or doesn’t meet the needs of your users. Your objectives will influence the choice of dependencies, architectures, and means of operation.

Objectives you should define for each service include:

  • Service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) to set performance goals and set expectations for users

  • Failure domains and redundancy so that failures don’t result in service outages

  • Scalability and efficiency goals so the service can grow while remaining cost effective

  • Speed of change so the service remains relevant to users and your business

SLIs, SLOs, SLAs

SLIs, SLOs, and SLAs are a structure for formally defining your reliability goals as numerical measures that you can use to prioritize engineering investment, evaluate system design trade-offs, enable effective system designs across system boundaries, and reach meaningful agreements between independent parties. These measures should cover all the needs of your users and business.

SLIs are the metrics that are measured, stored, and analyzed. Common SLIs include availability, latency, freshness, quality, and durability. SLIs also define where the measurement occurs. Measure request performance from the client to capture the user’s full experience inclusive of client behavior and network performance, and also measure from the server to reduce noise and tighten evaluation of components fully under the service’s control. SLOs combine an SLI with a threshold objective over a measurement window. Table 1-1 describes common SLIs and example SLOs.

Table 1-1. Common SLIs and example SLOs
SLI name | SLI definition | SLO example
Availability | Availability is the ratio of successful requests to total requests. A system is also considered unavailable if response latency is too high to be practically useful. | 99.9% of requests succeed within a minute over a rolling quarter.
Latency | Latency is the time it takes to receive a response. It is measured as a distribution and evaluated with statistics like the average, median, 90th percentile, 95th percentile, and 99th percentile. | 99% of responses were within 100 ms over a rolling quarter.
Freshness | Freshness measures the staleness of a response, or the duration for which writes might not be reflected in a read, with the same statistical measures as latency. Freshness is measured for systems that aren't strongly consistent. | 99% of responses had data with staleness under 60 s over a rolling quarter.
Quality | Quality is a measure of the contents of a response, like the percentage of items returned in a set. Quality is measured for systems that can gracefully degrade, for example, by skipping results stored on servers that are currently unavailable. | 99% of responses had 90% of the items over a rolling quarter.
Durability | Durability is the ratio of readable data to all data previously written. | 99.999999999% of stored data is readable.
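To make these definitions concrete, here is a minimal sketch of computing availability and 99th percentile latency SLIs from a batch of request records and comparing them against example SLO thresholds. The `Request` record shape and the thresholds are illustrative assumptions, not part of any particular monitoring system.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    success: bool
    latency_ms: float

def availability(requests: list[Request]) -> float:
    """Availability SLI: ratio of successful requests to total requests."""
    return sum(r.success for r in requests) / len(requests)

def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank latency percentile (pct in (0, 100])."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(0, math.ceil(pct / 100 * len(latencies)) - 1)
    return latencies[rank]

requests = [Request(True, 42.0), Request(True, 88.0), Request(False, 9.0)]
print(availability(requests) >= 0.999)          # availability SLO: 99.9% succeed
print(latency_percentile(requests, 99) <= 100)  # latency SLO: p99 under 100 ms
```

In practice the SLI would be computed by a monitoring system over a rolling window rather than over an in-memory list, but the arithmetic is the same.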

An SLO can also be described as an error budget and burn rate. For example, a service with an availability SLO of 99.9% over a quarter can be 100% down for about 2 hours or 10% down for about 21 hours. A service with an availability of 99.999% over a quarter can be 100% down for about 1 minute or 10% down for about 12 minutes. Humans typically cannot respond and mitigate faster than one hour, so budgets under an hour require an automated response. Figure 1-1 shows error budgets for various burn rates and SLOs.

Figure 1-1. Error budgets for various error rates and SLOs over a quarter.
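The error-budget arithmetic above can be written down directly. The following sketch assumes a 90-day quarter and uses the SLO targets and burn rates from the example.

```python
# Error budget and burn rate for an SLO evaluated over a 90-day quarter.
QUARTER_HOURS = 90 * 24

def error_budget_hours(slo: float) -> float:
    """Total allowed 'fully down' time, in hours, for the window."""
    return (1 - slo) * QUARTER_HOURS

def time_to_exhaust_hours(slo: float, error_rate: float) -> float:
    """How long the budget lasts if errors occur at the given rate."""
    return error_budget_hours(slo) / error_rate

print(time_to_exhaust_hours(0.999, 1.0))         # ~2.2 hours at 100% errors
print(time_to_exhaust_hours(0.999, 0.1))         # ~21.6 hours at 10% errors
print(time_to_exhaust_hours(0.99999, 1.0) * 60)  # ~1.3 minutes at 100% errors
print(time_to_exhaust_hours(0.99999, 0.1) * 60)  # ~13 minutes at 10% errors
```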

Choose SLO thresholds and windows based on user and business needs. Objectives that underperform stakeholder needs will leave stakeholders unhappy or break obligations, whereas objectives that overperform will result in unnecessary engineering investment, more complex systems, and reduced flexibility to evolve the system. If stakeholders complain even when the system is meeting its SLO, that's a good indicator the SLO is too weak; if stakeholders don't complain when the system fails to meet its SLO, that's a good indicator the SLO is too strong. When a system fails to meet its SLOs, expect to invest engineering resources to improve system performance; when the system is meeting its SLOs, expect to save that investment.

Avoid setting your objectives based solely on the system's current performance: current performance may not satisfy your stakeholders, and it may unnecessarily constrain the implementation to what's already built. For example, if you set your latency SLO to match current performance even though it is faster than users need, you may be unable to add new features that increase latency without violating the SLO. If you later relax the SLO, you may break assumptions that consumers had made and cause downstream SLO violations.

SLOs may be specialized per API, per method, or per feature if there are meaningful differences in user and business needs. For user-facing services, it’s common to use different objectives for critical and optional functionality. For example, a blogging service may want a higher availability for reading blogs than for posting new entries. For infrastructure services, it’s common to use different objectives for the data plane (the functionality involved in serving end-user requests like reading from a database) and for the control plane (the functionality involved in administering the infrastructure like turning up a new replica in a new region). Figure 1-2 shows common SLOs based on layers in the tech stack and classes of functionality.

Figure 1-2. Common availability SLOs based on layer in the tech stack and class of functionality.

You can evaluate SLOs for the service as a whole or for each individual consumer. Evaluating the SLO for each consumer will prevent a poor experience in isolation, like 100% unavailability for their requests despite the service as a whole having sufficient availability. The more granular the evaluation, the more susceptible the statistic will be to noise and the more expensive the evaluation due to cardinality. For example, a service may have one million requests in a quarter with which to compute the 99th percentile latency, but for a single consumer who had one request, the 99th percentile becomes the same as the max latency for that consumer. Evaluating the SLO for each consumer may also put new constraints on the architecture. For example, evaluating the service as a whole may put constraints on general database availability, whereas evaluating the SLO for a single consumer may put constraints on availability of specific database rows. Similar to other SLO parameters, the choice of granularity should be based on user and business needs.
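As a sketch of per-consumer evaluation, the snippet below groups hypothetical (consumer, success) records and computes availability for each consumer, illustrating how a single consumer can see total unavailability while the service-wide SLI looks healthy.

```python
from collections import defaultdict

def per_consumer_availability(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Availability computed separately for each consumer."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for consumer, success in records:
        totals[consumer] += 1
        successes[consumer] += success
    return {consumer: successes[consumer] / totals[consumer] for consumer in totals}

records = [("alpha", True), ("alpha", True), ("beta", False)]
print(per_consumer_availability(records))
# {'alpha': 1.0, 'beta': 0.0}: 'beta' sees 100% unavailability even though the
# service as a whole is 2/3 available, and a single request is a noisy sample.
```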

Publish SLOs to set performance expectations with users and to document what level of performance can be relied upon. If you don't publish SLOs, users will form an implicit SLO based on their needs and observed performance, which has consequences when the service fails to meet those implicit expectations: users may become upset and demand higher performance even though that level of performance was never an objective for the service, and their broken assumptions may cause downstream SLO violations. Similarly, when designing your architecture, select and use dependencies based on their published SLOs so that the entire system performs as desired. For example, you may choose a less featureful database that comes with higher availability SLOs so that the database does not limit your service's ability to meet its own SLOs. Later in this report, we'll learn to compose dependencies to increase aggregate reliability.

SLAs combine an SLO with a consequence should the SLO be violated, and are used between independent parties with a business relationship. For example, a cloud service provider may provide credits or refunds if virtual machine availability does not meet the objective documented in the agreement. Because SLA violations carry financial and reputational damage, the objectives in SLAs tend to be more conservative than the equivalent internal SLOs. For example, an SLA may give credits for failing to meet 99.9% availability even though the internal SLO targets 99.99% availability. For service owners, publishing a weaker objective in an SLA risks consumers developing an implicit SLO based on observed performance, much as if no SLO or SLA were published at all. For consumers, receiving weaker objectives makes it difficult to leverage the service in other applications, which must in turn be designed to meet their own SLO targets. When SLAs are too weak to be useful, consumers must either demand stronger objectives from their providers, accept the risk that the dependency may underperform their assumptions, or avoid the service in question altogether.
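The consequence side of an SLA can be expressed as a simple mapping from measured performance to a remedy. The credit tiers below are invented for illustration and do not reflect any particular provider's terms.

```python
def sla_credit_percent(measured_availability: float) -> int:
    """Service credit owed for the billing period under a hypothetical SLA."""
    if measured_availability >= 0.999:   # SLA objective met
        return 0
    if measured_availability >= 0.99:    # minor violation
        return 10
    return 25                            # major violation

# The SLA (99.9%) is met even though a stricter internal SLO (99.99%) was missed:
print(sla_credit_percent(0.9995))  # 0
print(sla_credit_percent(0.9950))  # 10
```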

For more on SLIs, SLOs, and SLAs, see “Service Level Objectives” by Chris Jones et al. in Site Reliability Engineering.

Failure Domains and Redundancy

A failure domain is a group of resources that can fail as a unit, making services deployed within that unit unavailable. To achieve availability objectives you must define failure domains and redundancy requirements to prevent failures from creating outages that are too large. Services must gracefully handle failures in both hardware and software. Software failures include new code bugs, misconfiguration, contention, or the inability to leverage additional resources to handle more load. Hardware failures in an on-prem datacenter include utility power being cut to the datacenter, top-of-rack switches cutting connectivity to hosts in the rack, hard drives rendering data inaccessible, CPUs corrupting requests and data, or request load exhausting available resources.

Define failure domains by ensuring there are no shared dependencies across the domains. The blast radius of a failure, or the set of things impacted by a failure, is thereby limited to the failure domain. For example, the blast radius of a CPU failure is a single host because multiple hosts don't depend on a single CPU. Mitigate failures of a domain by creating redundancy. For example, a service can be deployed across multiple hosts so that when a CPU fails, the other hosts can still host the service. For redundancy to be effective, the replicas must be in separate failure domains and fail independently of each other. For example, adding a second CPU to a host does not add redundancy because the failure of one CPU can affect the other CPU on the same host.

In a cloud environment, the hierarchy of failure domains for an application developer consists of virtual machines, hosts (physical machines), zones, and regions. Application developers should deploy services redundantly across these failure domains. When using an orchestration system like Kubernetes, the corresponding hierarchy of failure domains would be pods, nodes, clusters, and regions. Table 1-2 shows common hardware failure domains and opportunities to create redundancy.

Table 1-2. Common hardware failure domains and associated redundancy
Failures | Failure domain | Infrastructure redundancy (cloud responsibility) | Application redundancy (your responsibility)
CPU, RAM, disk | Host | Multiple hosts in a datacenter | Multiple servers and data replicas that can be distributed over multiple hosts
Top-of-rack switch, power rectifier, battery backup | Rack | Multiple racks in a datacenter | Multiple servers and data replicas that can be distributed over multiple hosts
Aggregation and spine switches, row transformer, diesel backup generator | Row | Multiple rows in a datacenter | Multiple servers and data replicas that can be distributed over multiple hosts
WAN routers, machine fire, cluster schedulers | Cluster | Multiple clusters within a datacenter / campus | Multiple servers and data replicas distributed over multiple zones
Exterior fiber optics, power utility, earthquake, hurricane, tornado | Datacenter / campus | Multiple datacenters with geographic dispersion | Multiple servers and data replicas distributed over multiple regions
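As an illustration of placing replicas in independent failure domains, the sketch below checks that no single zone contains every replica; the hosts, zones, and regions are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    host: str
    zone: str
    region: str

def survives_zone_failure(replicas: list[Replica]) -> bool:
    """True if no single zone failure can take down every replica."""
    return len({r.zone for r in replicas}) > 1

replicas = [
    Replica("host-1", "zone-a", "region-1"),
    Replica("host-2", "zone-b", "region-1"),
    Replica("host-3", "zone-c", "region-1"),
]
print(survives_zone_failure(replicas))  # True: losing zone-a leaves two replicas
# All replicas share region-1, so a region failure would still be an outage.
```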

A failure within a domain can impact all components within that domain. Larger failure domains impact more components and can thereby create larger outages. For example, a failure of the rack’s power supply will take down all hosts in the rack, whereas a failure of the datacenter’s substation will take down all hosts in the datacenter. Similarly, a failure in a configuration loaded by every server will take down all servers, whereas a failure in a configuration loaded by only a single server will take down that server but not others.

Eliminate shared dependencies in larger failure domains by pushing those dependencies into the smaller failure domains, which have smaller blast radiuses and higher redundancy. For example, a configuration loaded from a global database is a dependency shared by all servers globally, which risks taking down all server instances globally. Instead, load the configuration from flags that can be set independently on each server, so that updating the flags for one server doesn't impact others, thereby reducing the blast radius of the configuration. The intended state of the flags can still be managed from one location, but the rollout of a new flag value should be incremental and gradual: update servers independently and stage the rollouts to gradually increase the percentage of hosts that have received the update. Figure 1-3 visually demonstrates creating smaller blast radiuses for a configuration.

Figure 1-3. Eliminate shared dependencies to create smaller blast radiuses.
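One way to stage such a rollout is to derive, for each stage, the subset of servers that should have the new flag value. The stage percentages and server names below are illustrative assumptions.

```python
# Fraction of servers that should have the new flag value at each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]

def servers_in_stage(servers: list[str], stage: int) -> list[str]:
    """Servers that should be updated by the given rollout stage."""
    count = round(len(servers) * STAGES[stage])
    return servers[:count]

servers = [f"server-{i}" for i in range(200)]
print(len(servers_in_stage(servers, 0)))  # 2: start with 1% of servers
print(len(servers_in_stage(servers, 1)))  # 20: widen only if the canary is healthy
print(len(servers_in_stage(servers, 3)))  # 200: finish with every server
```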

Minimize critical dependencies that can cause the service to be unavailable. Prefer soft dependencies that can be used when available and avoided otherwise, thereby ensuring the dependency does not impact the availability of the service. For example, a synchronous blocking remote procedure call (RPC) to a logging service would be a critical dependency, whereas an asynchronous nonblocking RPC, for which errors are monitored but otherwise ignored, would be a soft dependency. By switching to a soft dependency, the availability of the service is no longer dependent on the state of the logging service.
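The following sketch contrasts the two approaches in Python: a blocking call to a hypothetical `log_to_remote_service` RPC stub versus a fire-and-forget submission whose errors are only monitored.

```python
import concurrent.futures
import logging

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def log_to_remote_service(event: dict) -> None:
    ...  # hypothetical RPC to the logging service; it may be slow or fail

def handle_request_critical(event: dict) -> str:
    log_to_remote_service(event)  # blocking: a logging outage fails this request
    return "ok"

def _report_logging_errors(future: concurrent.futures.Future) -> None:
    if future.exception() is not None:
        logging.warning("log export failed: %s", future.exception())

def handle_request_soft(event: dict) -> str:
    future = executor.submit(log_to_remote_service, event)  # does not block
    future.add_done_callback(_report_logging_errors)        # monitor, don't fail
    return "ok"
```

With the soft dependency, the request path no longer depends on the logging service being available, at the cost of possibly dropping log entries during an outage.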

Set the redundancy based on the number of instances necessary to host the service (N) and the number of failures the service should survive (F), or N + F total instances. It’s typical to target surviving two failures (F = 2) so the service can survive one planned and one unplanned failure. A planned failure is done intentionally, like shutting down a host to upgrade the RAM. An unplanned failure is one that is unexpected, like the disk failing. For services that can operate and upgrade without planned failures, it may be sufficient to reduce redundancy to handle one failure to save the cost of an instance. Monitor the health of instances so that failed instances can be repaired or replaced to bring the service back to target redundancy.

You can reduce the overhead of redundancy by varying the number and size of instances. If one full-sized server would be 75% utilized when serving all traffic and you want to survive two failures, then you'll need three servers. When all servers are operational, each server will be 25% utilized, with 75% of each server idle. After two failures, the load from the failed servers moves to the remaining healthy server, which becomes 75% utilized. Figure 1-4 shows how load is redistributed after a server failure. If instead you used four quarter-sized servers plus two additional quarter-sized servers to handle failures, then each server would be 50% utilized during normal operation and 75% utilized after two failures. By going from N = 1 to N = 4, we've reduced the overhead of N + 2 from 66.7% to 33.3%. Figure 1-5 shows how overhead varies for different redundancies and numbers of instances. Note that too many instances can be more expensive if reducing instance size hurts efficiency due to per-instance overhead and the cost of instance coordination.

Figure 1-4. When failure occurs, load is shifted from failed to healthy instances. If there are three instances at 25% utilization each, after two failures the remaining instance will be 75% utilized.
Figure 1-5. Overhead versus instances for varying levels of redundancy. Increasing the number of instances while reducing the size of an instance decreases the overhead of redundancy.
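The sizing arithmetic above reduces to two small formulas: the fraction of provisioned capacity that exists only to absorb failures, and the per-instance utilization as instances fail. This sketch reproduces the numbers from the example.

```python
def redundancy_overhead(n: int, f: int) -> float:
    """Fraction of provisioned capacity that exists only to absorb failures."""
    return f / (n + f)

def per_instance_utilization(n: int, f: int, full_load_utilization: float,
                             failures: int) -> float:
    """Utilization of each healthy instance when `failures` instances are down."""
    return full_load_utilization * n / (n + f - failures)

# One full-sized server (N = 1) plus two spares:
print(redundancy_overhead(1, 2), per_instance_utilization(1, 2, 0.75, 0))  # 0.67, 0.25
# Four quarter-sized servers (N = 4) plus two spares:
print(redundancy_overhead(4, 2), per_instance_utilization(4, 2, 0.75, 0))  # 0.33, 0.50
print(per_instance_utilization(4, 2, 0.75, 2))  # 0.75 after two failures
```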

Where possible, minimize dependencies shared between servers, and minimize dependencies shared between regions. Prefer to have a dependency between servers within a region, rather than a dependency across regions. Redundancy for servers and regions is typically set at N + 2, unless there are no planned failures and there are significant cost savings with N + 1.

Scalability and Efficiency

As usage of a service grows, the service must grow in size to accommodate the additional load. That is, as queries per second and bytes stored grow, the service should be able to leverage additional computational and storage resources (like CPUs and disks, respectively), and the service should have a process for introducing new resources. If a service fails to scale, it may become overloaded and unhealthy, start serving errors, and ultimately miss its SLOs.

Additional servers can make a service more reliable, but reliability is opposed by the need for efficiency. Without profitability constraints, a service could simply overprovision without limit to eliminate the problem of overload. However, most services are subject to business profitability objectives, so services must be efficient with resources. As a service becomes more expensive, businesses typically raise the target utilization to reduce the cost of wasted resources.

The algorithms, data structures, and architectures used should scale at most linearly, or O(N), with service usage. Prefer solutions that scale sublinearly, like O(1) or O(log N). Polynomial solutions, or O(N^a) for a > 1, will quickly become prohibitively expensive and fail to scale. For example, an O(N^2) operation that takes 100 milliseconds for 10 users will take over 16 minutes with 1,000 users, and over 30 years with 1 million users. Solutions that are O(N log N) may scale but may be expensive or underperform, so those operations are typically restricted to control plane or noncritical-path operations. Figures 1-6 and 1-7 compare the performance of different algorithms as data or the number of users scales.

Figure 1-6. Scaling factor for cost or performance for different asymptotic complexities.
Figure 1-7. O(N^2) scaling quickly underperforms and becomes prohibitive.
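The arithmetic behind those figures is a simple proportional scaling of cost with the chosen exponent; the sketch below reproduces the O(N^2) example from the text.

```python
def scaled_cost_ms(base_ms: float, base_n: int, n: int, exponent: float) -> float:
    """Cost at n users if cost grows proportionally to n**exponent."""
    return base_ms * (n / base_n) ** exponent

# The O(N^2) operation above: 100 ms at 10 users.
minutes = scaled_cost_ms(100, 10, 1_000, 2) / 1000 / 60
years = scaled_cost_ms(100, 10, 1_000_000, 2) / 1000 / 3600 / 24 / 365
print(f"{minutes:.1f} minutes at 1,000 users")  # ~16.7 minutes
print(f"{years:.1f} years at 1 million users")  # ~31.7 years
```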

Services should support both vertical and horizontal scaling to leverage additional resources and handle more load. Vertical scaling makes existing components larger, which tends to improve or maintain efficiency but fails to scale beyond a practical limit like the maximum host or disk size. Horizontal scaling adds instances of a component without changing the size of each instance, which tends to reach much larger scales but can reduce efficiency due to increased overhead and coordination. The Kubernetes Horizontal Pod Autoscaler is an example of horizontal scaling. Figure 1-8 visually shows the difference between vertical and horizontal scaling. With scaling as an objective, service design must consider details such as how a server can leverage additional CPUs, how work can be distributed across servers, and how state can be synchronized. Services must also avoid bottlenecks to scaling like contention. We'll dive deeper into scaling in "Horizontal and Vertical Scaling".

Figure 1-8. Services can leverage more resources by scaling horizontally by adding additional server instances, or by scaling vertically by creating bigger servers.
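A horizontal-scaling decision can be as simple as adjusting the replica count in proportion to how far observed utilization is from its target, roughly the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation. The numbers below are illustrative.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Scale replica count proportionally to the distance from the target."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

print(desired_replicas(4, 0.90, 0.60))  # 6 replicas: spread the same load thinner
print(desired_replicas(6, 0.30, 0.60))  # 3 replicas: consolidate when underutilized
```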

Consider both utilization and the cost of work. Utilization measures the percentage of allocated resources actively being used to do useful work. For example, if a service is provisioned with 10 CPUs but only uses 1 CPU to serve traffic, then the service has 10% utilization. If the service were resized to use 2 CPUs, utilization would increase to 50% and cost would be reduced by 80%. Work cost is the amount of resources necessary to complete a logical unit of work, like responding to an HTTP request. Work cost can be measured through metrics like requests per second per CPU. Services may be able to reduce work cost through techniques like caching. Services typically target a utilization between 50% and 80%, whereas work cost targets are domain specific. Services with long periods of inactivity may leverage scale-to-zero, shutting down all server instances to eliminate cost during idle periods and meet efficiency objectives.
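The utilization and work-cost measures are straightforward ratios; this sketch reuses the CPU numbers from the example and adds a hypothetical requests-per-second figure.

```python
def utilization(resources_used: float, resources_provisioned: float) -> float:
    """Fraction of allocated resources doing useful work."""
    return resources_used / resources_provisioned

def work_cost(requests_per_second: float, cpus: float) -> float:
    """Requests served per second per CPU; higher means cheaper work."""
    return requests_per_second / cpus

print(utilization(1, 10))  # 0.10: provisioned 10 CPUs, only 1 in use
print(utilization(1, 2))   # 0.50 after resizing to 2 CPUs (an 80% cost reduction)
print(work_cost(500, 1))   # hypothetical: 500 requests/s served per CPU
```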

Evolution and Velocity

Service requirements change over time due to changing user and business needs, and services must evolve to satisfy these new requirements or be rendered obsolete. Services must also evolve rapidly so that users are quickly satisfied and the service remains competitive. Design services to safely evolve at high velocity by creating environments, continuously deploying changes, updating production gradually, and designing for forward and backward compatibility.

Services typically target hourly deployments to a development environment, daily deployments to a staging environment, and weekly deployments to a production environment as part of an agile model of development where improvements are regularly delivered into production. Integrate and deploy components of a feature continuously as they are developed to minimize merge conflicts and the size of any change deployed. To maintain productivity, teams must invest in automated testing, qualification, and deployment. For detailed guidance, see “DevOps Tech: Continuous Delivery”.

For safety, you must gradually deploy changes to the production environment so that problems can be detected and mitigated prior to becoming a full service outage; therefore, services should roll out a component change gradually over a week. To deploy gradually over a week while maintaining a weekly cadence, component changes are rolled out in parallel. Changes to different components must be decoupled, and individual components must maintain backward and forward compatibility between adjacent releases.

Evolution across services or with user involvement tends to be more difficult and much slower. Services typically assume that breaking changes require maintaining backward compatibility, supporting the old behavior, for years so that users have time to upgrade without disruption. For changes that don't require user work and only require redeployment, services typically assume new versions will be deployed within months. Because of this slower velocity, developers tend to be more intentional with externally visible API and feature changes, performing deeper reviews and being conservative where possible. Due to the difficulty and cost of breaking changes, developers tend to design with backward and forward compatibility in mind.
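One common way to achieve backward and forward compatibility between adjacent releases is the tolerant-reader pattern: ignore fields you don't recognize and default fields that are missing. The message shape below is hypothetical.

```python
# Fields this release of the reader understands, with defaults for older writers.
KNOWN_DEFAULTS = {"title": "", "body": "", "author": "unknown"}

def parse_post(message: dict) -> dict:
    """Tolerant reader: keep known fields, default missing ones, drop unknown ones."""
    post = dict(KNOWN_DEFAULTS)
    for key, value in message.items():
        if key in KNOWN_DEFAULTS:  # forward compatibility: ignore unknown fields
            post[key] = value
    return post

# A newer writer added a "reactions" field; this reader simply ignores it.
print(parse_post({"title": "hi", "body": "text", "reactions": 3}))
# An older writer omitted "author"; the default keeps the message usable.
print(parse_post({"title": "hi", "body": "text"}))
```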

Recap

In this chapter, we’ve defined the objectives for a service:

Service level indicators (SLIs), objectives (SLOs), agreements (SLAs)
  • Measure, store, and analyze metrics for availability, latency, freshness, quality, and durability. Measure metrics both at the client side and at the server side.

  • Choose SLO thresholds and windows based on user and business needs, not based on current system performance.

  • Publish SLOs to set expectations with users and consumers.

  • Use an SLA between independent parties with a business relationship to set expectations and the consequences for underperformance.

Failure domains and redundancy
  • A failure domain is a group of resources that can fail as a unit.

  • Create redundancy by leveraging multiple servers, zones, and regions.

  • Eliminate shared dependencies in larger failure domains by pushing them into smaller domains with smaller blast radiuses and higher redundancy.

  • Minimize critical dependencies and prefer soft dependencies.

  • Set the redundancy based on the number of instances necessary to host the service (N) and the number of failures the service should survive (F), or N + F total instances.

Scalability and efficiency
  • Algorithms, data structures, and architectures should scale at most linearly, O(N). Prefer solutions that scale sublinearly, like O(1) or O(log N).

  • Services should support both scaling vertically (larger servers) and horizontally (more servers).

Evolution and velocity
  • Target hourly deployments to a development environment, daily deployments to a staging environment, and weekly deployments to a production environment.

  • Integrate and continuously deploy components of a feature as they are developed.

  • Gradually deploy changes to the production environment over a week.

  • Ensure components have backward and forward compatibility between adjacent releases.

  • Assume breaking API changes must maintain backward compatibility for years.

Next, we’ll explore the building blocks we’ll leverage to build a service that meets those objectives.
