Chapter 4. Avoid Common Failure Modes

Resilient services must defend against common failure modes: deploying a bad change, error handling that exacerbates a problem, resource exhaustion, overload from thundering herds1 and hotspots, and data corruption or loss. This chapter introduces these common failure modes and discusses techniques you can use to defend against them.

Bad Changes

At Google, the most common trigger of an outage is the deployment of a bad change. For example, in one major outage, an error introduced in a load-balancing configuration was rapidly deployed to all load balancers, causing them to establish additional backend connections, run out of memory, and disrupt global network traffic. In another major outage, a quota management system received an erroneous usage report, which resulted in a global resizing of quota, resulting in Paxos write failures that led to Paxos log staleness, bounded-staleness read failures, authentication failures, and global service disruption. To defend against these outages, Google has established a set of change management principles shown in Table 4-1.

Table 4-1. Change management principles
Principle                 | Description
Change Supervision        | Monitor services to detect faults, including dependencies and dependents for end-to-end coverage.
Progressive Rollout       | Make changes slowly across isolated failure domains so issues can be detected and mitigated before having a large impact.
Safe & Tested Mitigations | Have rapid, low-risk, and tested mitigations like change rollback, and automatically execute them on issue detection.
Defense in Depth          | Independently verify the correctness of changes at each layer of the stack for multiple defenses against failure.

Supervise changes by monitoring instrumentation of the components. Monitoring must be fine-grained enough to detect anomalies in a new version when it has been rolled out to only a small fraction of clients, servers, or systems. Time-series metrics can provide indicators like server crash rate, request error ratio, and request latency. Probers can automatically evaluate a service end to end by periodically executing a test that exercises the API the way a consumer would. Use metrics and probers to detect performance deviations and unhealthiness; for example, you can detect that a change increased latency by 10% or pushed the error rate to 1%. Change management systems must monitor these signals while rolling out a change so issues can be detected and mitigated. Once an issue is detected, you’ll want further instrumentation like dashboards, logging, tracing, and profiling in order to identify, mitigate, and prevent the contributing factors.
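
As an illustration, a prober can be little more than a loop that exercises an endpoint the way a consumer would and records the latency and result. Here is a minimal sketch in Go, where the probed URL, interval, and timeout are hypothetical:

    // prober.go: a minimal end-to-end prober sketch. The endpoint, probe
    // interval, and timeout are illustrative, not a specific system's values.
    package main

    import (
        "context"
        "log"
        "net/http"
        "time"
    )

    func probeOnce(ctx context.Context, url string) (time.Duration, error) {
        start := time.Now()
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return 0, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return time.Since(start), err
        }
        resp.Body.Close()
        return time.Since(start), nil
    }

    func main() {
        const url = "https://service.example.com/healthz" // hypothetical endpoint
        for range time.Tick(30 * time.Second) {
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            latency, err := probeOnce(ctx, url)
            cancel()
            // In practice, export these as time-series metrics rather than logs.
            log.Printf("probe latency=%v err=%v", latency, err)
        }
    }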

Roll out changes progressively by creating small, independent failure domains and then changing those failure domains one at a time. If a change is bad, it will cause that domain to fail, but the other domains will remain healthy. Change management systems should detect issues in the updated domain and halt the rollout before many domains are broken. It’s typical to create independent failure domains at the granularities of server, zone, region, and continent. For global services, it’s also common to create independent failure domains at data or service boundaries, like sharding users across 10 independent instances of the service. Resilient services are typically rolled out over a week, waiting at least several minutes between server updates. Regions are typically grouped into waves of 0.1%, 1%, 10%, 50%, and 100%, where waves are spaced by one business day to expose issues tied to daily traffic cycles and global demand fluctuations. Figure 4-1 shows an example gradual rollout schedule over a week.

Figure 4-1. Example gradual rollout schedule.
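
To make the schedule concrete, the following sketch walks a new version through the waves described above; deployToFraction, rolloutHealthy, and rollback are hypothetical hooks into your deployment and monitoring systems:

    // Sketch of a wave-based progressive rollout. The wave sizes and spacing
    // follow the schedule described above; the hook functions are stand-ins.
    package main

    import (
        "log"
        "time"
    )

    func deployToFraction(version string, fraction float64) error {
        log.Printf("deploying %s to %.1f%% of traffic", version, fraction*100)
        return nil // would call the real deployment system
    }

    func rolloutHealthy(version string) bool { return true } // consult metrics and probers here

    func rollback(version string) { log.Printf("rolling back %s", version) }

    func main() {
        const version = "v2.3.1" // hypothetical version
        waves := []float64{0.001, 0.01, 0.10, 0.50, 1.00}
        for _, fraction := range waves {
            if err := deployToFraction(version, fraction); err != nil {
                rollback(version)
                return
            }
            time.Sleep(24 * time.Hour) // roughly one business day between waves
            if !rolloutHealthy(version) {
                rollback(version) // safe, tested mitigation: automatic rollback
                return
            }
        }
    }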

Automatically apply safe and tested mitigations, like change rollback, once an issue has been detected in order to reduce the duration of outages and minimize the error budget exhausted. For example, an automatic rollback can reduce repair time from a human response time of hours to a machine response time of minutes. For automatic rollbacks to be safe, the service and the changes must be forward and backward compatible between adjacent versions.

Verify service configuration for correctness and sanity. For example, both the service as well as systems that store and propagate configuration changes should detect and alert when configuration is empty, partial or truncated, corrupted, logically incorrect or unexpected, or not received within the expected time. The service should gracefully degrade by continuing to operate in the previous state, or in a fallback mode, until the bad input can be corrected. Operational tools should reject invalid configurations, configurations that change too much from the previous version, and potentially destructive changes (e.g., revoke all permissions from all users). Source control should similarly evaluate changes with required pre-merge checks. Changes deemed risky should only be applied if operators use emergency overrides like a command-line flag, a configuration option, or disabling pre-merge checks.
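
A sketch of this kind of validation at configuration load time; the Config fields and the 20% change threshold are hypothetical:

    // Sketch of configuration validation before a new config is accepted.
    package config

    import (
        "errors"
        "math"
    )

    type Config struct {
        Backends []string
        MaxQPS   float64
    }

    // validate rejects empty or implausible configurations and changes that
    // differ too much from the running configuration, so the service keeps
    // operating on the previous state until the bad input is corrected.
    func validate(next, prev Config) error {
        if len(next.Backends) == 0 || next.MaxQPS <= 0 {
            return errors.New("config is empty or partial; keeping previous config")
        }
        if prev.MaxQPS > 0 && math.Abs(next.MaxQPS-prev.MaxQPS)/prev.MaxQPS > 0.20 {
            return errors.New("MaxQPS changed by more than 20%; requires emergency override")
        }
        return nil
    }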

Deploy isolated environments for development (dev), staging, and production (prod). Changes will be deployed first to dev, evaluated in dev, deployed to staging, evaluated in staging, and finally deployed to prod. By unit, integration, and load testing changes in dev and staging before prod, issues can be detected and mitigated before reaching end users. The dev environment’s purpose is to rapidly test and iterate on changes in a realistic setup. The staging environment will act as a dry run for updating production, and will also be a more stable environment than dev with which to test the interactions with dependencies and dependents before reaching prod. Figure 4-2 shows the coupling between services across environments.

Figure 4-2. Coupling between services across environments.

Finally, use infrastructure-as-code (IaC) and multiparty authorization (MPA) to safely make changes to production. IaC ensures that infrastructure configuration changes undergo review, are evaluated with pre-merge checks, and can be versioned and deployed similar to code. Multiparty authorization of commands can ensure one-off processes also undergo review to avoid common mistakes like typos.

Error Handling

How errors are handled can improve the reliability of the service by mitigating issues that would otherwise be exposed to consumers. Error handling can also exacerbate a problem and become the cause of an outage. Ensure services and clients generate appropriate errors and are well behaved in the presence of errors.

Gracefully degrade in the presence of errors by leveraging soft dependencies and having fail-safe alternatives. For example, a shopping page that contains reviews could be rendered without those reviews if the backend storing reviews is returning errors, enabling customers to browse and check out even if the experience isn’t ideal. Similarly, if there are 100 backends serving reviews but only a subset respond, the page can still be rendered with partial results, ensuring the customer gets some benefit from the review system. As another example, a load balancer distributing traffic based on server utilizations can fall back to round robin if the system gets bad utilization data, ensuring the system is able to serve some traffic even if load distribution isn’t uniform.

Services must return appropriate error codes and clients must be well behaved by responding appropriately to those error codes. Certain error codes like UNAVAILABLE indicate an issue may be transient and can be safely retried. Retry these errors across backends to increase the probability of requests succeeding so issues are mitigated and consumers are not impacted.

Error codes for overload or resource exhaustion should differentiate between server overload, service overload, and quota exhaustion. Requests rejected due to server overload should be retried, as there’s a chance they will succeed on another server. For the other error codes, immediate retries are unlikely to succeed, so short-deadline traffic like user requests should not be retried. However, batch traffic, which can afford to wait, can mitigate longer issues by retrying after a backoff delay. Table 4-2 summarizes the overload errors and when a client should retry the request.

Table 4-2. Overload error codes and retry conditions
Overload error | Error code         | Retry for user traffic | Retry for batch traffic with a delay
Server         | UNAVAILABLE        | Yes                    | Yes
Quota          | RESOURCE_EXHAUSTED | No                     | Yes
Service        | INTERNAL           | No                     | Yes
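
The policy in Table 4-2 can be written down directly. Here is a sketch using gRPC status codes, with a hypothetical isBatch flag distinguishing batch from user traffic:

    // Sketch of the retry policy in Table 4-2 expressed over gRPC status codes.
    package retrypolicy

    import "google.golang.org/grpc/codes"

    // shouldRetry reports whether a failed request may be retried.
    // Batch retries are expected to back off before resending.
    func shouldRetry(code codes.Code, isBatch bool) bool {
        switch code {
        case codes.Unavailable: // server overload: retry on another server
            return true
        case codes.ResourceExhausted, codes.Internal: // quota or service overload
            return isBatch // only batch traffic, and only after a backoff delay
        default:
            return false
        }
    }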

Error codes are usually propagated directly to upstream systems. However, if retrying a request fails multiple times due to local server overload, the propagated error should be a service overload to prevent retry amplification by upstream components, which could further overload the system. Figure 4-3 shows a request flow where a service retries an UNAVAILABLE error three times before returning an INTERNAL error to the consumer. If the service had instead propagated the UNAVAILABLE error, the upstream consumer may have itself retried three times leading to nine total requests to the dependency servers. Those extra six requests from retry amplification may further overload the dependency.

Figure 4-3. Preventing retry amplification by propagating an INTERNAL error.

Clients must slow down the sending of requests (backoff) when the service is overloaded to limit the cost of rejected requests and ensure the service isn’t pushed into cascading failure. The TCP network protocol leverages exponential backoff to ensure the connection remains healthy and data isn’t sent faster than it can be received. While exponential backoff will also work at the application layer for sending RPCs, it is commonly misimplemented to back off in the scope of an individual request rather than the connection between processes. Consider Figure 4-4, which shows the rate a client will send requests over time with different sending strategies. In this scenario, the backend has capacity for 2k QPS (dotted line) and the client is attempting to send 10k QPS (blue line), yielding 8k QPS of overload errors. A client that retries each throttled request up to three times (red line) will send 26k QPS to the backend, yielding further overload. A client that leverages per-request exponential backoff (yellow line) will spread a single request’s load over a longer period, potentially mitigating transient issues, but will still send 26k QPS and fail to reduce load on the backend. A technique called adaptive throttling (green line) throttles RPCs based on observed throughput and a target error ratio. This tunable target error ratio balances overhead of rejected requests with the speed at which the client adapts to a change in backend capacity. For more on adaptive throttling, see “Handling Overload” by Alejandro Forero Cuervo in Site Reliability Engineering.

Figure 4-4. Adaptive throttler versus per-request exponential backoff.
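
A minimal sketch of such an adaptive throttler, following the client-side throttling algorithm described in “Handling Overload”; the trailing-window bookkeeping is omitted and the choice of K is a tunable assumption:

    // Sketch of client-side adaptive throttling. Requests and accepts would
    // normally be counted over a trailing window (e.g., two minutes); that
    // bookkeeping is omitted here. K tunes how much rejected traffic the
    // client tolerates before it starts throttling locally.
    package adaptivethrottle

    import (
        "math/rand"
        "sync"
    )

    type Throttler struct {
        mu       sync.Mutex
        requests float64 // requests attempted by the client
        accepts  float64 // requests the backend accepted
        k        float64 // commonly around 2; lower values throttle sooner
    }

    func New(k float64) *Throttler { return &Throttler{k: k} }

    // Allow decides locally whether to send the request at all. Once requests
    // exceed k*accepts, the excess is rejected probabilistically at the client
    // instead of burdening the overloaded backend.
    func (t *Throttler) Allow() bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.requests++
        pReject := (t.requests - t.k*t.accepts) / (t.requests + 1)
        return !(pReject > 0 && rand.Float64() < pReject)
    }

    // Record is called with the backend's verdict after a request completes.
    func (t *Throttler) Record(accepted bool) {
        t.mu.Lock()
        defer t.mu.Unlock()
        if accepted {
            t.accepts++
        }
    }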

Backoff through delays/sleeping should add jitter, or random amounts of additional delay to each retried request. Without jitter, multiple concurrent requests with the same delay may inadvertently synchronize retries and create an instantaneous traffic spike leading to overload. By adding jitter, retry traffic will be smoothed out and load spread over time.
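
A minimal sketch of exponential backoff with full jitter, where the base delay and cap are hypothetical tuning values:

    // Sketch of exponential backoff with full jitter: each retry waits a random
    // duration up to an exponentially growing cap, so concurrent clients do not
    // synchronize their retries into a traffic spike.
    package backoff

    import (
        "math/rand"
        "time"
    )

    func backoffDelay(attempt int) time.Duration {
        const base = 100 * time.Millisecond
        const maxDelay = 10 * time.Second
        d := base << uint(attempt) // 100ms, 200ms, 400ms, ...
        if d > maxDelay || d <= 0 {
            d = maxDelay
        }
        return time.Duration(rand.Int63n(int64(d))) // jitter: uniform in [0, d)
    }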

API parameters must be validated and sanitized so that erroneous, random, or malicious inputs cannot cause service outages or security breaches. Defend in depth by validating inputs at each layer of the stack and within each component of code. Explicitly test error handling to validate error codes generated and behavior in edge cases. Test for unexpected scenarios through fuzz testing by intentionally calling service APIs with random, empty, or too-large inputs, which will expose insufficient parameter validation.
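
For example, Go’s built-in fuzzing (Go 1.18+) can drive an API entry point with arbitrary inputs. In this sketch, ParseRequest is a hypothetical parser defined alongside the fuzz test to keep the example self-contained:

    // This would live in a _test.go file; run with `go test -fuzz=FuzzParseRequest`.
    package api

    import (
        "encoding/json"
        "errors"
        "testing"
    )

    type Request struct {
        UserID string `json:"user_id"`
    }

    // ParseRequest is a stand-in for a real API entry point; it must reject
    // malformed input with an error rather than panicking.
    func ParseRequest(data []byte) (*Request, error) {
        var r Request
        if err := json.Unmarshal(data, &r); err != nil {
            return nil, err
        }
        if r.UserID == "" {
            return nil, errors.New("missing user_id")
        }
        return &r, nil
    }

    func FuzzParseRequest(f *testing.F) {
        f.Add([]byte(`{"user_id": "123"}`)) // seed corpus entry
        f.Fuzz(func(t *testing.T, data []byte) {
            // The fuzzer feeds random, empty, and oversized inputs; the only
            // requirement is that ParseRequest never panics or misbehaves.
            _, _ = ParseRequest(data)
        })
    }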

Attach deadlines or timeouts to requests and check whether they have been exceeded at multiple layers of the system and call stack to prevent problematic requests from exhausting resources. Deadlines are absolute timestamps specifying when processing must complete, and timeouts are durations specifying how long processing is allowed to take. Deadlines and timeouts ensure that an unhealthy request or system eventually drops the request even if the request never generates an error. This prevents a request from being processed indefinitely, which may tie up resources until the server is restarted. Propagate deadlines and timeouts across RPCs so that downstream components stop processing requests that the upstream caller has already abandoned. For example, if a service uses a 100ms timeout to process a request but a backend takes 200ms, the backend should instead abort the request after 100ms as the service will have already moved on and subsequently will ignore any response after that point.
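
A minimal sketch of deadline propagation using Go’s context package, where the 100ms timeout and the unit-of-work loop are illustrative:

    // Sketch of deadline propagation: the caller's deadline travels with the
    // request, and the backend checks it between units of work so it stops
    // processing requests the caller has already abandoned.
    package deadlines

    import (
        "context"
        "time"
    )

    func callBackend(ctx context.Context) error {
        // Give this RPC at most 100ms, but never more than the caller allows:
        // context.WithTimeout keeps whichever deadline is earlier.
        ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
        defer cancel()
        return process(ctx)
    }

    func process(ctx context.Context) error {
        for i := 0; i < 10; i++ {
            if err := ctx.Err(); err != nil {
                return err // context.DeadlineExceeded: the caller has moved on
            }
            time.Sleep(20 * time.Millisecond) // stand-in for real work
        }
        return nil
    }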

Resource Exhaustion

Resource exhaustion occurs when the load on a service exceeds the resources provisioned. Load on a service can increase due to organic growth like the gradual adoption by more users each month, or due to events like the release of a new feature that suddenly increases user demand, the release of a new version that contains a performance degradation, or an infrastructure event like a cache flush or the start of a batch job.

Two particularly risky failure modes are cascading failures and capacity caches. Cascading failure is where the overload of a single component causes load to shift to another component, which subsequently becomes overloaded and fails, repeating until all components have failed and the system is unable to recover. Figure 4-5 shows how cascading failures occur after the overload of a single component. For more on cascading failures, see “Addressing Cascading Failures” by Mike Ulrich in Site Reliability Engineering. Capacity caches are where the provisioning of the system depends on the efficiency gains of a cache, so if the cache underperforms due to a change in traffic or data distribution, the system as a whole is underprovisioned and resources become exhausted. These risky scenarios can be mitigated by dropping traffic until the system recovers or by provisioning additional resources. They can be prevented by not introducing caches to solve capacity problems, by testing the system under overload, by testing the system without the effect of the cache via cache flushes, and by the defensive techniques discussed in this section.

Figure 4-5. Cascading failure due to resource overload.

Services should gracefully degrade in such a way that the degradation maintains SLOs for traffic that fits within provisioned capacity and returns errors only for the excess load. Defend against resource exhaustion through:

  • Cost modeling: Measure resource usage and predict the cost and volume of work.

  • Load shedding: Drop excess load that would risk service health if ingested.

  • Quotas: Limit consumer load to pre-requested amounts and provision to reflect those limits.

  • QoS/criticality: Prioritize work and drop load that is less important.

  • Autoscaling: Automatically react to changes in load and provision additional servers.

  • Caching: Increase efficiency, and thereby maximum load, by degrading to staleness.

  • Capacity planning: Forecast demand and model service cost.

Cost modeling involves measuring resource usage and predicting both the cost and volume of work. Services should measure provisioned resource dimensions like CPU, RAM, stored bytes, and disk spindles, and indirect measures like queries per second (QPS), concurrent in-flight requests, and throughput. These measurements should be made on historical production data for realistic distributions, on synthetic load tests to evaluate changes not yet deployed, and on live servers to reflect real-time conditions. From these measurements, you can predict the cost of a request, the volume of requests, and the resources necessary to serve a given load. For example, you can observe that over the last 90 days the 99th percentile usage was 100 CPUs serving 10k QPS of traffic, and the traffic is growing at a rate of 10% per month.
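
For instance, the example above works out to roughly 10 ms of CPU time per request (100 CPUs / 10,000 QPS), and compounding the growth rate yields a provisioning forecast. A small sketch of the arithmetic:

    // Back-of-the-envelope capacity math for the example above: 100 CPUs at
    // 10k QPS is 10 ms of CPU per request, and 10% monthly growth compounds.
    package costmodel

    import "math"

    func cpusNeeded(currentCPUs, monthlyGrowth float64, months int) float64 {
        return currentCPUs * math.Pow(1+monthlyGrowth, float64(months))
    }

    // cpusNeeded(100, 0.10, 12) ≈ 314 CPUs to serve the same workload in a
    // year, before adding headroom for failures and traffic peaks.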

Load shedding involves dropping excess load that would risk service health if accepted, and is thus critical for defense against overload and cascading failure. There are two forms of load shedding:

  • Request-based load shedding is where the server, on receipt of a request, evaluates current server health and immediately decides to accept or reject the request. If the request is rejected, an UNAVAILABLE error is returned so the client can retry the request at another server.

  • Queue-based load shedding is where the server puts incoming requests into a queue, rejecting requests only if the queue is too large or the server is unlikely to process the request before its deadline. Queue scheduling ensures that request processing starts only if it wouldn’t risk service health. Queue scheduling should use a last-in-first-out (LIFO) policy to maximize useful work completed by preventing a failure mode where work dequeued via first-in-first-out (FIFO) has insufficient time remaining before the request deadline. This may seem surprising because LIFO scheduling adds disproportionate queue delay to requests arriving earlier, which increases the chance those requests exceed deadlines and thus may appear to increase overall error rates. When overloaded, however, the lowest error rate will occur when resources are maximally used to produce successful responses, and LIFO scheduling minimizes resources wasted on requests that will exceed deadlines anyway.

Both load-shedding methods ensure that accepting a request does not cause the server to become unhealthy. Request-based load shedding has better end-to-end latency when the time to retry across servers is faster than an individual server’s queue delay. Queue-based load shedding is more efficient because it does not resend requests, saving data copying and serialization costs. At Google, most servers use request-based load shedding for its simplicity, whereas the most performance-sensitive servers use queue-based load shedding and scheduling for its efficiency.
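
A minimal sketch of request-based load shedding, where the in-flight limit of 200 is a hypothetical per-server tuning value derived from cost modeling:

    // Sketch of request-based load shedding: accept a request only if the
    // number of in-flight requests is below the server's limit; otherwise
    // reject immediately with UNAVAILABLE so the client retries elsewhere.
    package loadshed

    import (
        "sync/atomic"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    var inflight int64

    const maxInflight = 200 // hypothetical per-server capacity

    func handle(doWork func() error) error {
        if atomic.AddInt64(&inflight, 1) > maxInflight {
            atomic.AddInt64(&inflight, -1)
            return status.Error(codes.Unavailable, "server overloaded, retry elsewhere")
        }
        defer atomic.AddInt64(&inflight, -1)
        return doWork()
    }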

Quotas are limits on service usage established ahead of time between a consumer and the service, limits that enable the service to provision for a specific amount of load, to set usage expectations a consumer can depend on, and to isolate consumers from each other. Quotas should be defined at all layers of the stack including the application and the cloud service provider. That is, as an application developer you are subject to cloud service provider quotas, and you should define and enforce quotas for your own service and its consumers. A quota is broken into two components: a commitment, which is the amount a consumer is guaranteed to be able to use, and a ceiling, which is the largest value a commitment could be adjusted to without negotiation with the service. A consumer’s ability to increase the commitment is not always guaranteed and could depend on current commitments across consumers and the service’s provisioning. The separation between commitment and ceiling enables a service to safely oversubscribe resources and to reduce the frequency of quota negotiations. A service with hard limits will reject all requests over the commitment, whereas a service with soft limits can accept requests over the commitment so long as it doesn’t sacrifice isolation between consumers. Prefer soft limits to give consumers headroom for traffic fluctuations and to reduce over-quota error rates when the service has spare capacity.
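
A sketch of per-consumer enforcement of a commitment and a ceiling using token buckets from golang.org/x/time/rate; the rates are hypothetical, and soft-limit behavior is approximated by admitting over-commitment traffic only when the service reports spare capacity:

    // Sketch of quota enforcement with a commitment and a ceiling per consumer.
    package quota

    import "golang.org/x/time/rate"

    type ConsumerQuota struct {
        commitment *rate.Limiter // guaranteed rate, e.g., 100 QPS
        ceiling    *rate.Limiter // negotiable upper bound, e.g., 500 QPS
    }

    func NewConsumerQuota(commitQPS, ceilingQPS float64) *ConsumerQuota {
        return &ConsumerQuota{
            commitment: rate.NewLimiter(rate.Limit(commitQPS), int(commitQPS)),
            ceiling:    rate.NewLimiter(rate.Limit(ceilingQPS), int(ceilingQPS)),
        }
    }

    // Allow admits requests within the commitment unconditionally, and requests
    // between the commitment and the ceiling only when there is spare capacity
    // (soft limit). This is an approximation; a real system would also track
    // isolation between consumers.
    func (q *ConsumerQuota) Allow(serviceHasSpareCapacity bool) bool {
        if q.commitment.Allow() {
            return true
        }
        return serviceHasSpareCapacity && q.ceiling.Allow()
    }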

Quality of service (QoS)/criticality are techniques for assigning priorities to different streams or individual requests so that upon resource exhaustion the requests are dropped in priority order. For example, the service can prioritize user-facing requests over traffic from batch jobs that can be delayed without issue. Table 4-3 shows common criticality levels for requests. Request-based load shedding and quota enforcement will use different utilization thresholds at which rejections occur to create prioritization. For example, CRITICAL_PLUS may be dropped at 100% utilization whereas SHEDDABLE may be dropped at 50% utilization. Queue-based load shedding and quota enforcement will simply schedule work in criticality order.

Table 4-3. Common criticality levels for requests
Criticality    | Description                                                                                            | Use case
CRITICAL_PLUS  | Most important traffic class; service provisioned for 100% of traffic; expected to be <50% of traffic | Most important user-facing requests
CRITICAL       | Service provisioned for 100% of traffic; shedding expected only in outage                             | Default for user-facing requests
SHEDDABLE_PLUS | Partial unavailability expected                                                                        | Non-user-facing requests like async work and data processing pipelines
SHEDDABLE      | Frequent partial and occasional full unavailability expected                                          | Best-effort traffic where progress is not critical
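
For example, request-based shedding might map each criticality to the utilization at which it starts being rejected. Only the 100% and 50% endpoints come from the text above; the intermediate thresholds are illustrative:

    // Illustrative mapping from criticality to the utilization above which
    // request-based load shedding starts rejecting that class.
    package criticality

    type Criticality int

    const (
        Sheddable Criticality = iota
        SheddablePlus
        Critical
        CriticalPlus
    )

    func shedThreshold(c Criticality) float64 {
        switch c {
        case CriticalPlus:
            return 1.00 // only shed when the server is fully saturated
        case Critical:
            return 0.90 // hypothetical tuning point
        case SheddablePlus:
            return 0.70 // hypothetical tuning point
        default: // Sheddable
            return 0.50
        }
    }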

Within an organization, multiple services are all unlikely to peak in load at the same time. Within a service, multiple components are also all unlikely to peak in load at the same time. With fixed service and component provisioning, one service or component can be overloaded while another is at relatively low utilization. Leverage autoscaling to dynamically scale down the underutilized servers, freeing up resources which can then be used to scale up the overloaded servers. Autoscaling provisions capacity where it’s needed to reduce error rates without increasing total resource cost.

Through load shedding, quotas, criticality, and autoscaling, the service should remain healthy during overload and gracefully degrade to serve as much traffic as possible in priority order. However, this will still yield errors for the excess load beyond provisioned capacity. To do better, you’ll need to increase efficiency or add additional resources to the service.

Caching improves service efficiency by looking up a precomputed value from optimized storage, which can enable the service to handle more load by trading for eventual consistency and data staleness. As resource utilization increases, the service can dynamically adjust caching parameters like expiration times so the cache is more effective, thereby increasing request capacity and reducing error rates due to overload. For example, when a cached item reaches its expiration time, the service can attempt to look up a fresher value, replacing it with the new data if successful and falling back to the stale data if the system is overloaded. Note this use case is different from the capacity caches previously discussed. While you should leverage caching to mitigate unexpected traffic, you should not rely on caching in order to serve expected traffic.
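
A sketch of this degrade-to-staleness behavior; the cache structure, TTL, and overload error are hypothetical:

    // Sketch of degrading to staleness: try to refresh an expired cache entry,
    // but fall back to the stale value if the backend is overloaded.
    package stalecache

    import (
        "errors"
        "time"
    )

    type entry struct {
        value     string
        expiresAt time.Time
    }

    var errOverloaded = errors.New("backend overloaded") // assumed to be returned by fetch

    func get(key string, cache map[string]entry, fetch func(string) (string, error)) (string, error) {
        e, ok := cache[key]
        if ok && time.Now().Before(e.expiresAt) {
            return e.value, nil // fresh hit
        }
        v, err := fetch(key)
        if err == nil {
            cache[key] = entry{value: v, expiresAt: time.Now().Add(5 * time.Minute)}
            return v, nil
        }
        if ok && errors.Is(err, errOverloaded) {
            return e.value, nil // degrade to the stale value rather than failing
        }
        return "", err
    }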

Finally, ensure the service has capacity for foreseeable demand through capacity planning. When forecasting resource needs, it’s important to account for lead time for acquiring resources and for variability in demand. Capacity planning can be automated end to end by modeling the relationship between services and components (e.g., service A depends on service B), modeling the high-level constraints on a service (e.g., must have presence in these regions with N + 2 redundancy), observing service usage indicators (e.g., QPS over time and CPU utilization), and feeding all of this data into an optimizer that reshapes production systems into optimal configurations and provides inputs for machine purchasing.

For more on resource exhaustion, see “Handling Overload”.

Thundering Herds and Hotspots

A thundering herd is a sudden and rapid increase in traffic typically initiated by an event. For example, if an app requires users to have the latest version of the app for it to function, then on release of a new version, all users are forced to immediately and simultaneously upgrade. If the app has millions of users, the application download system may receive millions of requests within seconds of the version release. Figure 4-6 shows such an example of a thundering herd of requests. Another example is the aftermath of a mobile carrier outage: many phone apps retry requests as soon as the network recovers, leading to a thundering herd. Thundering herds are particularly dangerous because they can overload a system faster than mitigations like caching and autoscaling can react.

Figure 4-6. Service QPS when an app version upgrade yielded a thundering herd.

A hotspot is where a subset of servers receive a disproportionate amount of load and subsequently become overloaded, even though the service as a whole has spare capacity. Hotspots frequently occur during thundering herds. For example, an app store serving application downloads may experience a hotspot when the previously described version release sends millions of requests for a specific resource identifier, potentially causing the canonical server storing the new version’s data to receive all of the new traffic. Figure 4-7 shows a hotspot in a service architecture.

Figure 4-7. Traffic hotspot to a single database server storing data for a popular resource.

Gating is a technique for deduping equivalent operations and performing work once to satisfy multiple requests. Prior to doing an operation, the server can first check to see if there are equivalent concurrent operations in flight and, if so, combine them into a batch. The server can then execute the operation once and return the result to each request as if the operation ran multiple times. Gating ensures traffic to backends is bounded by the number of servers and the latency of an operation, rather than being proportional to incoming traffic, ensuring that thundering herds and hotspots cannot overload backends. For example, if a backend request takes 100ms to complete, then a single server will send at most 10 QPS of traffic to the backend regardless of whether that server receives 10 QPS of requests or 10k QPS of requests. Figure 4-8 shows the reduction in load on a distributed system visually.

Figure 4-8. Impact of gating on distributed system load.

Gating can be implemented two ways. To achieve eventual consistency with lower average latency, requests can be grouped into the batch that is currently being processed. To achieve strong consistency with higher average latency, requests can be grouped into batches that are only started after the batch currently being processed is completed. Figures 4-9 and 4-10 show how the two approaches work. Under heavy load both approaches reduce traffic equivalently, so the choice between them is typically based on consistency and latency requirements.

Figure 4-9. Reading a database with eventually consistent gating.
Figure 4-10. Reading a database with strongly consistent gating.
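
The eventually consistent form of gating maps directly onto request deduplication libraries such as golang.org/x/sync/singleflight. A minimal sketch, where readRow stands in for an expensive backend read:

    // Sketch of eventually consistent gating: concurrent reads for the same
    // key join the in-flight call and share its result, so backend load is
    // bounded by the number of distinct keys and the latency of each call.
    package gating

    import "golang.org/x/sync/singleflight"

    var group singleflight.Group

    // readRow is a hypothetical expensive backend read.
    func readRow(key string) (string, error) { return "row-for-" + key, nil }

    func gatedRead(key string) (string, error) {
        v, err, _ := group.Do(key, func() (interface{}, error) {
            return readRow(key)
        })
        if err != nil {
            return "", err
        }
        return v.(string), nil
    }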

Hierarchical gating and caching is a technique where gating and caching are applied at multiple points to further dedupe load reaching a backend. This technique is sometimes referred to as L1 and L2 caching as it’s similar to how CPUs cache data from RAM. Figure 4-11 shows hierarchical caching in action where a server applies gating and caching prior to sending requests, and the backend server applies gating and caching upon receiving requests. The L1 gating dedupes concurrent requests and the L1 caching ensures keys in high demand don’t hotspot the backend. The L2 gating further dedupes concurrent requests and the L2 caching reduces overall load on storage.

Figure 4-11. Hierarchical gating and caching.

Add jitter, or random delays, to cache expiries to ensure cache expiration doesn’t become the cause of a thundering herd. For example, if 10k frontend servers each cache a particular key and they all expire the data for that key at the same time, then all 10k servers will send a request to the backend at the same time. Coordinated expiry frequently occurs during thundering herds, but also when the expiration time is synchronized by a backend. For example, coordinated expiry can occur if the expiration time is based on a fixed offset from when the backend first cached a storage read. By adding random delays, servers will send the backend requests at different times, thereby spreading load over time.
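
A minimal sketch of jittered expiry, using an extra random delay of up to 10% of the TTL (the percentage is a hypothetical tuning value):

    // Sketch of adding jitter to cache expirations so entries cached at the
    // same moment do not all expire, and re-fetch, at the same time.
    package ttljitter

    import (
        "math/rand"
        "time"
    )

    func expiry(baseTTL time.Duration) time.Time {
        jitter := time.Duration(rand.Int63n(int64(baseTTL / 10))) // up to +10% of the TTL
        return time.Now().Add(baseTTL + jitter)
    }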

Replication of data can also help by leveraging additional servers at the cost of eventual consistency. Static replication of all data to multiple servers will help with thundering herds and with hotspots, but this can increase the cost of storage by two or more times. Dynamic replication where data is selectively replicated based on demand will help with long-running hotspots with much lower increases in storage cost. However, dynamic replication will not help with thundering herds because the system will become overloaded faster than demand can be measured and data replicated.

Integrity, Backup, Recovery

It’s inevitable that a software bug will be released, a hard drive will fail, and a defective CPU will corrupt data. Services must detect, mitigate, and recover from data loss and data corruption. You should set SLOs for recovery, use probers and checksums to detect data loss and corruption, and have backups and restore procedures to recover.

Services should have two SLOs for recovery from failure. The recovery point objective (RPO) is the point in time after which data may be permanently lost upon a failure. The recovery time objective (RTO) is the duration of time to restore the service after a failure, a period during which the service may be unavailable. For example, let’s say the failure is an accidental deletion of the service’s database. If your service sets an RPO of four hours, then any data written to the database in the last four hours may be lost. If your service sets an RTO of one hour then the service may be unavailable for one hour while the system recovers from the deletion. In order to meet these objectives, the service may create backups every four hours and regularly test recovery from backup to ensure it can be done within one hour. Figure 4-12 shows RPO and RTO visually on a timeline. Similar to other SLOs, the RPO and RTO should be set based on the needs of the users and business.

Figure 4-12. Recovery point objective (RPO) and recovery time objective (RTO).

RPO is typically measured by reporting the age of the freshest backup over time, yielding a measure similar to freshness SLOs. RTO is ideally measured by periodically executing an automated restore, yielding a measure similar to those generated by probers. For simplicity, RTO may be evaluated for just storage recovery like a database restore, or for critical services it can be the end-to-end recovery of the service. RPO and RTO are also tested as part of Disaster Recovery Testing (DiRT) and with techniques like Chaos Engineering.

Deploy probers to detect data loss or corruption by automatically testing that data can be read back after it is written, and that data written long ago can still be read. Run these probers in development and staging environments to detect bugs before they can impact production workloads. Run these probers in production to reduce the time to detect an outage and thereby reduce the duration of outages. Finally, use burn-in load tests to exercise new and existing hardware to detect sources of corruption before production workloads are scheduled on the machines.

Generate and verify checksums to validate end-to-end integrity of data. Checksums like CRC32C can detect unintended changes to data like corruption from a bad CPU. Generate checksums as close to the source of data as possible and verify checksums as close to the use of data as possible in order to detect corruption end to end. For example, generating a checksum in the client before sending a write to a server and verifying the checksum in the client after a read from a server will ensure you detect any corruption that occurs in transit, during processing at the server, or while being stored on disk. Generate overlapping checksums at multiple layers for defense in depth. For example, checksums can be separately generated and verified for data objects at the client, network traffic between client and server, and database blocks on disk. Upon detecting corruption, apply mitigating actions like retrying requests that were corrupted and overwriting corrupted data from a healthy data replica.
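
A minimal sketch of generating and verifying CRC32C checksums with Go’s hash/crc32 package; where the checksum is generated and verified (client, server, storage) is the design decision described above:

    // Sketch of end-to-end integrity checking with CRC32C (the Castagnoli
    // polynomial): the writer attaches a checksum generated as close to the
    // data source as possible, and the reader verifies it as close to the
    // point of use as possible.
    package integrity

    import (
        "fmt"
        "hash/crc32"
    )

    var castagnoli = crc32.MakeTable(crc32.Castagnoli)

    func checksum(data []byte) uint32 {
        return crc32.Checksum(data, castagnoli)
    }

    func verify(data []byte, want uint32) error {
        if got := checksum(data); got != want {
            return fmt.Errorf("corruption detected: checksum %08x, want %08x", got, want)
        }
        return nil
    }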

Periodically snapshot and back up data to alternative storage systems and develop reliable restore procedures to recover the service from a backup. It’s typical to back up data every hour or every four hours. Backups are then retained for durations and granularities that reflect business needs. Hourly backups are stored in high-performance storage like the Colossus distributed filesystem or Blobstore object store. Daily backups are stored in immutable media like firmware-locked hard drives for faster recovery, or tape drives for better storage cost. Restoration from firmware-locked hard drives can further reduce recovery time by restoring individual data items rather than entire datasets. Backup systems must be monitored for healthy operation and to verify the RPO.

Leverage point-in-time recovery techniques for failures at the application layer to reduce the recovery point to optimal levels for situations where the database and storage layers aren’t the source of data loss or corruption. Point-in-time recovery involves retaining the database’s underlying transaction logs and snapshot files so that read queries can be executed at arbitrary points in the database’s history. These queries can be used to build a detailed history for a data object so it can be restored to the point right before it was corrupted or lost. This helps recover from software bugs or other failures at the application layer significantly faster than backup restoration.

Develop an end-to-end recovery process including loading backups, merging backup data with live production state, and if necessary reconfiguring servers so the service can return to normal operation. The end-to-end process must be regularly tested to verify the backups and the recovery process will actually work when the time comes.

For more on data integrity, see “Data Integrity” by Raymond Blum and Rhandeev Singh in Site Reliability Engineering.

Recap

In this chapter, we’ve covered how to defend against common failure modes:

Bad changes
  • Supervise changes by monitoring with metrics and probers.

  • Progressively roll out changes by creating independent and small failure domains and then changing those failure domains one at a time.

  • Automatically apply safe and tested mitigations, like change rollback, once an issue has been detected.

  • Deploy isolated environments for development (dev), staging, and production (prod).

  • Use infrastructure-as-code (IaC) and multiparty authorization (MPA) to safely make changes to production.

Error handling
  • Use soft dependencies and fail-safe alternatives.

  • Return and respond to standardized error codes.

  • Overload error codes should differentiate between server overload, service overload, and quota exhaustion.

  • If retrying a request fails multiple times due to local server overload, the propagated error should be a service overload to prevent retry amplification.

  • Clients must slow down sending of requests when the service is overloaded.

  • Attach deadlines or timeouts to requests and check if these have been exceeded at multiple layers of the system.

Resource exhaustion
  • Resource exhaustion occurs when the load exceeds the resources provisioned.

  • Cascading failure is where the overload of a single component causes the load to shift and overload another component until all have failed.

  • Capacity caches are those where the service depends on a cache’s efficiency gains to prevent resource exhaustion.

  • Defend against resource exhaustion through cost modeling, load shedding, quotas, quality of service (QoS)/criticality, autoscaling, caching, and capacity planning.

Thundering herds and hotspots
  • A thundering herd is a sudden and rapid increase in traffic typically initiated by an event. A hotspot is where a subset of servers receive a disproportionate amount of load.

  • Gating is a technique for deduping equivalent operations and performing work once to satisfy multiple requests.

  • Hierarchical gating and caching is a technique where gating and caching are applied at multiple points to further dedupe load reaching a backend.

  • Add jitter or random delays to cache expiries to ensure cache expiration doesn’t become the cause of a thundering herd.

Integrity, backup, and recovery
  • Recovery point objective (RPO) is the window in which data may be permanently lost upon a failure. Recovery time objective (RTO) is the window of time the service may remain unavailable after a failure.

  • Deploy probers to detect data loss or corruption.

  • Generate and verify checksums to validate end-to-end integrity of data.

  • Periodically snapshot and back up data to alternative storage systems.

  • Leverage point-in-time recovery techniques for failures at the application layer.

  • Develop and test an end-to-end recovery process.

1 A sudden and rapid increase in traffic typically initiated by an event.
