Site Reliability Engineering

Errata for Site Reliability Engineering



The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Color Key: Serious Technical Mistake | Minor Technical Mistake | Language or formatting error | Typo | Question | Note | Update



Version | Location | Description | Submitted By | Date Submitted
Other Digital Version | Appendix E - Launch Coordination Checklist
At the bottom of the page

https://landing.google.com/sre/sre-book/chapters/launch-checklist/ Shouldn't "External dependencies" and "Schedule and rollout planning" have the same indentation level as "Growth issues"?

Maksim Fedoseev  Mar 03, 2019 
Other Digital Version | Chapter 22 - Addressing Cascading Failures
In quite a lot of places across this chapter.

https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/ There are quite a lot of whitespace characters missing in this chapter. Namely:

1 & 2. A vicious cycle can occur in this scenario:less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU.This is known colloquially as the “GC death spiral.” One after ":" and another one in "...of CPU.This...".

3. Serve lower-quality, cheaper-to-compute results to the user. Your strategy here will be service-specific.See Load Shedding and Graceful Degra After "be service-specific".

4. Servers should protect themselves from becoming overloaded and crashing.When overloaded at either the frontend or backend layers, fail early and cheaply. For details, see Load Shedding and Graceful Degradation. After "and crashing".

5. Note that because rate limiting often doesn’t take overall service health into account, it may not be able to stop a failure that has already begun.Simple rate-limiting implementations are also likely to leave capacity unused. Rate limiting can be implemented in a number of places: After "has already begun".

6 & 7. Good capacity planning can reduce the probability that a cascading failure will occur.Capacity planning should be coupled with performance testing to determine the load at which the service will fail.For instance, if every cluster’s breaking point is 5,000 QPS, the load is evenly spread across clusters,109 and the service’s peak load is 19,000 QPS, then approximately six clusters are needed to run the service at N + 2. One after "failure will occur" and the other one after "service will fail".

8. Most thread-per-request servers use a queue in front of a thread pool to handle requests.Requests come in, they sit on a queue, and then threads pick requests off the queue and perform the actual work (whatever actions are required by the server). Usually, if the queue is full, the server will reject new requests. After "handle requests".

9. More sophisticated approaches include identifying clients to be more selective about what work is dropped, or picking requests that are more important and prioritizing.Such strategies are more likely to be needed for shared services. After "are more important and prioritizing".

10 & 11. Suppose the code in the frontend that talks to the backend implements retries naively.It retries after encountering a failure and caps the number of backend RPCs per logical request to 10.Consider this code in the frontend, using gRPC in Go: After "implements retries naively" and "RPCs per logical request to 10".

12. Think about the service holistically and decide if you really need to perform retries at a given level. In particular, avoid amplifying retries by issuing retries at multiple levels: a single request at the highest layer may produce a number of attempts as large as the product of the number of attempts at each layer to the lowest layer. If the database can’t service requests because it’s overloaded, and the backend, frontend, and JavaScript layers all issue 3 retries (4 attempts), then a single user action may create 64 attempts (4^3) on the database.This behavior is undesirable when the database is returning those errors because it’s overloaded. After "on the database".

13. Suppose an RPC has a 10-second deadline, as set by the client.The server is very overloaded, and as a result, it takes 11 seconds to move from a queue to a thread pool. At this point, the client has already given up on the request. Under most circumstances, it would be unwise for the server to attempt to handle this request, because it would be doing work for which no credit will be granted—the client doesn’t care what work the server does after the deadline has passed, because it’s given up on the request already. After "as set by the client".

14. Suppose that the frontend from the preceding example consists of 10 servers, each with 100 worker threads. This means that the frontend has a total of 1,000 threads of capacity.During usual operation, the frontends perform 1,000 QPS and requests complete in 100 ms. This means that the frontends usually have 100 worker threads occupied out of the 1,000 configured worker threads (1,000 QPS * 0.1 seconds). After "1,000 threads of capacity".

15. Suppose an event causes 5% of the requests to never complete.This could be the result of the unavailability of some Bigtable row ranges, which renders the requests corresponding to that Bigtable keyspace unservable. As a result, 5% of the requests hit the deadline, while the remaining 95% of the requests take the usual 100 ms. After "5% of the requests to never complete".

16. Employ general cascading failure prevention techniques.In particular, servers should reject requests when they’re overloaded or enter degraded modes, and testing should be performed to see how the service behaves after events such as a large restart. After "failure prevention techniques".

17. Understand how large clients use your service.For example, you want to know if clients: After "clients use your service".

18. If servers are somehow wedged and not making progress, restarting them may help.Try restarting servers when: After "them may help".
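The arithmetic quoted in items 6 & 7, 12, and 14 can be cross-checked with a minimal Go sketch; the figures are taken directly from the quoted passages, and the variable names are illustrative only:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // Items 6 & 7: with a per-cluster breaking point of 5,000 QPS and a peak
        // load of 19,000 QPS, ceil(19000/5000) = 4 clusters carry the load, so
        // running at N + 2 needs approximately 6 clusters.
        clusters := int(math.Ceil(19000.0/5000.0)) + 2
        fmt.Println("clusters needed at N + 2:", clusters) // 6

        // Item 12: 3 retries (4 attempts) at each of the JavaScript, frontend,
        // and backend layers multiply to 4^3 = 64 attempts on the database.
        dbAttempts := int(math.Pow(4, 3))
        fmt.Println("worst-case database attempts:", dbAttempts) // 64

        // Item 14: 1,000 QPS with requests taking 0.1 seconds keeps about 100
        // of the 1,000 configured worker threads occupied.
        busyThreads := 1000 * 0.1
        fmt.Println("occupied worker threads:", busyThreads) // 100
    }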

Maksim Fedoseev  Mar 03, 2019 
Other Digital Version | Chapter 22 - Addressing Cascading Failures
In a section about retries (named "Retries")

https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/#retires There is a snippet of code written in Go which has a few issues:

    func exampleRpcCall(client pb.ExampleClient, request pb.Request) *pb.Response {
        // Set RPC timeout to 5 seconds.
        opts := grpc.WithTimeout(5 * time.Second)

        // Try up to 10 times to make the RPC call.
        attempts := 10
        for attempts > 0 {
            conn, err := grpc.Dial(*serverAddr, opts...)
            if err != nil {
                // Something went wrong in setting up the connection. Try again.
                attempts--
                continue
            }
            defer conn.Close()

            // Create a client stub and make the RPC call.
            client := pb.NewBackendClient(conn)
            response, err := client.MakeRequest(context.Background, request)
            if err != nil {
                // Something went wrong in making the call. Try again.
                attempts--
                continue
            }

            return response
        }

        grpclog.Fatalf("ran out of attempts")
    }

First, there is no point in passing client as an argument to the function, since it is not used within the function and gets shadowed in the for loop. Second, although grpclog.Fatalf calls os.Exit, which stops execution of the program, the compiler isn't smart enough to infer this, so the exampleRpcCall function will cause a compilation error without a return statement at the end (see the fixed code snippet below). Third, the name of the function: calling it exampleRPCCall would have been more idiomatic. Fourth, grpc.WithTimeout returns a single option, not a slice of options, so there is no point in using '...' in the grpc.Dial call; also, since it is a single option, it would have been better to call it 'opt' (without the 's'). Fifth, in the for loop a connection is established on each try, which probably isn't what the author intended; most likely each connection was meant to be closed after a failure, but its closure is deferred until the end of the function. Sixth (and last), in gRPC the request is usually a pointer. Proposed fixed snippet:

    func exampleRPCCall(request *pb.Request) *pb.Response {
        // Set RPC timeout to 5 seconds.
        opt := grpc.WithTimeout(5 * time.Second)

        // Try up to 10 times to make the RPC call.
        attempts := 10
        for attempts > 0 {
            conn, err := grpc.Dial(*serverAddr, opt)
            if err != nil {
                // Something went wrong in setting up the connection. Try again.
                attempts--
                continue
            }

            // Create a client stub and make the RPC call.
            client := pb.NewBackendClient(conn)
            response, err := client.MakeRequest(context.Background, request)
            if err != nil {
                // Something went wrong in making the call. Try again.
                attempts--
                conn.Close()
                continue
            }

            return response
        }

        grpclog.Fatalf("ran out of attempts")
        return nil
    }
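Beyond the compilation and connection-handling fixes above, the chapter also recommends using randomized exponential backoff when scheduling retries rather than retrying immediately. A minimal sketch of how the proposed fix could incorporate that is below; the backoff constants and the function name are illustrative, pb, serverAddr, and grpclog are assumed from the book's snippet, and the context, math/rand, time, and grpc packages would need to be imported:

    func exampleRPCCallWithBackoff(request *pb.Request) *pb.Response {
        // Set RPC timeout to 5 seconds.
        opt := grpc.WithTimeout(5 * time.Second)
        backoff := 100 * time.Millisecond // illustrative initial backoff

        // Try up to 10 times to make the RPC call.
        for attempts := 10; attempts > 0; attempts-- {
            conn, err := grpc.Dial(*serverAddr, opt)
            if err == nil {
                // Create a client stub and make the RPC call.
                client := pb.NewBackendClient(conn)
                response, err := client.MakeRequest(context.Background(), request)
                conn.Close()
                if err == nil {
                    return response
                }
            }
            // Something went wrong; sleep with randomized exponential backoff
            // before the next attempt instead of retrying immediately.
            time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
            backoff *= 2
        }

        grpclog.Fatalf("ran out of attempts")
        return nil
    }

Closing the connection inside the loop, rather than deferring it, avoids accumulating open connections across attempts (the fifth issue noted above).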

Maksim Fedoseev  Mar 03, 2019 
Other Digital Version | Chapter 10
In the section Maintaining the Configuration, the first bullet point

The first bullet point is missing a closing parenthesis: "(e.g., our HTTP response code on the http_responses variable". I am reading this in the free online version at https://landing.google.com/sre/sre-book/chapters/practical-alerting/, so I am unable to give a page number.

Matt Halverson  May 05, 2019 
Other Digital Version | x
At reference Jai13

At the online SRE book page https://landing.google.com/sre/sre-book/chapters/production-environment/#id-N1KFQTnFxhW, the link for reference Jai13 is broken. The link https://static.googleusercontent.com/media/research.google.com/en//pubs/pub41761.pdf should be substituted with https://ai.google/research/pubs/pub41761

Radu Prekup  Jul 01, 2019 
Other Digital Version | x
On references page

At the SRE Book page https://landing.google.com/sre/sre-book/chapters/bibliography/#Sch15, the link https://ramcloud.stanford.edu/raft.pdf given for [Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" is outdated. The correct link is now: https://raft.github.io/raft.pdf

Radu Prekup  Jul 01, 2019 
Other Digital Version | x
https://landing.google.com/sre/sre-book/chapters/bibliography/#Mor12a

If the correct ordering of references is first by author and then by publication date, then Dea13, Dea04, Dea07 should be ordered: Dea04, Dea07, Dea13.

Radu Prekup  Jul 01, 2019