Errata

Errata for Site Reliability Engineering

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
Other Digital Version	-1 first bullet point	On page https://landing.google.com/sre/book/chapters/practical-alerting.html: "The sheer number of components being analyze" "Analyze" should be "analyzed".	Nick Heiner	Jan 02, 2018
Other Digital Version	-1 Footnote 41	On https://landing.google.com/sre/book/chapters/part3.html#id-dA2uaIyFqF4, the link to the US Digital Service page is broken. The new links you should use are either https://obamawhitehouse.archives.gov/participate/united-states-digital-service or https://www.usds.gov/. Note from the Author or Editor: Replace https://www.whitehouse.gov/digital/united-states-digital-service with https://www.usds.gov.	Nick Heiner	Jan 02, 2018	Oct 19, 2018
	Page -1 https://sre.google/sre-book/bibliography/	[Kri12]: Author's name is printed as "K. Krishan", should be "K. Krishnan". Note from the Author or Editor: "K. Krishan" should be "K. Krishnan" in the bibliography.	Michael Farrell	Jan 17, 2022
Other Digital Version	? Introduction, Google's Approach to Service Management, 4th paragraph	In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "is" and "useful" in the following sentence: […]in addition had a set of technical skills that isuseful to SRE[…] Note from the Author or Editor: Fixed on site.	Raphaël Doursenaud	Feb 02, 2017	Aug 04, 2017
Other Digital Version	? Introduction, Pursuing Maximum Change Velocity, 6th paragraph	In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "both" and "development" in the sentence […]and an occurrence that bothdevelopment and SRE teams[…] Note from the Author or Editor: Fixed on site.	Raphaël Doursenaud	Feb 02, 2017	Aug 04, 2017
Other Digital Version	? Introduction, Monitoring, 1st paragraph	In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "monitoring" and "strategy" in the sentence […]monitoringstrategy should be constructed thoughtfully[…] and between "common" and "approach" in […]A classic and commonapproach to monitoring[…] and between "an" and "effective" in […]this type of email alerting is not aneffective solution[…] and between "email" and "and" in […]a human to read an emailand decide[…] and between "in" and "response" in […]action needs to be taken inresponse is fundamentally flawed.[…] and between "a" and "human" in […]Monitoring should never require ahuman to interpret[…] and between "software" and "should" in […]Instead, softwareshould do the interpreting[…] and between "when" and "they" in […]notified only whenthey need to take action[…] Note from the Author or Editor: Fixed on site.	Raphaël Doursenaud	Feb 02, 2017	Aug 04, 2017
Printed	Page xxiii last line of page	In the acknowledgements list, "Sean Sechrest" is listed (a Google SRE). Sean's actual name is "Sean Sechrist". Note from the Author or Editor: Replaced Sechrest with Sechrist.	Doug Meil	Nov 27, 2017	Oct 19, 2018
Other Digital Version	Appendix E - Launch Coordination Checklist At the bottom of the page	https://landing.google.com/sre/sre-book/chapters/launch-checklist/ Shouldn't "External dependencies" and "Schedule and rollout planning" have same indentation level as "Growth issues"? Note from the Author or Editor: Yes, they should; ETA for fix to appear: 11/30/2020	Maksim Fedoseev	Mar 03, 2019	Nov 30, 2020
Other Digital Version	Chapter 22 - Addressing Cascading Failures In quite a lot of places across this chapter.	https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/ There are quite a lot of whitespace characters missing in this chapter. Namely: 1 & 2. A vicious cycle can occur in this scenario:less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU.This is known colloquially as the “GC death spiral.” One after ":" and another one in "...of CPU.This...". 3. Serve lower-quality, cheaper-to-compute results to the user. Your strategy here will be service-specific.See Load Shedding and Graceful Degra After "be service-specific". 4. Servers should protect themselves from becoming overloaded and crashing.When overloaded at either the frontend or backend layers, fail early and cheaply. For details, see Load Shedding and Graceful Degradation. After "and crashing". 5. Note that because rate limiting often doesn’t take overall service health into account, it may not be able to stop a failure that has already begun.Simple rate-limiting implementations are also likely to leave capacity unused. Rate limiting can be implemented in a number of places: After "has already begun". 6 & 7. Good capacity planning can reduce the probability that a cascading failure will occur.Capacity planning should be coupled with performance testing to determine the load at which the service will fail.For instance, if every cluster’s breaking point is 5,000 QPS, the load is evenly spread across clusters,109 and the service’s peak load is 19,000 QPS, then approximately six clusters are needed to run the service at N + 2. One after "failure will occur" and the other one after "service will fail". 8. 
Most thread-per-request servers use a queue in front of a thread pool to handle requests.Requests come in, they sit on a queue, and then threads pick requests off the queue and perform the actual work (whatever actions are required by the server). Usually, if the queue is full, the server will reject new requests. After "handle requests". 9. More sophisticated approaches include identifying clients to be more selective about what work is dropped, or picking requests that are more important and prioritizing.Such strategies are more likely to be needed for shared services. After "are more important and prioritizing". 10 & 11. Suppose the code in the frontend that talks to the backend implements retries naively.It retries after encountering a failure and caps the number of backend RPCs per logical request to 10.Consider this code in the frontend, using gRPC in Go: After "implements retries naively" and "RPCs per logical request to 10". 12. Think about the service holistically and decide if you really need to perform retries at a given level. In particular, avoid amplifying retries by issuing retries at multiple levels: a single request at the highest layer may produce a number of attempts as large as the product of the number of attempts at each layer to the lowest layer. If the database can’t service requests because it’s overloaded, and the backend, frontend, and JavaScript layers all issue 3 retries (4 attempts), then a single user action may create 64 attempts (4^3) on the database.This behavior is undesirable when the database is returning those errors because it’s overloaded. After "on the database". 13. Suppose an RPC has a 10-second deadline, as set by the client.The server is very overloaded, and as a result, it takes 11 seconds to move from a queue to a thread pool. At this point, the client has already given up on the request. 
Under most circumstances, it would be unwise for the server to attempt to handle this request, because it would be doing work for which no credit will be granted—the client doesn’t care what work the server does after the deadline has passed, because it’s given up on the request already. After "as set by the client". 14. Suppose that the frontend from the preceding example consists of 10 servers, each with 100 worker threads. This means that the frontend has a total of 1,000 threads of capacity.During usual operation, the frontends perform 1,000 QPS and requests complete in 100 ms. This means that the frontends usually have 100 worker threads occupied out of the 1,000 configured worker threads (1,000 QPS * 0.1 seconds). After "1,000 threads of capacity". 15. Suppose an event causes 5% of the requests to never complete.This could be the result of the unavailability of some Bigtable row ranges, which renders the requests corresponding to that Bigtable keyspace unservable. As a result, 5% of the requests hit the deadline, while the remaining 95% of the requests take the usual 100 ms. After "5% of the requests to never complete". 16. Employ general cascading failure prevention techniques.In particular, servers should reject requests when they’re overloaded or enter degraded modes, and testing should be performed to see how the service behaves after events such as a large restart. After "failure prevention techniques". 17. Understand how large clients use your service.For example, you want to know if clients: After "clients use your service". 18. If servers are somehow wedged and not making progress, restarting them may help.Try restarting servers when: After "them may help". Note from the Author or Editor: Yes, this is likely an HTML issue; Google has opened a web bug to address	Maksim Fedoseev	Mar 03, 2019
Other Digital Version	Chapter 10 In the section Maintaining the Configuration, the first bullet point	The first bullet point is missing a close parenthesis: "(e.g., our HTTP response code on the http_responses variable" I am reading this in the free online version at https://landing.google.com/sre/sre-book/chapters/practical-alerting/, so I am unable to give a page number.	Matt Halverson	May 05, 2019
Other Digital Version	x At reference Jai13	On the online SRE book page https://landing.google.com/sre/sre-book/chapters/production-environment/#id-N1KFQTnFxhW, the link for reference Jai13 is broken. The link https://static.googleusercontent.com/media/research.google.com/en//pubs/pub41761.pdf should be replaced with https://ai.google/research/pubs/pub41761 Note from the Author or Editor: Corrected; will go live 11/18/2020.	Radu Prekup	Jul 01, 2019
Other Digital Version	x On references page	At SRE Book page https://landing.google.com/sre/sre-book/chapters/bibliography/#Sch15 link https://ramcloud.stanford.edu/raft.pdf to [Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)". Correct link is now: https://raft.github.io/raft.pdf	Radu Prekup	Jul 01, 2019
Other Digital Version	x https://landing.google.com/sre/sre-book/chapters/bibliography/#Mor12a	If references are meant to be ordered first by author and then by publication date, then "Dea13, Dea04, Dea07" should be ordered "Dea04, Dea07, Dea13".	Radu Prekup	Jul 01, 2019
PDF	Page 5 3rd paragraph, last sentence	"...SRE can be broken down into two main categories." Note from the Author or Editor: "As a whole, SRE can be broken down two main categories." should be "As a whole, SREs can be broken down into two main categories.".	Anonymous	Apr 26, 2016	Jan 13, 2017
Printed	Page 21 3rd & 4th paragraphs	'HTML request' should be 'HTTP request'.	Anonymous	Apr 21, 2016	Jan 13, 2017
Printed, PDF, ePub	Page 44 last 2 paragraphs	The last non-parenthetical sentences of the last 2 paragraphs are the same: "Upper management will probably want a monthly or quarterly assessment, too." Note from the Author or Editor: The last two paragraphs on page 44 ("It's both unrealistic and undesirable ..." and "The rate at which SLOs are missed ...") are essentially duplicates in slightly different wording. The second paragraph should be removed.	Daniel Rogers	Sep 14, 2016	Jan 13, 2017
Printed	Page 61 footnote 2	"If 1% of your requests are 10x the average, " should be "If 1% of your requests are 50x the average, " as it's 5s (99-percentile) / 100ms (average). Note from the Author or Editor: Change "10x" to "50x".	Tatz Sekine	Jan 09, 2017	Jan 13, 2017
Printed	Page 77 3rd paragraph	"why each clusters took six or more weeks" "clusters" should be singular. Note from the Author or Editor: Replace "why each clusters took six or more weeks" with "why each cluster took six or more weeks".	Ai Vong	Feb 25, 2018	Oct 19, 2018
PDF	Page 79 Figure 7-2.	The figure doesn't correspond to the described process. According to the description, if the test fails the corresponding fix is called and then the test is re-tried. The figure doesn't represent the re-try but there is a direct arrow from the fix box to the next test. Note from the Author or Editor: Figure 7-2 should show another call between TestDNSMonitoringConfigExists and FixDNSMonitoringCreateConfig. Perhaps the existing arrow should be double-ended?	Eleni Siakagianni	Jun 05, 2017	Aug 04, 2017
Printed	Page 92 Figure 8-1	Image isn't formatted for black & white printing, so there's no differentiation between box colours.	Anonymous	Apr 21, 2016	Jan 13, 2017
Printed	Page 94 Penultimate paragraph	from Daisuke Yabuki The last sentence reads "our source-based filesystem [Kem11]." In Bibliography on page 507, [Kem11] refers to: C. Kemper, "Build in the Cloud: How the Build System works" However this blog article doesn't mention source-based filesystem. I believe the correct reference should be: N.York, "Build in the Cloud: Accessing Source Code"	Anonymous	Jul 25, 2017	Aug 04, 2017
Printed	Page 111 First paragraph	The SNMP abbreviation is expanded incorrectly. The book says: SNMP (Simple Networking Monitoring Protocol) but it should be: SNMP (Simple Network Management Protocol). For proof, see the source referenced in the book: https://technet.microsoft.com/en-us/library/cc776379(v=ws.10).aspx or Wikipedia: https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol Note from the Author or Editor: Replace "Simple Networking Monitoring Protocol" with "Simple Network Management Protocol".	Vladimir Rutsky	Nov 06, 2016	Jan 13, 2017
Printed	Page 115 Lemur inset	"nonmonotonically decreasing value" means the value is decreasing, but not monotonically. This is contradicted by the second half of the sentence, which states the meaning that the authors intended: that counter values only increase. Note from the Author or Editor: "nonmonotonically decreasing" should be replaced with "monotonically non-decreasing".	Cory Lueninghoener	Aug 21, 2016	Jan 13, 2017
Printed, PDF	Page 116 2nd paragraph	For the results of the task:http_requests:rate10m rule, the hostnames in the instance labels should be host0 through host4. Note from the Author or Editor: Replace host2 .. host5 with host1 .. host4.	Kazushige Hosokawa	May 20, 2017	Aug 04, 2017
Printed	Page 117 Borgmon code example	Missing ');' in second rule. 'jobwebserver' should be 'job=webserver' in third rule.	Chris Jones	Dec 15, 2016	Jan 13, 2017
Printed, PDF	Page 118 2nd and 3rd paragraphs in the Alerting section	"number of errors" should be "number of errors per second" as per the corresponding borgmon rule expression ({var=dc:http_errors:rate10m,job=webserver} > 1). Note from the Author or Editor: Append "per second" to "number of errors exceeds 1:".	Kazushige Hosokawa	Jan 09, 2017	Jan 13, 2017
Printed	Page 164 Paragraph 5	Should be a space in 'usable.Retain'.	Anonymous	Apr 21, 2016	Jan 13, 2017
Printed	Page 165 8th paragraph	The scenario starts on page 161 at 2pm on a Friday. The "Managed Incident" replay has Mary returning to work on the day after the incident, which would be a Saturday. Is that intentional? Note from the Author or Editor: Unintentional. Friday should be changed to Thursday.	Dave Smith	Jul 03, 2016	Jan 13, 2017
Printed	Page 172 1st paragraph	Lest the impression be left that no names of any kind appear in a postmortem, clarifying that "user" means "end-user" or "customer" might be appropriate. Note from the Author or Editor: s/user/end-user/	Dave Smith	Jul 03, 2016	Jan 13, 2017
Printed	Page 189 2nd paragraph	The first sentence of the paragraph says "a subset of servers is upgraded", but a few sentences later it says "the single modified server can be quickly reverted". Whether it is "a subset of servers" or a "single (modified) server", the number of servers should be the same in both sentences for the canary to be reverted properly. Note from the Author or Editor: Replace "the single modified server" with "the modified servers", and in the next sentence, replace "the upgraded server" with "the upgraded servers".	Tatz Sekine	Apr 05, 2017	Aug 04, 2017
Printed	Page 198 1st paragraph	In the explanation for acceptable flakiness calculation, there is "0.99 (the fraction of patches that can be rejected)", but it might be "0.99 (the fraction of patches that should be accepted)". Note from the Author or Editor: Replace "that can be rejected" with "that are accepted".	Tatz Sekine	Apr 07, 2017	Aug 04, 2017
PDF	Page 213 4th paragraph	Missing ')' somewhere in the following sentence: This component formulates a machine-readable request (a protocol buffer that can be understood by the Auxon Solver. Note from the Author or Editor: Replace 'request (a' with 'request: a'.	Takeo Sawada	Dec 26, 2016	Jan 13, 2017
Printed	Page 239 2nd paragraph	"could yield the following rounds" is misleading; I believe this should read "could yield the following shuffled_backends arrays for each round". I had to read this several times since the previous paragraph says that "we divide /client/ tasks into rounds", and here the elements are in fact backends. I think that the whole section is hard to read and could be rephrased in much simpler words; I'd be happy to provide suggestions on request. Note from the Author or Editor: "yield the following rounds:" should read "yield the following shuffled backends:", where backends is in code font.	Patrik Fimml	Aug 24, 2016	Jan 13, 2017
Printed	Page 250 5th (or, last) paragraph	This paragraph states no assumption about the value of the multiplier K, but the sentence "backends end up rejecting one request for each request they actually process" implies that the value of K is 2. In the next paragraph, on the next page, another sentence, "allowing roughly half of the backend resources to be consumed by ...", also implies that the value of K is 2. Only a few paragraphs later, however, does the text say: "We generally prefer the 2x multiplier". Note from the Author or Editor: Move paragraph "We've found adaptive ... latency penalties." to be the second-last paragraph in the section, immediately before "One additional consideration ... to be expensive.".	Tatz Sekine	Dec 23, 2016	Jan 13, 2017
PDF	Page 265 Near top	"This is the most important important exercise you should conduct in order to prevent server overload." should probably only have "important" once. Note from the Author or Editor: s/most important important exercise/most important exercise/	Omer Zach	Mar 01, 2017	Aug 04, 2017
Printed, PDF	Page 268 1st paragraph of the "Retries" section	As per the Go code on the same page, the number of backend RPCs per logical request should be 20, not 10. Note from the Author or Editor: Change the Go code to try 10 times instead of 20. (This avoids follow-on changes that would be needed in subsequent paragraphs were the previous paragraph updated to say 20 retries, matching the code.)	Kazushige Hosokawa	Jan 24, 2017	Aug 04, 2017
Printed, PDF	Page 273 3rd paragraph	"RPCs between deeper layers of the stack" sounds like a single RPC chain, in which case cancellation propagation is not applicable. Maybe it should be something like "subsequent RPCs issued from within the same function", and "until it eventually times out, despite being unable to make progress." should be something like "until it returns or eventually times out, despite the function being unable to make progress". Note from the Author or Editor: Revised paragraph on cancellation propagation to read as follows, in new subsection titled "Cancellation propagation": """ Propagating cancellations reduces unneeded or doomed work by advising servers in an RPC call stack that their efforts are no longer necessary. To reduce latency, some systems use "hedged requests" [Dea13] to send RPCs to a primary server, then some time later, send the same request to other instances of the same service in case the primary is slow in responding; once the client has received a response from any server, it sends messages to the other servers to cancel the now-superfluous requests. Those requests may themselves transitively fan out to many other servers, so cancellations should be propagated throughout the entire stack. This approach can also be used to avoid the potential leakage that occurs if an initial RPC has a long deadline, but subsequent critical RPCs between deeper layers of the stack receive errors which can't succeed on retry, or have short deadlines and time out. Using only simple deadline propagation, the initial call continues to use server resources until it eventually times out, despite being doomed to failure. Sending fatal errors or timeouts up the stack and cancelling other RPCs in the call tree prevents unneeded work if the request as a whole can't be fulfilled. """	Kazushige Hosokawa	May 03, 2017	Aug 04, 2017
Printed	Page 298 Figure 23-8. Dueling proposers in Multi-Paxos - "Process 3"	"Process 3 sends a conflicting Prepare messge" Should be "Process 3 sends a conflicting Prepare message" Note from the Author or Editor: Replace 'messge' with 'message' in figure 23-8.	Rafael Capella	Jan 07, 2017	Jan 13, 2017
Other Digital Version	303 2nd to last paragraph	Change "minutes" to "seconds". There are 100 10-millisecond periods per second. There are 6000 10-millisecond periods per minute.	Chris Kennelly	May 04, 2016	Jan 13, 2017
Printed	Page 317 1st paragraph	"... run once a month should not be be skipped." has an extra "be". It should be "... run once a month should not be skipped". Note from the Author or Editor: Remove excess 'be'.	Rafael Capella	Jan 07, 2017	Jan 13, 2017
	Page 325 4th paragraph	When describing the crontab specification, "every day of the week" should read "every day of the month".	Anonymous	Dec 30, 2020
Printed, PDF	Page 420 The last paragraph	The sentence "In either case, ..." should not be on the second list item as "either" refers to both types of fires. Note from the Author or Editor: Move sentence "In either case, the team needs to build tools to control the burn." out of the list into its own paragraph.	Kazushige Hosokawa	Apr 08, 2017	Aug 04, 2017
PDF	Page 488 Footnote 3	The second sentence of Footnote 3 might be missing some words (e.g., add "for example" before "adding specific ..." and also add some description at the end of the sentence why it is bad). Note from the Author or Editor: Add "such as" before "adding specific monitoring/alerting".	Takeo Sawada	Apr 21, 2017	Aug 04, 2017
Printed, PDF	Page 505 Jai13	Jai13 points to https://research.google.com/pubs/pub41761.html, but should point to https://research.google.com/pubs/pub41761.pdf.	Michael Stapelberg	Aug 31, 2016	Jan 13, 2017
Printed	Page 508 Pot16	Paper is now published, Communications of the ACM, Vol. 59 No. 7, Pages 78-87; http://dl.acm.org/citation.cfm?id=2963119.2854146.	Chris Jones	Jun 29, 2016	Jan 13, 2017
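
Several of the errata above rest on small calculations: the "50x" in the Page 61 footnote, the count of 10-millisecond periods on page 303, and the retry, capacity, and thread figures quoted in the Chapter 22 entry. The Python sketch below re-derives them from the numbers given in those entries. It is only a sanity check, not code from the book, and the function and variable names are illustrative assumptions.

```python
import math


def errata_arithmetic():
    """Recompute the figures quoted in the errata entries above."""
    # Page 61, footnote 2: 5 s at the 99th percentile versus a 100 ms average.
    latency_ratio = 5_000 // 100

    # Page 303: number of 10-millisecond periods in a second and in a minute.
    periods_per_second = 1_000 // 10
    periods_per_minute = 60 * periods_per_second

    # Chapter 22: three layers (backend, frontend, JavaScript), each making
    # 4 attempts (3 retries), multiply into 4^3 attempts on the database.
    database_attempts = 4 ** 3

    # Chapter 22: peak load of 19,000 QPS, 5,000 QPS per cluster, run at N + 2.
    clusters_needed = math.ceil(19_000 / 5_000) + 2

    # Chapter 22: 1,000 QPS at 100 ms per request keeps this many threads busy.
    busy_threads = 1_000 * 100 // 1_000

    return {
        "latency_ratio": latency_ratio,
        "periods_per_second": periods_per_second,
        "periods_per_minute": periods_per_minute,
        "database_attempts": database_attempts,
        "clusters_needed": clusters_needed,
        "busy_threads": busy_threads,
    }


if __name__ == "__main__":
    for name, value in errata_arithmetic().items():
        print(f"{name}: {value}")
```

Each value matches the corrected text: 50x, 100 periods per second (6,000 per minute), 64 database attempts, six clusters, and 100 occupied worker threads.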