Errata for Site Reliability Engineering

The errata list records errors found after the product was released, together with their corrections. If an error was corrected in a later version or reprint, the date of the correction is shown in the "Date corrected" column.

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Other Digital Version
-1
first bullet point

On page https://landing.google.com/sre/book/chapters/practical-alerting.html:

"The sheer number of components being analyze"

"Analyze" should be "analyzed".

Nick Heiner  Jan 02, 2018 
Other Digital Version
-1
Footnote 41

On https://landing.google.com/sre/book/chapters/part3.html#id-dA2uaIyFqF4, the link to the US Digital Service page is broken. The new links you should use are either https://obamawhitehouse.archives.gov/participate/united-states-digital-service or https://www.usds.gov/.

Note from the Author or Editor:
Replace https://www.whitehouse.gov/digital/united-states-digital-service with https://www.usds.gov.

Nick Heiner  Jan 02, 2018  Oct 19, 2018
Page -1
https://sre.google/sre-book/bibliography/

[Kri12]: Author's name is printed as "K. Krishan", should be "K. Krishnan".

Note from the Author or Editor:
"K. Krishan" should be "K. Krishnan" in the bibliography.

Michael Farrell  Jan 17, 2022 
Other Digital Version
?
Introduction, Google's Approach to Service Management, 4th paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "is" and "useful" in the following sentence:
[…]in addition had a set of technical skills that isuseful to SRE[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Other Digital Version
?
Introduction, Pursuing Maximum Change Velocity, 6th paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "both" and "development" in the sentence […]and an occurrence that bothdevelopment and SRE teams[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Other Digital Version
?
Introduction, Monitoring, 1st paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html, spaces are missing in the following places:

between "monitoring" and "strategy" in […]monitoringstrategy should be constructed thoughtfully[…]
between "common" and "approach" in […]A classic and commonapproach to monitoring[…]
between "an" and "effective" in […]this type of email alerting is not aneffective solution[…]
between "email" and "and" in […]a human to read an emailand decide[…]
between "in" and "response" in […]action needs to be taken inresponse is fundamentally flawed.[…]
between "a" and "human" in […]Monitoring should never require ahuman to interpret[…]
between "software" and "should" in […]Instead, softwareshould do the interpreting[…]
between "when" and "they" in […]notified only whenthey need to take action[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Printed
Page xxiii
last line of page

In the acknowledgements list, "Sean Sechrest" is listed (a Google SRE). Sean's actual name is "Sean Sechrist".

Note from the Author or Editor:
Replaced Sechrest with Sechrist.

Doug Meil  Nov 27, 2017  Oct 19, 2018
Other Digital Version
Appendix E - Launch Coordination Checklist
At the bottom of the page

https://landing.google.com/sre/sre-book/chapters/launch-checklist/

Shouldn't "External dependencies" and "Schedule and rollout planning" have the same indentation level as "Growth issues"?

Note from the Author or Editor:
Yes, they should; ETA for fix to appear: 11/30/2020

Maksim Fedoseev  Mar 03, 2019  Nov 30, 2020
Other Digital Version
Chapter 22 - Addressing Cascading Failures
In quite a lot of places across this chapter.

https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/

Many whitespace characters are missing in this chapter, namely:

1 & 2.
A vicious cycle can occur in this scenario:less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU.This is known colloquially as the “GC death spiral.”

One after ":" and another one in "...of CPU.This...".

3.
Serve lower-quality, cheaper-to-compute results to the user. Your strategy here will be service-specific.See Load Shedding and Graceful Degra

After "be service-specific".

4.
Servers should protect themselves from becoming overloaded and crashing.When overloaded at either the frontend or backend layers, fail early and cheaply. For details, see Load Shedding and Graceful Degradation.

After "and crashing".

5.
Note that because rate limiting often doesn’t take overall service health into account, it may not be able to stop a failure that has already begun.Simple rate-limiting implementations are also likely to leave capacity unused. Rate limiting can be implemented in a number of places:

After "has already begun".

6 & 7.
Good capacity planning can reduce the probability that a cascading failure will occur.Capacity planning should be coupled with performance testing to determine the load at which the service will fail.For instance, if every cluster’s breaking point is 5,000 QPS, the load is evenly spread across clusters,109 and the service’s peak load is 19,000 QPS, then approximately six clusters are needed to run the service at N + 2.

One after "failure will occur" and the other one after "service will fail".

8.
Most thread-per-request servers use a queue in front of a thread pool to handle requests.Requests come in, they sit on a queue, and then threads pick requests off the queue and perform the actual work (whatever actions are required by the server). Usually, if the queue is full, the server will reject new requests.

After "handle requests".

9.
More sophisticated approaches include identifying clients to be more selective about what work is dropped, or picking requests that are more important and prioritizing.Such strategies are more likely to be needed for shared services.

After "are more important and prioritizing".

10 & 11.
Suppose the code in the frontend that talks to the backend implements retries naively.It retries after encountering a failure and caps the number of backend RPCs per logical request to 10.Consider this code in the frontend, using gRPC in Go:

After "implements retries naively" and "RPCs per logical request to 10".

12.
Think about the service holistically and decide if you really need to perform retries at a given level. In particular, avoid amplifying retries by issuing retries at multiple levels: a single request at the highest layer may produce a number of attempts as large as the product of the number of attempts at each layer to the lowest layer. If the database can’t service requests because it’s overloaded, and the backend, frontend, and JavaScript layers all issue 3 retries (4 attempts), then a single user action may create 64 attempts (4^3) on the database.This behavior is undesirable when the database is returning those errors because it’s overloaded.

After "on the database".

13.
Suppose an RPC has a 10-second deadline, as set by the client.The server is very overloaded, and as a result, it takes 11 seconds to move from a queue to a thread pool. At this point, the client has already given up on the request. Under most circumstances, it would be unwise for the server to attempt to handle this request, because it would be doing work for which no credit will be granted—the client doesn’t care what work the server does after the deadline has passed, because it’s given up on the request already.

After "as set by the client".

14.
Suppose that the frontend from the preceding example consists of 10 servers, each with 100 worker threads. This means that the frontend has a total of 1,000 threads of capacity.During usual operation, the frontends perform 1,000 QPS and requests complete in 100 ms. This means that the frontends usually have 100 worker threads occupied out of the 1,000 configured worker threads (1,000 QPS * 0.1 seconds).

After "1,000 threads of capacity".

15.
Suppose an event causes 5% of the requests to never complete.This could be the result of the unavailability of some Bigtable row ranges, which renders the requests corresponding to that Bigtable keyspace unservable. As a result, 5% of the requests hit the deadline, while the remaining 95% of the requests take the usual 100 ms.

After "5% of the requests to never complete".

16.
Employ general cascading failure prevention techniques.In particular, servers should reject requests when they’re overloaded or enter degraded modes, and testing should be performed to see how the service behaves after events such as a large restart.

After "failure prevention techniques".

17.
Understand how large clients use your service.For example, you want to know if clients:

After "clients use your service".

18.
If servers are somehow wedged and not making progress, restarting them may help.Try restarting servers when:

After "them may help".
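
The arithmetic in the passages quoted in items 6 & 7, 12, and 14 above can be checked with a short Python sketch. The function names are illustrative and do not come from the book; the input numbers are the ones in the quoted text.

```python
import math


def clusters_needed(peak_qps, breaking_point_qps, spare=2):
    """Clusters required to serve peak load at N + spare (items 6 & 7)."""
    return math.ceil(peak_qps / breaking_point_qps) + spare


def worst_case_attempts(attempts_per_layer, layers):
    """Attempts reaching the lowest layer when every layer retries (item 12)."""
    return attempts_per_layer ** layers


def busy_threads(qps, latency_ms):
    """Average number of occupied worker threads: QPS * latency (item 14)."""
    return qps * latency_ms / 1000


print(clusters_needed(19_000, 5_000))  # 6 clusters (4 for load, 2 spare)
print(worst_case_attempts(4, 3))       # 64 attempts on the database
print(busy_threads(1_000, 100))        # 100.0 threads out of 1,000
```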

Note from the Author or Editor:
Yes, this is likely an HTML issue; Google has opened a web bug to address it.

Maksim Fedoseev  Mar 03, 2019 
Other Digital Version
Chapter 10
In the section Maintaining the Configuration, the first bullet point

The first bullet point is missing a close parenthesis: "(e.g., our HTTP response code on the http_responses variable"

I am reading this in the free online version at https://landing.google.com/sre/sre-book/chapters/practical-alerting/, so I am unable to give a page number.

Matt Halverson  May 05, 2019 
Other Digital Version
x
At reference Jai13

On the online SRE book page
https://landing.google.com/sre/sre-book/chapters/production-environment/#id-N1KFQTnFxhW
the link for reference Jai13 is broken.

The link https://static.googleusercontent.com/media/research.google.com/en//pubs/pub41761.pdf
should be substituted with
https://ai.google/research/pubs/pub41761

Note from the Author or Editor:
Corrected; will go live 11/18/2020.

Radu Prekup  Jul 01, 2019 
Other Digital Version
x
On references page

On the SRE Book page https://landing.google.com/sre/sre-book/chapters/bibliography/#Sch15
the link https://ramcloud.stanford.edu/raft.pdf for [Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" needs updating.
The correct link is now: https://raft.github.io/raft.pdf

Radu Prekup  Jul 01, 2019 
Other Digital Version
x
https://landing.google.com/sre/sre-book/chapters/bibliography/#Mor12a

If the correct ordering of references is first by author and then by publication date, then:
Dea13, Dea04, Dea07
should be ordered:
Dea04, Dea07, Dea13

Radu Prekup  Jul 01, 2019 
PDF
Page 5
3rd paragraph, last sentence

"...SRE can be broken down into two main categories."

Note from the Author or Editor:
"As a whole, SRE can be broken down two main categories." should be "As a whole, SREs can be broken down into two main categories.".

Anonymous  Apr 26, 2016  Jan 13, 2017
Printed
Page 21
3rd & 4th paragraphs

'HTML request' should be 'HTTP request'.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed, PDF, ePub
Page 44
last 2 paragraphs

The last 2 paragraphs' last non-parenthetical sentences are the same.

"Upper management will probably want a monthly or quarterly assessment, too."

Note from the Author or Editor:
The last two paragraphs on page 44 ("It's both unrealistic and undesirable ..." and "The rate at which SLOs are missed ...") are essentially duplicates in slightly different wording. The second paragraph should be removed.

Daniel Rogers  Sep 14, 2016  Jan 13, 2017
Printed
Page 61
footnote 2

"If 1% of your requests are 10x the average, " should be "If 1% of your requests are 50x the average, " as it's 5s (99-percentile) / 100ms (average).

Note from the Author or Editor:
Change "10x" to "50x".
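
The corrected ratio follows directly from the two figures cited in the erratum (5 s at the 99th percentile, 100 ms average). The sketch below reproduces it; the second function is an additional illustration that is not part of the erratum, and it assumes, for simplicity, that the slowest 1% of requests all take exactly 5 s.

```python
def slow_to_average_ratio(slow_ms, average_ms):
    """How many times slower the slow requests are than the average."""
    return slow_ms / average_ms


def mean_of_remaining_ms(average_ms, slow_ms, slow_fraction):
    """Mean latency of the other requests, given the overall average.

    Assumes every request in the slow fraction takes exactly slow_ms.
    """
    return (average_ms - slow_fraction * slow_ms) / (1 - slow_fraction)


print(slow_to_average_ratio(5_000, 100))                  # 50.0
print(round(mean_of_remaining_ms(100, 5_000, 0.01), 1))   # 50.5
```

Under that assumption, the remaining 99% of requests average roughly half the overall mean.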

Tatz Sekine  Jan 09, 2017  Jan 13, 2017
Printed
Page 77
3rd paragraph

"why each clusters took six or more weeks"

"clusters" should be singular.

Note from the Author or Editor:
Replace "why each clusters took six or more weeks" with "why each cluster took six or more weeks".

Ai Vong  Feb 25, 2018  Oct 19, 2018
PDF
Page 79
Figure 7-2.

The figure doesn't correspond to the described process. According to the description, if the test fails, the corresponding fix is called and then the test is retried. The figure doesn't show the retry; instead, there is a direct arrow from the fix box to the next test.

Note from the Author or Editor:
Figure 7-2 should show another call between TestDNSMonitoringConfigExists and FixDNSMonitoringCreateConfig. Perhaps the existing arrow should be double-ended?

Eleni Siakagianni  Jun 05, 2017  Aug 04, 2017
Printed
Page 92
Figure 8-1

Image isn't formatted for black & white printing, so there's no differentiation between box colours.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed
Page 94
Penultimate paragraph

from Daisuke Yabuki

The last sentence reads "our source-based filesystem [Kem11]."

In Bibliography on page 507, [Kem11] refers to:
C. Kemper, "Build in the Cloud: How the Build System works"

However, this blog article doesn't mention a source-based filesystem.

I believe the correct reference should be:
N. York, "Build in the Cloud: Accessing Source Code"

Anonymous  Jul 25, 2017  Aug 04, 2017
Printed
Page 111
First paragraph

The SNMP abbreviation is expanded incorrectly.
The book says:

SNMP (Simple Networking Monitoring Protocol)

but should be:

SNMP (Simple *Network Management* Protocol)

See the source referenced in the book for proof:
https://technet.microsoft.com/en-us/library/cc776379(v=ws.10).aspx
or Wikipedia:
https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

Note from the Author or Editor:
Replace "Simple Networking Monitoring Protocol" with "Simple Network Management Protocol".

Vladimir Rutsky  Nov 06, 2016  Jan 13, 2017
Printed
Page 115
Lemur inset

"nonmonotonically decreasing value" means the value is decreasing, but not monotonically. This is contradicted by the second half of the sentence, which states the meaning that the authors intended: that counter values only increase.

Note from the Author or Editor:
"nonmonotonically decreasing" should be replaced with "monotonically non-decreasing".

Cory Lueninghoener  Aug 21, 2016  Jan 13, 2017
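
To illustrate the corrected wording in the erratum above: a sequence is monotonically non-decreasing when each sample is greater than or equal to the one before it, so a counter may stay flat but never drops. A minimal sketch follows; the function name is hypothetical and the sample values are made up.

```python
def is_monotonically_non_decreasing(samples):
    """True if every sample is >= the sample that precedes it."""
    return all(a <= b for a, b in zip(samples, samples[1:]))


print(is_monotonically_non_decreasing([0, 3, 3, 7]))  # True: flat stretches are allowed
print(is_monotonically_non_decreasing([0, 3, 2, 7]))  # False: the value decreased
```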
Printed, PDF
Page 116
2nd paragraph

For the results of the task:http_requests:rate10m rule, hostnames in the instance labels should be host0 through host4.

Note from the Author or Editor:
Replace host2 .. host5 with host1 .. host4.

Kazushige Hosokawa  May 20, 2017  Aug 04, 2017
Printed
Page 117
Borgmon code example

Missing ');' in second rule.

'jobwebserver' should be 'job=webserver' in third rule.

Chris Jones  Dec 15, 2016  Jan 13, 2017
Printed, PDF
Page 118
2nd and 3rd paragraphs in the Alerting section

"number of errors" should be "number of errors per second" as per the corresponding borgmon rule expression ({var=dc:http_errors:rate10m,job=webserver} > 1).

Note from the Author or Editor:
Append "per second" to "number of errors exceeds 1:".

Kazushige Hosokawa  Jan 09, 2017  Jan 13, 2017
Printed
Page 164
Paragraph 5

There should be a space in 'usable.Retain'.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed
Page 165
8th paragraph

The scenario starts on page 161 at 2pm on a Friday. The "Managed Incident" replay has Mary returning to work on the day after the incident, which would be a Saturday. Is that intentional?

Note from the Author or Editor:
Unintentional. Friday should be changed to Thursday.

Dave Smith  Jul 03, 2016  Jan 13, 2017
Printed
Page 172
1st paragraph

Lest the impression be left that no names of any kind appear in a postmortem, clarifying that "user" means "end-user" or "customer" might be appropriate.

Note from the Author or Editor:
s/user/end-user/

Dave Smith  Jul 03, 2016  Jan 13, 2017
Printed
Page 189
2nd paragraph

In the first sentence of the paragraph, "a subset of servers is upgraded", but a few sentences later, "the single modified server can be quickly reverted".

Whether it is "a subset of servers" or a "single (modified) server", the number of servers should be the same in both sentences so that the description of reverting the canary is consistent.

Note from the Author or Editor:
Replace "the single modified server" with "the modified servers", and in the next sentence, replace "the upgraded server" with "the upgraded servers".

Tatz Sekine  Apr 05, 2017  Aug 04, 2017
Printed
Page 198
1st paragraph

In the explanation of the acceptable flakiness calculation, the text reads "0.99 (the fraction of patches that can be rejected)", but it should probably be "0.99 (the fraction of patches that should be accepted)".

Note from the Author or Editor:
Replace "that can be rejected" with "that are accepted".

Tatz Sekine  Apr 07, 2017  Aug 04, 2017
PDF
Page 213
4th paragraph

Missing ')' somewhere in the following sentence:
This component formulates a machine-readable request (a protocol buffer that can be understood by the Auxon Solver.

Note from the Author or Editor:
Replace 'request (a' with 'request: a'.

Takeo Sawada  Dec 26, 2016  Jan 13, 2017
Printed
Page 239
2nd paragraph

"could yield the following rounds" is misleading; I believe this should read "could yield the following shuffled_backends arrays for each round". I had to read this several times since the previous paragraph says that "we devide /client/ tasks into rounds", and here the elements are in fact backends.

I think that the whole section is hard to read and could be rephrased in much simpler words; I'd be happy to provide suggestions on request.

Note from the Author or Editor:
"yield the following rounds:" should read "yield the following shuffled backends:", where backends is in code font.

Patrik Fimml  Aug 24, 2016  Jan 13, 2017
Printed
Page 250
5th (or, last) paragraph

This paragraph does not state what the multiplier K is, but the sentence "backends end up rejecting one request for each request they actually process" implies that the value of K is 2.

In the next paragraph, on the next page, another sentence, "allowing roughly half of the backend resources to be consumed by ...", also implies that the value of K is 2.

However, the value is only mentioned a few paragraphs later: "We generally prefer the 2x multiplier".

Note from the Author or Editor:
Move paragraph "We've found adaptive ... latency penalties." to be the second-last paragraph in the section, immediately before "One additional consideration ... to be expensive.".
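
For context on the multiplier K discussed in this erratum, the sketch below assumes the client-side rejection probability used in the book's adaptive throttling discussion, max(0, (requests - K * accepts) / (requests + 1)). The formula is reproduced from memory and should be checked against the text; the function and parameter names are illustrative.

```python
def rejection_probability(requests, accepts, k=2.0):
    """Probability that the client rejects a new request locally.

    requests: requests attempted by the client in the recent window
    accepts:  requests the backend accepted in the same window
    k:        multiplier; with k = 2 the client starts rejecting locally
              once requests exceed twice the number of accepts
    (Assumed formula; verify against the book before relying on it.)
    """
    return max(0.0, (requests - k * accepts) / (requests + 1))


print(rejection_probability(100, 100))  # 0.0: backend accepts everything
print(rejection_probability(200, 50))   # about 0.4975: backend rejects most requests
```

With K = 2, no local rejection happens until the backend itself is rejecting about one request for each one it accepts, which matches the behaviour the submitter infers from the quoted sentences.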

Tatz Sekine  Dec 23, 2016  Jan 13, 2017
PDF
Page 265
Near top

"This is the most important important exercise you should conduct in order to prevent server overload." should probably only have "important" once.

Note from the Author or Editor:
s/most important important exercise/most important exercise/

Omer Zach  Mar 01, 2017  Aug 04, 2017
Printed, PDF
Page 268
1st paragraph of the "Retries" section

As per the Go code on the same page, the number of backend RPCs per logical request should be 20, not 10.

Note from the Author or Editor:
Change the Go code to try 10 times instead of 20.

(This avoids follow-on changes that would be needed in subsequent paragraphs were the previous paragraph updated to say 20 retries, matching the code.)

Kazushige Hosokawa  Jan 24, 2017  Aug 04, 2017
Printed, PDF
Page 273
3rd paragraph

"RPCs between deeper layers of the stack" sounds like a single RPC chain, in which case cancellation propagation is not applicable. Maybe it should be something like "subsequent RPCs issued from within the same function", and "until it eventually times out, despite being unable to make progress." should be something like "until it returns or eventually times out, despite the function being unable to make progress".

Note from the Author or Editor:
Revised paragraph on cancellation propagation to read as follows, in new subsection titled "Cancellation propagation":

"""
Propagating cancellations reduces unneeded or doomed work by advising servers in an RPC call stack that their efforts are no longer necessary. To reduce latency, some systems use "hedged requests" [Dea13] to send RPCs to a primary server, then some time later, send the same request to other instances of the same service in case the primary is slow in responding; once the client has received a response from any server, it sends messages to the other servers to cancel the now-superfluous requests. Those requests may themselves transitively fan out to many other servers, so cancellations should be propagated throughout the entire stack.

This approach can also be used to avoid the potential leakage that occurs if an initial RPC has a long deadline, but subsequent critical RPCs between deeper layers of the stack receive errors which can't succeed on retry, or have short deadlines and time out. Using only simple deadline propagation, the initial call continues to use server resources until it eventually times out, despite being doomed to failure. Sending fatal errors or timeouts up the stack and cancelling other RPCs in the call tree prevents unneeded work if the request as a whole can't be fulfilled.
"""

Kazushige Hosokawa  May 03, 2017  Aug 04, 2017
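
The hedged-request pattern described in the revised paragraph above can be sketched with Python's asyncio. This is only an illustration under stated assumptions: fake_server stands in for a real RPC, the names and delays are hypothetical, and a real system would propagate the cancellation over the network to servers deeper in the call tree.

```python
import asyncio


async def fake_server(name, delay, cancelled):
    """Hypothetical backend: replies after `delay` seconds unless cancelled."""
    try:
        await asyncio.sleep(delay)
        return name
    except asyncio.CancelledError:
        cancelled.append(name)  # record that the cancellation reached this server
        raise


async def hedged_request(backends, hedge_after, cancelled):
    """Send to the primary; if it is slow, hedge to the other instances.

    backends: list of (name, delay) pairs, primary first.
    Returns the name of the first server to respond and cancels the rest.
    """
    tasks = [asyncio.create_task(fake_server(*backends[0], cancelled))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after)
    if not done:
        # The primary has not answered in time: send the same request elsewhere.
        tasks += [
            asyncio.create_task(fake_server(name, delay, cancelled))
            for name, delay in backends[1:]
        ]
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    winner = next(iter(done)).result()
    for task in tasks:
        if not task.done():
            task.cancel()  # the remaining requests are now superfluous
    await asyncio.gather(*tasks, return_exceptions=True)
    return winner
```

For example, calling asyncio.run(hedged_request([("primary", 0.5), ("backup", 0.01)], 0.05, log)) with an empty list log returns "backup" and leaves "primary" recorded in log as cancelled.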
Printed
Page 298
Figure 23-8. Dueling proposers in Multi-Paxos - "Process 3"

"Process 3 sends a conflicting Prepare messge" Should be "Process 3 sends a conflicting Prepare message"

Note from the Author or Editor:
Replace 'messge' with 'message' in figure 23-8.

Rafael Capella  Jan 07, 2017  Jan 13, 2017
Other Digital Version
303
2nd to last paragraph

Change "minutes" to "seconds".

There are 100 10-millisecond periods per second. There are 6000 10-millisecond periods per minute.
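
The unit arithmetic behind this correction can be checked in a few lines of Python (variable names are illustrative):

```python
PERIOD_MS = 10  # length of one period, in milliseconds

periods_per_second = 1_000 // PERIOD_MS        # 1,000 ms in a second
periods_per_minute = 60 * periods_per_second   # 60 seconds in a minute

print(periods_per_second)  # 100
print(periods_per_minute)  # 6000
```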

Chris Kennelly  May 04, 2016  Jan 13, 2017
Printed
Page 317
1st paragraph

"... run once a month should not be be skipped." have an extra "be". Should be "... run once a month should not be skipped".

Note from the Author or Editor:
Remove excess 'be'.

Rafael Capella  Jan 07, 2017  Jan 13, 2017
Page 325
4th paragraph

When describing the crontab specification, "every day of the week" should read "every day of the month"
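
As a reminder of the standard five-field crontab layout that this correction relies on, the sketch below labels each field of a schedule; the third field is the day of the month and the fifth is the day of the week. The helper name and the example schedule are illustrative and are not taken from the book.

```python
CRON_FIELDS = ("minute", "hour", "day of month", "month", "day of week")


def describe_crontab(spec):
    """Map each field of a five-field crontab schedule to its meaning."""
    parts = spec.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


for field, value in describe_crontab("0 0 * * *").items():
    print(f"{field}: {value}")
```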

Note from the Author or Editor:

Anonymous  Dec 30, 2020 
Printed, PDF
Page 420
The last paragraph

The sentence "In either case, ..." should not be on the second list item as "either" refers to both types of fires.

Note from the Author or Editor:
Move sentence "In either case, the team needs to build tools to control the burn." out of the list into its own paragraph.

Kazushige Hosokawa  Apr 08, 2017  Aug 04, 2017
PDF
Page 488
Footnote 3

The second sentence of Footnote 3 might be missing some words (e.g., add "for example" before "adding specific ..." and also add some description at the end of the sentence why it is bad).

Note from the Author or Editor:
Add "such as" before "adding specific monitoring/alerting".

Takeo Sawada  Apr 21, 2017  Aug 04, 2017
Printed, PDF
Page 505
Jai13

Jai13 points to https://research.google.com/pubs/pub41761.html, but should point to https://research.google.com/pubs/pub41761.pdf.

Michael Stapelberg  Aug 31, 2016  Jan 13, 2017
Printed
Page 508
Pot16

The paper is now published: Communications of the ACM, Vol. 59 No. 7, Pages 78-87; http://dl.acm.org/citation.cfm?id=2963119.2854146.

Chris Jones  Jun 29, 2016  Jan 13, 2017