Errata for Site Reliability Engineering
The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Color Key: Serious Technical Mistake, Minor Technical Mistake, Language or formatting error, Typo, Question, Note, Update
Version |
Location |
Description |
Submitted By |
Date Submitted |
Date Corrected |
Other Digital Version |
-1
Footnote 41 |
On https://landing.google.com/sre/book/chapters/part3.html#id-dA2uaIyFqF4, the link to the US Digital Service page is broken. The new links you should use are either https://obamawhitehouse.archives.gov/participate/united-states-digital-service or https://www.usds.gov/.
Note from the Author or Editor: Replace https://www.whitehouse.gov/digital/united-states-digital-service with https://www.usds.gov.
|
Nick Heiner |
Jan 02, 2018 |
Oct 19, 2018 |
Other Digital Version |
-1
first bullet point |
On page https://landing.google.com/sre/book/chapters/practical-alerting.html:
"The sheer number of components being analyze"
"Analyze" should be "analyzed".
|
Nick Heiner |
Jan 02, 2018 |
|
Other Digital Version |
?
Introduction, Monitoring, 1st paragraph |
In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "monitoring" and "strategy" in the sentence […]monitoringstrategy should be constructed thoughtfully[…] and between "common" and "approach" in […]A classic and commonapproach to monitoring[…] and between "an" and "effective" in […]this type of email alerting is not aneffective solution[…] and between "email" and "and" in […]a human to read an emailand decide[…] and between "in" and "response" in […]action needs to be taken inresponse is fundamentally flawed.[…] and between "a" and "human" in […]Monitoring should never require ahuman to interpret[…] and between "software" and "should" in […]Instead, softwareshould do the interpreting[…] and between "when" and "they" in […]notified only whenthey need to take action[…]
Note from the Author or Editor: Fixed on site.
|
Raphaël Doursenaud |
Feb 02, 2017 |
Aug 04, 2017 |
Other Digital Version |
?
Introduction, Pursuing Maximum Change Velocity, 6th paragraph |
In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "both" and "development" in the sentence […]and an occurrence that bothdevelopment and SRE teams[…]
Note from the Author or Editor: Fixed on site.
|
Raphaël Doursenaud |
Feb 02, 2017 |
Aug 04, 2017 |
Other Digital Version |
?
Introduction, Google's Approach to Service Management, 4th paragraph |
In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "is" and "useful" in the following sentence:
[…]in addition had a set of technical skills that isuseful to SRE[…]
Note from the Author or Editor: Fixed on site.
|
Raphaël Doursenaud |
Feb 02, 2017 |
Aug 04, 2017 |
Printed |
Page xxiii
last line of page |
In the acknowledgements list, "Sean Sechrest" is listed (a Google SRE). Sean's actual name is "Sean Sechrist".
Note from the Author or Editor: Replaced Sechrest with Sechrist.
|
Doug Meil |
Nov 27, 2017 |
Oct 19, 2018 |
PDF |
Page 5
3rd paragraph, last sentence |
"...SRE can be broken down into two main categories."
Note from the Author or Editor: "As a whole, SRE can be broken down two main categories." should be "As a whole, SREs can be broken down into two main categories.".
|
Anonymous |
Apr 26, 2016 |
Jan 13, 2017 |
Printed |
Page 21
3rd & 4th paragraphs |
'HTML request' should be 'HTTP request'.
|
Anonymous |
Apr 21, 2016 |
Jan 13, 2017 |
Printed, PDF, ePub, Safari Books Online |
Page 44
last 2 paragraphs |
The last 2 paragraphs' last non-parenthetical sentences are the same.
"Upper management will probably want a monthly or quarterly assessment, too."
Note from the Author or Editor: The last two paragraphs on page 44 ("It's both unrealistic and undesirable ..." and "The rate at which SLOs are missed ...") are essentially duplicates in slightly different wording. The second paragraph should be removed.
|
Daniel Rogers |
Sep 14, 2016 |
Jan 13, 2017 |
Printed |
Page 61
footnote 2 |
"If 1% of your requests are 10x the average, " should be "If 1% of your requests are 50x the average, " as it's 5s (99-percentile) / 100ms (average).
Note from the Author or Editor: Change "10x" to "50x".
|
Tatz Sekine |
Jan 09, 2017 |
Jan 13, 2017 |
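The arithmetic behind the page 61 correction above can be checked directly. This is a minimal sketch using only the two figures quoted by the submitter (a 5 s 99th-percentile latency and a 100 ms average); the variable names are illustrative and do not come from the book.

```python
# Figures quoted in the erratum: 99th-percentile latency of 5 s, average of 100 ms.
p99_latency_ms = 5_000
average_latency_ms = 100

# How many times slower the slow requests are than the average request.
ratio = p99_latency_ms // average_latency_ms
print(f"slow requests are {ratio}x the average")
```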
Printed |
Page 77
3rd paragraph |
"why each clusters took six or more weeks"
"clusters" should be singular.
Note from the Author or Editor: Replace "why each clusters took six or more weeks" with "why each cluster took six or more weeks".
|
Ai Vong |
Feb 25, 2018 |
Oct 19, 2018 |
PDF |
Page 79
Figure 7-2. |
The figure doesn't correspond to the described process. According to the description, if the test fails the corresponding fix is called and then the test is re-tried. The figure doesn't represent the re-try but there is a direct arrow from the fix box to the next test.
Note from the Author or Editor: Figure 7-2 should show another call between TestDNSMonitoringConfigExists and FixDNSMonitoringCreateConfig. Perhaps the existing arrow should be double-ended?
|
Eleni Siakagianni |
Jun 05, 2017 |
Aug 04, 2017 |
Printed |
Page 92
Figure 8-1 |
Image isn't formatted for black & white printing, so there's no differentiation between box colours.
|
Anonymous |
Apr 21, 2016 |
Jan 13, 2017 |
Printed |
Page 94
Penultimate paragraph |
from Daisuke Yabuki
The last sentence reads "our source-based filesystem [Kem11]."
In Bibliography on page 507, [Kem11] refers to:
C. Kemper, "Build in the Cloud: How the Build System works"
However this blog article doesn't mention source-based filesystem.
I believe the correct reference should be:
N.York, "Build in the Cloud: Accessing Source Code"
|
Anonymous |
Jul 25, 2017 |
Aug 04, 2017 |
Printed |
Page 111
First paragraph |
The SNMP abbreviation is expanded incorrectly.
Book says:
SNMP (Simple Networking Monitoring Protocol)
but should be:
SNMP (Simple *Network Management* Protocol)
See the source referenced in the book for proof:
https://technet.microsoft.com/en-us/library/cc776379(v=ws.10).aspx
or Wikipedia:
https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol
Note from the Author or Editor: Replace "Simple Networking Monitoring Protocol" with "Simple Network Management Protocol".
|
Vladimir Rutsky |
Nov 06, 2016 |
Jan 13, 2017 |
Printed |
Page 115
Lemur inset |
"nonmonotonically decreasing value" means the value is decreasing, but not monotonically. This is contradicted by the second half of the sentence, which states the meaning that the authors intended: that counter values only increase.
Note from the Author or Editor: "nonmonotonically decreasing" should be replaced with "monotonically non-decreasing".
|
Cory Lueninghoener |
Aug 21, 2016 |
Jan 13, 2017 |
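To make the corrected term in the page 115 erratum above concrete: a sequence is monotonically non-decreasing when every value is greater than or equal to the one before it, which matches the counter behaviour the submitter describes. The following is a minimal sketch; the function name and the sample values in the usage note are hypothetical and not taken from the book.

```python
def is_monotonically_non_decreasing(samples):
    """Return True if no sample is smaller than the sample before it."""
    # Compare each sample with its successor; equal neighbours are allowed.
    return all(earlier <= later for earlier, later in zip(samples, samples[1:]))
```

For example, the samples `[1, 1, 2, 5]` satisfy the property (a counter may stay flat), while `[5, 3, 4]` do not, because the value drops from 5 to 3.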
Printed, PDF |
Page 116
2nd paragraph |
For the results of task:http_requests:rate10m rule, hostnames in the instance labels should be host0 through host4.
Note from the Author or Editor: Replace host2 .. host5 with host1 .. host4.
|
Kazushige Hosokawa |
May 20, 2017 |
Aug 04, 2017 |
Printed |
Page 117
Borgmon code example |
Missing ');' in second rule.
'jobwebserver' should be 'job=webserver' in third rule.
|
Chris Jones |
Dec 15, 2016 |
Jan 13, 2017 |
Printed, PDF |
Page 118
2nd and 3rd paragraphs in the Alerting section |
"number of errors" should be "number of errors per second" as per the corresponding borgmon rule expression ({var=dc:http_errors:rate10m,job=webserver} > 1).
Note from the Author or Editor: Append "per second" to "number of errors exceeds 1:".
|
Kazushige Hosokawa |
Jan 09, 2017 |
Jan 13, 2017 |
Printed |
Page 164
Paragraph 5 |
Should be a space in 'usable.Retain'.
|
Anonymous |
Apr 21, 2016 |
Jan 13, 2017 |
Printed |
Page 165
8th paragraph |
The scenario starts on page 161 at 2pm on a Friday. The "Managed Incident" replay has Mary returning to work on the day after the incident, which would be a Saturday. Is that intentional?
Note from the Author or Editor: Unintentional. Friday should be changed to Thursday.
|
Dave Smith |
Jul 03, 2016 |
Jan 13, 2017 |
Printed |
Page 172
1st paragraph |
Lest the impression be left that no names of any kind appear in a postmortem, clarifying that "user" means "end-user" or "customer" might be appropriate.
Note from the Author or Editor: s/user/end-user/
|
Dave Smith |
Jul 03, 2016 |
Jan 13, 2017 |
Printed |
Page 189
2nd paragraph |
The first sentence of the paragraph says "a subset of servers is upgraded", but a few sentences later it says "the single modified server can be quickly reverted".
Whether it is "a subset of servers" or a "single (modified) server", the number of servers should be the same in both sentences so that the description of reverting the canary is consistent.
Note from the Author or Editor: Replace "the single modified server" with "the modified servers", and in the next sentence, replace "the upgraded server" with "the upgraded servers".
|
Tatz Sekine |
Apr 05, 2017 |
Aug 04, 2017 |
Printed |
Page 198
1st paragraph |
In the explanation for acceptable flakiness calculation, there is "0.99 (the fraction of patches that can be rejected)", but it might be "0.99 (the fraction of patches that should be accepted)".
Note from the Author or Editor: Replace "that can be rejected" with "that are accepted".
|
Tatz Sekine |
Apr 07, 2017 |
Aug 04, 2017 |
PDF |
Page 213
4th paragraph |
Missing ')' somewhere in the following sentence:
This component formulates a machine-readable request (a protocol buffer that can be understood by the Auxon Solver.
Note from the Author or Editor: Replace 'request (a' with 'request: a'.
|
Takeo Sawada |
Dec 26, 2016 |
Jan 13, 2017 |
Printed |
Page 239
2nd paragraph |
"could yield the following rounds" is misleading; I believe this should read "could yield the following shuffled_backends arrays for each round". I had to read this several times since the previous paragraph says that "we devide /client/ tasks into rounds", and here the elements are in fact backends.
I think that the whole section is hard to read and could be rephrased in much simpler words; I'd be happy to provide suggestions on request.
Note from the Author or Editor: "yield the following rounds:" should read "yield the following shuffled backends:", where backends is in code font.
|
Patrik Fimml |
Aug 24, 2016 |
Jan 13, 2017 |
Printed |
Page 250
5th (or, last) paragraph |
This paragraph does not state what the multiplier K is, but the sentence "backends end up rejecting one request for each request they actually process" implies that the value of K is 2.
In the next paragraph, on the next page, another sentence, "allowing roughly half of the backend resources to be consumed by ...", also implies that the value of K is 2.
However, the multiplier is not mentioned until a few paragraphs later: "We generally prefer the 2x multiplier".
Note from the Author or Editor: Move paragraph "We've found adaptive ... latency penalties." to be the second-last paragraph in the section, immediately before "One additional consideration ... to be expensive.".
|
Tatz Sekine |
Dec 23, 2016 |
Jan 13, 2017 |
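The role of the multiplier K in the page 250 erratum above may be easier to follow with numbers. This sketch assumes the chapter's client-side adaptive throttling rule has the form max(0, (requests - K * accepts) / (requests + 1)); that form, the function name, and the parameter names are assumptions of this example, so check them against the book before relying on them.

```python
def rejection_probability(requests, accepts, k=2.0):
    """Probability that a client rejects a new request locally.

    requests: requests the client attempted in the recent window (assumed name).
    accepts:  requests the backend accepted in the same window (assumed name).
    k:        the multiplier K discussed in the erratum; 2.0 is the "2x multiplier".
    """
    # Below the k * accepts threshold the numerator is negative, so the
    # probability is clamped to zero and nothing is rejected locally.
    return max(0.0, (requests - k * accepts) / (requests + 1))
```

Under this assumed rule with k=2, the client rejects nothing locally until its requests exceed twice the accepted count, which is consistent with the submitter's reading that backends reject about one request for each request they process.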
PDF |
Page 265
Near top |
"This is the most important important exercise you should conduct in order to prevent server overload." should probably only have "important" once.
Note from the Author or Editor: s/most important important exercise/most important exercise/
|
Omer Zach |
Mar 01, 2017 |
Aug 04, 2017 |
Printed, PDF |
Page 268
1st paragraph of the "Retries" section |
As per the Go code on the same page, the number of backend RPCs per logical request should be 20, not 10.
Note from the Author or Editor: Change the Go code to try 10 times instead of 20.
(This avoids follow-on changes that would be needed in subsequent paragraphs were the previous paragraph updated to say 20 retries, matching the code.)
|
Kazushige Hosokawa |
Jan 24, 2017 |
Aug 04, 2017 |
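The page 268 correction above turns on how many backend RPCs one logical request can generate. The sketch below is a Python illustration of a capped retry loop, not the book's Go code; `call_with_retries`, `rpc`, and `max_attempts` are hypothetical names, and the default of 10 follows the editor's note.

```python
def call_with_retries(rpc, max_attempts=10):
    """Call rpc() until it succeeds or max_attempts calls have been made."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return rpc()
        except Exception as error:  # a real client would retry only retriable errors
            last_error = error
    # Every attempt failed, so the backend saw max_attempts RPCs for one request.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With this cap, a request that always fails produces exactly 10 backend calls, which is the figure the surrounding text in the book uses.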
Printed, PDF |
Page 273
3rd paragraph |
"RPCs between deeper layers of the stack" sounds like a single RPC chain, in which case cancellation propagation is not applicable. Maybe it should be something like "subsequent RPCs issued from within the same function", and "until it eventually times out, despite being unable to make progress." should be something like "until it returns or eventually times out, despite the function being unable to make progress".
Note from the Author or Editor: Revised paragraph on cancellation propagation to read as follows, in new subsection titled "Cancellation propagation":
"""
Propagating cancellations reduces unneeded or doomed work by advising servers in an RPC call stack that their efforts are no longer necessary. To reduce latency, some systems use "hedged requests" [Dea13] to send RPCs to a primary server, then some time later, send the same request to other instances of the same service in case the primary is slow in responding; once the client has received a response from any server, it sends messages to the other servers to cancel the now-superfluous requests. Those requests may themselves transitively fan out to many other servers, so cancellations should be propagated throughout the entire stack.
This approach can also be used to avoid the potential leakage that occurs if an initial RPC has a long deadline, but subsequent critical RPCs between deeper layers of the stack receive errors which can't succeed on retry, or have short deadlines and time out. Using only simple deadline propagation, the initial call continues to use server resources until it eventually times out, despite being doomed to failure. Sending fatal errors or timeouts up the stack and cancelling other RPCs in the call tree prevents unneeded work if the request as a whole can't be fulfilled.
"""
|
Kazushige Hosokawa |
May 03, 2017 |
Aug 04, 2017 |
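The hedged-request pattern described in the revised page 273 paragraph above can be sketched with Python's asyncio: send to the primary, send to another instance if no reply arrives within a delay, then cancel whatever is still outstanding once any reply is received. This is an illustrative sketch only; `hedged_request`, `send`, `hedge_delay`, and the fake servers in `demo` are hypothetical, and a real RPC stack would send explicit cancellation messages instead of cancelling local tasks.

```python
import asyncio


async def hedged_request(send, servers, hedge_delay):
    """Try servers in order, adding one more every hedge_delay seconds."""
    tasks = []
    try:
        for server in servers:
            tasks.append(asyncio.create_task(send(server)))
            # Wait briefly for any outstanding request before hedging further.
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # All servers have been tried; wait for the first reply from any of them.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        # Cancel the now-superfluous requests; finished tasks ignore this.
        for task in tasks:
            task.cancel()


async def demo():
    cancelled = []

    async def fake_send(server):
        # Hypothetical latencies: the primary is slow, the backup is fast.
        delay = {"primary": 0.5, "backup": 0.01}[server]
        try:
            await asyncio.sleep(delay)
            return f"reply from {server}"
        except asyncio.CancelledError:
            cancelled.append(server)  # record that the cancellation arrived
            raise

    reply = await hedged_request(fake_send, ["primary", "backup"], hedge_delay=0.05)
    await asyncio.sleep(0)  # give cancelled tasks a turn to observe cancellation
    return reply, cancelled
```

Running `asyncio.run(demo())` exercises the path where the backup answers first and the slow primary request is cancelled. Because asyncio delivers the cancellation into whatever the cancelled task is awaiting, work that the task fanned out would be cancelled as well, which mirrors the propagation through the stack that the revised paragraph calls for.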
Printed |
Page 298
Figure 23-8. Dueling proposers in Multi-Paxos - "Process 3" |
"Process 3 sends a conflicting Prepare messge" Should be "Process 3 sends a conflicting Prepare message"
Note from the Author or Editor: Replace 'messge' with 'message' in figure 23-8.
|
Rafael Capella |
Jan 07, 2017 |
Jan 13, 2017 |
Other Digital Version |
303
2nd to last paragraph |
Change "minutes" to "seconds".
There are 100 10-millisecond periods per second. There are 6000 10-millisecond periods per minute.
|
Chris Kennelly |
May 04, 2016 |
Jan 13, 2017 |
Printed |
Page 317
1st paragraph |
"... run once a month should not be be skipped." have an extra "be". Should be "... run once a month should not be skipped".
Note from the Author or Editor: Remove excess 'be'.
|
Rafael Capella |
Jan 07, 2017 |
Jan 13, 2017 |
Printed, PDF |
Page 420
The last paragraph |
The sentence "In either case, ..." should not be on the second list item as "either" refers to both types of fires.
Note from the Author or Editor: Move sentence "In either case, the team needs to build tools to control the burn." out of the list into its own paragraph.
|
Kazushige Hosokawa |
Apr 08, 2017 |
Aug 04, 2017 |
PDF |
Page 488
Footnote 3 |
The second sentence of Footnote 3 might be missing some words (e.g., add "for example" before "adding specific ..." and also add some description at the end of the sentence why it is bad).
Note from the Author or Editor: Add "such as" before "adding specific monitoring/alerting".
|
Takeo Sawada |
Apr 21, 2017 |
Aug 04, 2017 |
Printed, PDF |
Page 505
Jai13 |
Jai13 points to https://research.google.com/pubs/pub41761.html, but should point to https://research.google.com/pubs/pub41761.pdf.
|
Michael Stapelberg |
Aug 31, 2016 |
Jan 13, 2017 |
Printed |
Page 508
Pot16 |
Paper is now published, Communications of the ACM, Vol. 59 No. 7, Pages 78-87; http://dl.acm.org/citation.cfm?id=2963119.2854146.
|
Chris Jones |
Jun 29, 2016 |
Jan 13, 2017 |
|