Site Reliability Engineering

Errata for Site Reliability Engineering


The errata list records errors found after the product was released, together with their corrections. If an error was corrected in a later version or reprint, the date of the correction is shown in the "Date Corrected" column.

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Color Key: Serious Technical Mistake, Minor Technical Mistake, Language or formatting error, Typo, Question, Note, Update



Version | Location | Description | Submitted By | Date Submitted | Date Corrected
Other Digital Version
-1
Footnote 41

On https://landing.google.com/sre/book/chapters/part3.html#id-dA2uaIyFqF4, the link to the US Digital Service page is broken. The new links you should use are either https://obamawhitehouse.archives.gov/participate/united-states-digital-service or https://www.usds.gov/.

Note from the Author or Editor:
Replace https://www.whitehouse.gov/digital/united-states-digital-service with https://www.usds.gov.

Nick Heiner  Jan 02, 2018  Oct 19, 2018
Other Digital Version
-1
first bullet point

On the page https://landing.google.com/sre/book/chapters/practical-alerting.html, in the phrase "The sheer number of components being analyze", "analyze" should be "analyzed".

Nick Heiner  Jan 02, 2018 
Other Digital Version
?
Introduction, Monitoring, 1st paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html, a space is missing between "monitoring" and "strategy" in the sentence […]monitoringstrategy should be constructed thoughtfully[…] and between "common" and "approach" in […]A classic and commonapproach to monitoring[…] and between "an" and "effective" in […]this type of email alerting is not aneffective solution[…] and between "an" and "email" in […]a human to read an emailand decide[…] and between "in" and "response" in […]action needs to be taken inresponse is fundamentally flawed.[…] and between "a" and "human" in […]Monitoring should never require ahuman to interpret[…] and between "software" and "should" in […]Instead, softwareshould do the interpreting[…] and between "when" and "they" in […]notified only whenthey need to take action[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Other Digital Version
?
Introduction, Pursuing Maximum Change Velocity, 6th paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "both" and "development" in the sentence […]and an occurrence that bothdevelopment and SRE teams[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Other Digital Version
?
Introduction, Google's Approach to Service Management, 4th paragraph

In the online version at https://landing.google.com/sre/book/chapters/introduction.html a space is missing between "is" and "useful" in the following sentence: […]in addition had a set of technical skills that isuseful to SRE[…]

Note from the Author or Editor:
Fixed on site.

Raphaël Doursenaud  Feb 02, 2017  Aug 04, 2017
Printed
Page xxiii
last line of page

In the acknowledgements list, "Sean Sechrest" is listed (a Google SRE). Sean's actual name is "Sean Sechrist".

Note from the Author or Editor:
Replaced Sechrest with Sechrist.

Doug Meil  Nov 27, 2017  Oct 19, 2018
PDF
Page 5
3rd paragraph, last sentence

"...SRE can be broken down into two main categories."

Note from the Author or Editor:
"As a whole, SRE can be broken down two main categories." should be "As a whole, SREs can be broken down into two main categories.".

Anonymous  Apr 26, 2016  Jan 13, 2017
Printed
Page 21
3rd & 4th paragraphs

'HTML request' should be 'HTTP request'.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed, PDF, ePub, Safari Books Online
Page 44
last 2 paragraphs

The last 2 paragraphs' last non-parenthetical sentences are the same. "Upper management will probably want a monthly or quarterly assessment, too."

Note from the Author or Editor:
The last two paragraphs on page 44 ("It's both unrealistic and undesirable ..." and "The rate at which SLOs are missed ...") are essentially duplicates in slightly different wording. The second paragraph should be removed.

Daniel Rogers  Sep 14, 2016  Jan 13, 2017
Printed
Page 61
footnote 2

"If 1% of your requests are 10x the average," should be "If 1% of your requests are 50x the average," because 5 s (99th percentile) / 100 ms (average) = 50.

Note from the Author or Editor:
Change "10x" to "50x".

Tatz Sekine  Jan 09, 2017  Jan 13, 2017
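The arithmetic behind the correction on page 61 can be checked in a couple of lines. This is only a sketch that uses the two latency figures quoted in the erratum (a 99th-percentile latency of 5 s and an average of 100 ms); the variable names are illustrative.

```python
# Latency figures as quoted in the erratum for page 61, footnote 2.
p99_latency_ms = 5_000      # 99th-percentile latency: 5 s
average_latency_ms = 100    # average latency: 100 ms

# How many times slower the slowest 1% of requests are than the average.
ratio = p99_latency_ms / average_latency_ms
print(f"99th percentile is {ratio:.0f}x the average")
```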
Printed
Page 77
3rd paragraph

In "why each clusters took six or more weeks", "clusters" should be singular.

Note from the Author or Editor:
Replace "why each clusters took six or more weeks" with "why each cluster took six or more weeks".

Ai Vong  Feb 25, 2018  Oct 19, 2018
PDF
Page 79
Figure 7-2.

The figure doesn't correspond to the described process. According to the description, if the test fails the corresponding fix is called and then the test is re-tried. The figure doesn't represent the re-try but there is a direct arrow from the fix box to the next test.

Note from the Author or Editor:
Figure 7-2 should show another call between TestDNSMonitoringConfigExists and FixDNSMonitoringCreateConfig. Perhaps the existing arrow should be double-ended?

Eleni Siakagianni  Jun 05, 2017  Aug 04, 2017
Printed
Page 92
Figure 8-1

Image isn't formatted for black & white printing, so there's no differentiation between box colours.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed
Page 94
Penultimate paragraph

From Daisuke Yabuki: The last sentence reads "our source-based filesystem [Kem11]." In the Bibliography on page 507, [Kem11] refers to: C. Kemper, "Build in the Cloud: How the Build System works". However, this blog article doesn't mention a source-based filesystem. I believe the correct reference should be: N. York, "Build in the Cloud: Accessing Source Code".

Anonymous  Jul 25, 2017  Aug 04, 2017
Printed
Page 111
First paragraph

The SNMP abbreviation is expanded incorrectly. The book says "SNMP (Simple Networking Monitoring Protocol)", but it should be "SNMP (Simple *Network Management* Protocol)". For proof, see the source referenced in the book: https://technet.microsoft.com/en-us/library/cc776379(v=ws.10).aspx or Wikipedia: https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

Note from the Author or Editor:
Replace "Simple Networking Monitoring Protocol" with "Simple Network Management Protocol".

Vladimir Rutsky  Nov 06, 2016  Jan 13, 2017
Printed
Page 115
Lemur inset

"nonmonotonically decreasing value" means the value is decreasing, but not monotonically. This is contradicted by the second half of the sentence, which states the meaning that the authors intended: that counter values only increase.

Note from the Author or Editor:
"nonmonotonically decreasing" should be replaced with "monotonically non-decreasing".

Cory Lueninghoener  Aug 21, 2016  Jan 13, 2017
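The distinction in the page 115 erratum can be made concrete with a small check. The sketch below is illustrative only: the function name and the sample series are made up, and it simply tests whether each sample is at least as large as the one before it, which is what "monotonically non-decreasing" means for a counter.

```python
def is_monotonically_non_decreasing(samples):
    """Return True if no sample is smaller than the sample before it."""
    return all(earlier <= later for earlier, later in zip(samples, samples[1:]))


# A counter only stays flat or increases (hypothetical sample values).
counter_samples = [0, 3, 3, 7, 12]

# A "nonmonotonically decreasing" series trends down but not at every step,
# which is not how a counter behaves.
nonmonotonic_decreasing_samples = [10, 8, 9, 5]

print(is_monotonically_non_decreasing(counter_samples))
print(is_monotonically_non_decreasing(nonmonotonic_decreasing_samples))
```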
Printed, PDF
Page 116
2nd paragraph

For the results of the task:http_requests:rate10m rule, hostnames in the instance labels should be host0 through host4.

Note from the Author or Editor:
Replace host2 .. host5 with host1 .. host4.

Kazushige Hosokawa  May 20, 2017  Aug 04, 2017
Printed
Page 117
Borgmon code example

Missing ');' in second rule. 'jobwebserver' should be 'job=webserver' in third rule.

Chris Jones  Dec 15, 2016  Jan 13, 2017
Printed, PDF
Page 118
2nd and 3rd paragraphs in the Alerting section

"number of errors" should be "number of errors per second" as per the corresponding borgmon rule expression ({var=dc:http_errors:rate10m,job=webserver} > 1).

Note from the Author or Editor:
Append "per second" to "number of errors exceeds 1:".

Kazushige Hosokawa  Jan 09, 2017  Jan 13, 2017
Printed
Page 164
Paragraph 5

There should be a space in 'usable.Retain'.

Anonymous  Apr 21, 2016  Jan 13, 2017
Printed
Page 165
8th paragraph

The scenario starts on page 161 at 2pm on a Friday. The "Managed Incident" replay has Mary returning to work on the day after the incident, which would be a Saturday. Is that intentional?

Note from the Author or Editor:
Unintentional. Friday should be changed to Thursday.

Dave Smith  Jul 03, 2016  Jan 13, 2017
Printed
Page 172
1st paragraph

Lest the impression be left that no names of any time appear in a postmortem, clarifying that "user" means "end-user" or "customer" might be appropriate.

Note from the Author or Editor:
s/user/end-user/

Dave Smith  Jul 03, 2016  Jan 13, 2017
Printed
Page 189
2nd paragraph

The first sentence of the paragraph says "a subset of servers is upgraded", but a few sentences later it says "the single modified server can be quickly reverted". Whether it is "a subset of servers" or a "single (modified) server", the number of servers should be the same in both sentences so that reverting the canary is described consistently.

Note from the Author or Editor:
Replace "the single modified server" with "the modified servers", and in the next sentence, replace "the upgraded server" with "the upgraded servers".

Tatz Sekine  Apr 05, 2017  Aug 04, 2017
Printed
Page 198
1st paragraph

In the explanation for acceptable flakiness calculation, there is "0.99 (the fraction of patches that can be rejected)", but it might be "0.99 (the fraction of patches that should be accepted)".

Note from the Author or Editor:
Replace "that can be rejected" with "that are accepted".

Tatz Sekine  Apr 07, 2017  Aug 04, 2017
PDF
Page 213
4th paragraph

Missing ')' somewhere in the following sentence: This component formulates a machine-readable request (a protocol buffer that can be understood by the Auxon Solver.

Note from the Author or Editor:
Replace 'request (a' with 'request: a'.

Takeo Sawada  Dec 26, 2016  Jan 13, 2017
Printed
Page 239
2nd paragraph

"could yield the following rounds" is misleading; I believe this should read "could yield the following shuffled_backends arrays for each round". I had to read this several times since the previous paragraph says that "we divide /client/ tasks into rounds", and here the elements are in fact backends. I think that the whole section is hard to read and could be rephrased in much simpler words; I'd be happy to provide suggestions on request.

Note from the Author or Editor:
"yield the following rounds:" should read "yield the following shuffled backends:", where backends is in code font.

Patrik Fimml  Aug 24, 2016  Jan 13, 2017
Printed
Page 250
5th (or, last) paragraph

This paragraph states no assumption about the value of the multiplier K, but the sentence "backends end up rejecting one request for each request they actually process" implies that K is 2. In the next paragraph, on the following page, another sentence, "allowing roughly half of the backend resources to be consumed by ...", also implies that K is 2. However, it is only a few paragraphs later that the text says: "We generally prefer the 2x multiplier".

Note from the Author or Editor:
Move paragraph "We've found adaptive ... latency penalties." to be the second-last paragraph in the section, immediately before "One additional consideration ... to be expensive.".

Tatz Sekine  Dec 23, 2016  Jan 13, 2017
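To see why "one rejection for each request processed" corresponds to K = 2, the following sketch evaluates a client-side rejection probability of the form max(0, (requests - K * accepts) / (requests + 1)). That formula is recalled from the adaptive throttling section of the book and is not quoted in this erratum, so treat it as an assumption to verify against the text; the function names and the traffic numbers are purely illustrative, and the calculation is a steady-state approximation rather than a simulation.

```python
def client_rejection_probability(requests, accepts, k):
    """Assumed form of the client-side rejection probability (see lead-in)."""
    return max(0.0, (requests - k * accepts) / (requests + 1))


def backend_rejects_per_accept(requests, accepts, k):
    """Approximate backend rejections per accepted request in steady state."""
    p = client_rejection_probability(requests, accepts, k)
    forwarded = requests * (1.0 - p)   # requests the client still sends on
    rejected_by_backend = forwarded - accepts
    return rejected_by_backend / accepts


# Hypothetical overload: clients attempt 10,000 requests, the backend accepts 1,000.
ratio_k2 = backend_rejects_per_accept(requests=10_000, accepts=1_000, k=2)
ratio_k3 = backend_rejects_per_accept(requests=10_000, accepts=1_000, k=3)
print(f"K=2: about {ratio_k2:.2f} backend rejections per accepted request")
print(f"K=3: about {ratio_k3:.2f} backend rejections per accepted request")
```

Under these assumptions, K = 2 gives roughly one backend rejection per accepted request, which matches the wording the erratum points at, while a larger K lets proportionally more doomed requests reach the backend.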
PDF
Page 265
Near top

"This is the most important important exercise you should conduct in order to prevent server overload." should probably only have "important" once.

Note from the Author or Editor:
s/most important important exercise/most important exercise/

Omer Zach  Mar 01, 2017  Aug 04, 2017
Printed, PDF
Page 268
1st paragraph of the "Retries" section

As per the Go code on the same page, the number of backend RPCs per logical request should be 20, not 10.

Note from the Author or Editor:
Change the Go code to try 10 times instead of 20. (This avoids follow-on changes that would be needed in subsequent paragraphs were the previous paragraph updated to say 20 retries, matching the code.)

Kazushige Hosokawa  Jan 24, 2017  Aug 04, 2017
Printed, PDF
Page 273
3rd paragraph

"RPCs between deeper layers of the stack" sounds like a single RPC chain, in which case cancellation propagation is not applicable. Maybe it should be something like "subsequent RPCs issued from within the same function", and "until it eventually times out, despite being unable to make progress." should be something like "until it returns or eventually times out, despite the function being unable to make progress".

Note from the Author or Editor:
Revised paragraph on cancellation propagation to read as follows, in new subsection titled "Cancellation propagation": """ Propagating cancellations reduces unneeded or doomed work by advising servers in an RPC call stack that their efforts are no longer necessary. To reduce latency, some systems use "hedged requests" [Dea13] to send RPCs to a primary server, then some time later, send the same request to other instances of the same service in case the primary is slow in responding; once the client has received a response from any server, it sends messages to the other servers to cancel the now-superfluous requests. Those requests may themselves transitively fan out to many other servers, so cancellations should be propagated throughout the entire stack. This approach can also be used to avoid the potential leakage that occurs if an initial RPC has a long deadline, but subsequent critical RPCs between deeper layers of the stack receive errors which can't succeed on retry, or have short deadlines and time out. Using only simple deadline propagation, the initial call continues to use server resources until it eventually times out, despite being doomed to failure. Sending fatal errors or timeouts up the stack and cancelling other RPCs in the call tree prevents unneeded work if the request as a whole can't be fulfilled. """

Kazushige Hosokawa  May 03, 2017  Aug 04, 2017
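The hedged-request pattern described in the revised paragraph (send to a primary, send the same request elsewhere after a delay, then cancel the now-superfluous request) can be sketched with asyncio. This is a minimal illustration, not the book's implementation: `call_backend` is a hypothetical stand-in for an RPC, and the latencies and hedge delay are made-up values chosen so that the backup wins.

```python
import asyncio


async def call_backend(name, latency_s, cancelled):
    """Hypothetical stand-in for an RPC; records its name if it is cancelled."""
    try:
        await asyncio.sleep(latency_s)
        return name
    except asyncio.CancelledError:
        cancelled.append(name)
        raise


async def hedged_request(hedge_delay_s=0.01):
    cancelled = []
    primary = asyncio.create_task(call_backend("primary", 0.2, cancelled))
    # Give the primary a head start; hedge only if it has not answered yet.
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result(), cancelled
    backup = asyncio.create_task(call_backend("backup", 0.01, cancelled))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    # A response has arrived, so cancel the request that is still in flight.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result(), cancelled


winner, cancelled = asyncio.run(hedged_request())
print(winner, cancelled)
```

In a real call tree, the cancelled server would in turn cancel any RPCs it had fanned out, which is the propagation the revised paragraph describes.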
Printed
Page 298
Figure 23-8. Dueling proposers in Multi-Paxos - "Process 3"

"Process 3 sends a conflicting Prepare messge" Should be "Process 3 sends a conflicting Prepare message"

Note from the Author or Editor:
Replace 'messge' with 'message' in figure 23-8.

Rafael Capella  Jan 07, 2017  Jan 13, 2017
Other Digital Version
303
2nd to last paragraph

Change "minutes" to "seconds". There are 100 10-millisecond periods per second. There are 6000 10-millisecond periods per minute.

Chris Kennelly  May 04, 2016  Jan 13, 2017
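The counts given in the page 303 erratum follow directly from the period length. A quick check, using only the 10-millisecond period stated above (the constant names are illustrative):

```python
PERIOD_MS = 10  # length of one period, as stated in the erratum

periods_per_second = 1_000 // PERIOD_MS    # milliseconds in a second / period
periods_per_minute = 60_000 // PERIOD_MS   # milliseconds in a minute / period

print(periods_per_second, periods_per_minute)
```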
Printed
Page 317
1st paragraph

"... run once a month should not be be skipped." has an extra "be". It should be "... run once a month should not be skipped".

Note from the Author or Editor:
Remove excess 'be'.

Rafael Capella  Jan 07, 2017  Jan 13, 2017
Printed, PDF
Page 420
The last paragraph

The sentence "In either case, ..." should not be on the second list item as "either" refers to both types of fires.

Note from the Author or Editor:
Move sentence "In either case, the team needs to build tools to control the burn." out of the list into its own paragraph.

Kazushige Hosokawa  Apr 08, 2017  Aug 04, 2017
PDF
Page 488
Footnote 3

The second sentence of Footnote 3 might be missing some words (e.g., add "for example" before "adding specific ..." and also add some description at the end of the sentence why it is bad).

Note from the Author or Editor:
Add "such as" before "adding specific monitoring/alerting".

Takeo Sawada  Apr 21, 2017  Aug 04, 2017
Printed, PDF
Page 505
Jai13

Jai13 points to https://research.google.com/pubs/pub41761.html, but should point to https://research.google.com/pubs/pub41761.pdf.

Michael Stapelberg  Aug 31, 2016  Jan 13, 2017
Printed
Page 508
Pot16

Paper is now published, Communications of the ACM, Vol. 59 No. 7, Pages 78-87; http://dl.acm.org/citation.cfm?id=2963119.2854146.

Chris Jones  Jun 29, 2016  Jan 13, 2017