Errata for Site Reliability Engineering

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version | Location | Description | Submitted by | Date submitted
Other Digital Version -1
Chapter 15 - Postmortem Culture: Learning from Failure

In the online version of the book, the link to the Example Postmortem in Chapter 15 (https://sre.google/sre-book/postmortem-culture/) is broken: it points to the current chapter, whereas it should link to Appendix D - Example Postmortem (https://sre.google/sre-book/example-postmortem/).

Anonymous  Aug 31, 2021 
Other Digital Version Chapter 22 - Addressing Cascading Failures
In a section about retries (named "Retries")

https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/#retires

There is a snippet of code written in Go which has a few issues.
func exampleRpcCall(client pb.ExampleClient, request pb.Request) *pb.Response {
    // Set RPC timeout to 5 seconds.
    opts := grpc.WithTimeout(5 * time.Second)

    // Try up to 10 times to make the RPC call.
    attempts := 10
    for attempts > 0 {
        conn, err := grpc.Dial(*serverAddr, opts...)
        if err != nil {
            // Something went wrong in setting up the connection. Try again.
            attempts--
            continue
        }
        defer conn.Close()

        // Create a client stub and make the RPC call.
        client := pb.NewBackendClient(conn)
        response, err := client.MakeRequest(context.Background, request)
        if err != nil {
            // Something went wrong in making the call. Try again.
            attempts--
            continue
        }

        return response
    }

    grpclog.Fatalf("ran out of attempts")
}

First, there is no point in passing client as an argument: the function never uses it, and it is shadowed inside the for loop.
Second, although grpclog.Fatalf calls os.Exit and so terminates the program, the compiler cannot infer this, so exampleRpcCall fails to compile without a return statement at the end (see the fixed code snippet below).
Third, the function name: exampleRPCCall would be more idiomatic.
Fourth, grpc.WithTimeout returns a single option, not a slice of options, so there is no point in using '...' in the grpc.Dial call. Since it is a single option, it would also be better named 'opt' (without the 's').
Fifth, the for loop establishes a new connection on each attempt, which is probably not what the author intended. Most likely, each connection was meant to be closed after a failure, but because of defer the closures are postponed until the function returns.
Sixth and last, a gRPC request is usually passed as a pointer.

Proposed fixed snippet:
func exampleRPCCall(request *pb.Request) *pb.Response {
    // Set RPC timeout to 5 seconds.
    opt := grpc.WithTimeout(5 * time.Second)

    // Try up to 10 times to make the RPC call.
    attempts := 10
    for attempts > 0 {
        conn, err := grpc.Dial(*serverAddr, opt)
        if err != nil {
            // Something went wrong in setting up the connection. Try again.
            attempts--
            continue
        }

        // Create a client stub and make the RPC call.
        client := pb.NewBackendClient(conn)
        response, err := client.MakeRequest(context.Background(), request)
        if err != nil {
            // Something went wrong in making the call.
            // Close this attempt's connection and try again.
            attempts--
            conn.Close()
            continue
        }

        // The call succeeded; release the connection before returning.
        conn.Close()
        return response
    }

    grpclog.Fatalf("ran out of attempts")
    return nil
}

Maksim Fedoseev  Mar 03, 2019 
PDF Page Load Balancing in the Datacenter https://sre.google/sre-book/load-balancing-datacenter/
A few paragraphs below 'A Subset Selection Algorithm: Deterministic Subsetting'

I already got approval from Alejandro Forero Cuervo

The example has 10 clients, not 12, so client[10] and client[11] do not exist. Most readers will infer this, but a minor error like this is still worth correcting.

Instead of saying,
* `client[2]`, `client[6]`, `client[10]` will use `subset[2]`

* `client[3]`, `client[7]`, `client[11]` will use `subset[3]`

it should be

* `client[2]`, `client[6]` will use `subset[2]`

* `client[3]`, `client[7]` will use `subset[3]`

Xing Feng  Dec 14, 2023 
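
The correction can be checked with a short sketch. It assumes that a client's subset is its ID modulo the number of subsets, and that there are 4 subsets, which is what the quoted list implies (the client IDs sharing a subset differ by 4). The function names are hypothetical, and the sketch deliberately leaves out the per-round shuffling of backends.

```go
package main

import "fmt"

// subsetID maps a client to a subset by taking the client ID modulo the
// number of subsets (an assumption inferred from the quoted list).
func subsetID(clientID, subsetCount int) int {
	return clientID % subsetCount
}

// clientsBySubset groups client IDs 0..numClients-1 by the subset they use.
func clientsBySubset(numClients, subsetCount int) map[int][]int {
	groups := make(map[int][]int)
	for c := 0; c < numClients; c++ {
		id := subsetID(c, subsetCount)
		groups[id] = append(groups[id], c)
	}
	return groups
}

func main() {
	const numClients, subsetCount = 10, 4 // assumed values, see the text above
	groups := clientsBySubset(numClients, subsetCount)
	for s := 0; s < subsetCount; s++ {
		fmt.Printf("subset[%d]: clients %v\n", s, groups[s])
	}
}
```

With 10 clients, subset[2] and subset[3] each receive only two clients, which matches the corrected list.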
Printed Page 38
4th paragraph

The text refers to:
[the current published target for Google Compute Engine availability is]
“three and a half nines”—99.95% availability.

This is misleading: an added 5 is not *half* a nine, but rather ~0.3 nines.
You can see a detailed discussion on Wikipedia:
https://en.wikipedia.org/w/index.php?title=High_availability&oldid=1041037687#%22Nines%22

This is a problem because this terminology misrepresents the gap between 99.9% vs. 99.95% vs. 99.99%: the second is a much bigger step.
More accurate terminology is used in metallurgy, which calls this "three nines five".
However, the misleading term "three and a half nines" is common.

I'd suggest phrasing it with a *footnote*, similar to the treatment of the confusion of SLO vs. SLA on page 40:
[the current published target for Google Compute Engine availability is]
“three nines five”—99.95% availability.^1

[1] 99.95% is sometimes misleadingly referred to as “three and a half nines”: going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.

For reference, Chaos Engineering phrases it as:
https://livebook.manning.com/book/chaos-engineering/chapter-1/v-7/35
Sometimes, we also use phrases like “three nines five” or “three and a half nines” to mean 99.95%, although the latter is not technically correct (going from 99.9% to 99.95% is a factor of 2, but going from 99.9% to 99.99% is a factor of 5).
[There's a typo here: should read "99.9*5*% to 99.99%"; submitted separately.]

Nils Barth  Sep 17, 2021 
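
The "~0.3 nines" figure can be reproduced with a small sketch, assuming the common logarithmic definition in which the number of nines is -log10(1 - availability). The function name nines is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// nines returns the "number of nines" of an availability fraction,
// assuming the logarithmic definition -log10(1 - availability).
func nines(availability float64) float64 {
	return -math.Log10(1 - availability)
}

func main() {
	for _, a := range []float64{0.999, 0.9995, 0.9999} {
		fmt.Printf("%.2f%% availability = %.2f nines\n", a*100, nines(a))
	}
}
```

Under this definition 99.95% comes out at roughly 3.3 nines, not 3.5, which is the gap the erratum describes.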
Printed, Other Digital Version Page 56
Node and machine section, first paragraph

Can a container really run a kernel? IMHO, that is surprising and, if it is not an error, requires a footnote.

Aharon Haravon  Jun 19, 2021