Site Reliability Engineering, 2nd Edition
by Betsy Beyer, Chris Jones, Christof Leng, David Huska, Jennifer Petoff, Niall Richard Murphy
Chapter 3. Safety Engineering for Software: Beyond Reacting to Failure
Traditional incident response focuses on fixing what broke. But what happens when a major outage occurs without a single component failing? This chapter explores how Google is adopting systems thinking and safety engineering to provide a proactive approach to reliability, helping to anticipate and prevent incidents before they happen.
For one eventful day in 2019, if you came to Google Maps to search for places in Japan, you mostly couldn’t find them. As with other outages, SRE got paged and collaborated with product developers to mitigate the incident. We wrote a postmortem once the issue was resolved and Maps in Japan was back to normal. As postmortems go, this was a very good one. It was thoroughly researched, and had a clear and complete explanation of the underlying problems and the sequence of events that led up to the incident. The postmortem included some discussion on how human ...