The Site Reliability Workbook
by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Preface
When we wrote the original Site Reliability Engineering book, we had a goal: explain the philosophy and the principles of production engineering and operations at Google. The book was our attempt to share our teams’ best practices and lessons with the rest of the computing world. We assumed that the SRE book might appeal to a modest number of engineers working in large, reliability-conscious endeavors, and that both the quantity and the focus of the content would tend to limit the book’s appeal.
As it turned out, we were happily mistaken on both counts.
To our surprise and delight, the SRE book was a best-seller in computing for an exhilarating period after its release, and it was not just being sold or downloaded; it was being read. We received questions from around the world about the book, the team, the practices, and the outcomes. We were asked to speak about chapters, approaches, and incidents. We found ourselves in the unexpected position of having to turn down outside requests because we were out of cycles.
Like most success disasters, the SRE book created an opportunity to respond with human effort (“Hire more people! Do more speaking engagements!”) or with something more scalable. Since we are SREs, it will surprise few readers that we gravitated toward the latter approach. We decided to write a second SRE book, one that expanded on the content we were most frequently being asked to speak about, and that addressed the most common questions readers had about the first ...