97 Things Every SRE Should Know

by Emil Stolarsky, Jaime Woo

November 2020

Beginner to intermediate

250 pages

7h 41m

English

O'Reilly Media, Inc.

Preface
How We Structured the Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. New to SRE
1. Site Reliability Engineering in Six Words
Alex Hidalgo
2. Do We Know Why We Really Want Reliability?
Niall Murphy
3. Building Self-Regulating Processes
Denise Yu
4. Four Engineers of an SRE Seder
Jacob Scott
5. The Reliability Stack
Alex Hidalgo
6. Infrastructure: It’s Where the Power Is
Charity Majors
7. Thinking About Resilience
Justin Li
8. Observability in the Development Cycle
Charity Majors and Liz Fong-Jones

9. There Is No Magic
Bouke van der Bijl
10. How Wikipedia Is Served to You
Effie Mouzeli
11. Why You Should Understand (a Little) About TCP
Julia Evans
12. The Importance of a Management Interface
Salim Virji
13. When It Comes to Storage, Think Distributed
Salim Virji
14. The Role of Cardinality
Charity Majors and Liz Fong-Jones
15. Security Is like an Onion
Lucas Fontes
16. Use Your Words
Tanya Reilly
17. Where to SRE
Fatema Boxwala
18. Dear Future Team
Frances Rees
19. Sustainability and Burnout
Denise Yu
20. Don’t Take Advice from Graybeards
John Looney
21. Facing That First Page
Andrew Louis
II. Zero to One
22. SRE, at Any Size, Is Cultural
Matthew Huxtable
23. Everyone Is an SRE in a Small Organization
Matthew Huxtable
24. Auditing Your Environment for Improvements
Joan O’Callaghan
25. With Incident Response, Start Small
Thai Wood
26. Solo SRE: Effecting Large-Scale Change as a Single Individual
Ashley Poole
27. Design Goals for SLO Measurement
Ben Sigelman
28. I Have an Error Budget—Now What?
Alex Hidalgo
29. How to Change Things
Joan O’Callaghan
30. Methodological Debugging
Avishai Ish-Shalom and Nati Cohen
31. How Startups Can Build an SRE Mindset
Tamara Miner
32. Bootstrapping SRE in Enterprises
Vanessa Yiu
33. It’s Okay Not to Know, and It’s Okay to Be Wrong
Todd Palino
34. Storytelling Is a Superpower
Anita Clarke
35. Get Your Work Recognized: Write a Brag Document
Julia Evans and Karla Burnett
III. One to Ten
36. Making Work Visible
Lorin Hochstein
37. An Overlooked Engineering Skill
Murali Suriar
38. Unpacking the On-Call Divide
Jason Hand
39. The Maestros of Incident Response
Andrew Louis
Stop the Bleeding
What’s Everyone Doing?
40. Effortless Incident Management
Suhail Patel, Miles Bryant, and Chris Evans
41. If You’re Doing Runbooks, Do Them Well
Spike Lindsey
42. Why I Hate Our Playbooks
Frances Rees
43. What Machines Do Well
Michelle Brush
44. Integrating Empathy into SRE Tools
Daniella Niyonkuru
45. Using ChatOps to Implement Empathy
Daniella Niyonkuru
46. Move Fast to Unbreak Things
Michelle Brush
47. You Don’t Know for Sure Until It Runs in Production
Ingrid Epure
48. Sometimes the Fix Is the Problem
Jake Pittis
49. Legendary
Elise Gale
50. Metrics Are Not SLIs (The Measure Everything Trap)
Brian Murphy
51. When SLOs Attack: Pathological SLOs and How to Fix Them
Narayan Desai
52. Holistic Approach to Product Reliability
Kristine Chen and Bart Ponurkiewicz
53. In Search of the Lost Time
Ingrid Epure
54. Unexpected Lessons from Office Hours
Tamara Miner
55. Building Tools for Internal Customers that They Actually Want to Use
Vinessa Wan
56. It’s About the Individuals and Interactions
Vinessa Wan
57. The Human Baseline in SRE
Effie Mouzeli
58. Remotely Productive or Productively Remote
Avleen Vig
59. Of Margins and Individuals
Kurt Andersen
60. The Importance of Margins in Systems
Kurt Andersen
61. Fewer Spreadsheets, More Napkins
Jacob Bednarz
62. Sneaking in Your DevOps Deliciously
Vinessa Wan
63. Effecting SRE Cultural Changes in Enterprises
Vanessa Yiu
64. To All the SREs I’ve Loved
Felix Glaser
65. Complex: The Most Overloaded Word in Technology
Laura Nolan
IV. Ten to Hundred
66. The Best Advice I Can Give to Teams
Nicole Forsgren
67. Create Your Supporting Artifacts
Daria Barteneva and Eva Parish
68. The Order of Operations for Getting SLO Buy-In
David K. Rensin
69. Heroes Are Necessary, but Hero Culture Is Not
Lei Lopez
70. On-Call Rotations that People Want to Join
Miles Bryant, Chris Evans, and Suhail Patel
71. Study of Human Factors and Team Culture to Improve Pager Fatigue
Daria Barteneva
72. Optimize for MTTBTB (Mean Time to Back to Bed)
Spike Lindsey
73. Mitigating and Preventing Cascading Failures
Rita Lu
74. On-Call Health: The Metric You Could Be Measuring
Caitie McCaffrey
75. Helping Leaders Prioritize On-Call Health
Caitie McCaffrey
Bring Quantitative Data
Link SLAs to On-Call Health
Treat On-Call Health like a Feature
Measure Attrition
76. The SRE as a Diplomat
Johnny Boursiquot
77. The Forward-Deployed SRE
Johnny Boursiquot
78. Test Your Disaster Plan
Tanya Reilly
79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
Jennifer Petoff
80. The Power of Uniformity
Chris Evans, Suhail Patel, and Miles Bryant
81. Bytes per User Value
Arshia Mufti
82. Make Your Engineering Blog a Priority
Anita Clarke
83. Don’t Let Anyone Run Code in Your Context
John Looney
84. Trading Places: SRE and Product
Shubheksha Jalan
85. You See Teams, I See Product
Avleen Vig
86. The Performance Emergency Fund
Dawn Parzych
87. Important but Not Urgent: Roadmaps for SREs
Laura Nolan
V. The Future of SRE
88. That 50% Thing
Tanya Reilly
89. Following the Path of Safety-Critical Systems
Heidy Khlaaf
90. Applicable and Achievable Static Analysis
Heidy Khlaaf
91. The Importance of Formal Specification
Hillel Wayne
92. Risk and Rot in Sociotechnical Systems
Laura Nolan
93. SRE in Crisis
Niall Murphy
94. Expected Risk Limitations
Blake Bisset
95. Beyond Local Risk: Accounting for Angry Birds
Blake Bisset
96. A Word from Software Safety Nerds
J. Paul Reed
97. Incidents: A Window into Gaps
Lorin Hochstein
98. The Third Age of SRE
Björn “Beorn” Rabenstein
Contributors
Index
About the Editors

Content preview from 97 Things Every SRE Should Know

Chapter 78. Test Your Disaster Plan

Tanya Reilly

Squarespace

Systems fail. That’s fine. Site reliability is a whole discipline that specializes in anticipating and mitigating failure. We build systems that are observable, introspectable, and recoverable, and that limit the blast radius of an outage. We design for failure.

Failure planning often includes fallback plans, alternate pathways through our code, and systems or processes that we’ll use when our regular mechanisms fail. A client may retry a failed request, for example, hoping it hits a healthier replica next time. A leader-elected system may move leadership away from an unresponsive server. Fallback plans sometimes involve humans; every time we page an on-caller or take some action in response to an outage, we’re executing a fallback plan.
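The client-side retry mentioned above could look roughly like the following. This is a minimal sketch, not a production client: `call_with_retry`, `AllReplicasFailed`, and the two stand-in replica functions are hypothetical names invented for illustration, and a real client would normally add backoff, jitter, and a retry budget.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every attempt failed (hypothetical name for this sketch)."""


def call_with_retry(
    replicas: Sequence[Callable[[], str]],
    max_attempts: int = 3,
) -> str:
    """Send the request to successive replicas until one of them succeeds."""
    if not replicas or max_attempts < 1:
        raise ValueError("need at least one replica and one attempt")
    errors = []
    for attempt in range(max_attempts):
        # Rotate through the replicas so that a retry lands on a different one.
        replica = replicas[attempt % len(replicas)]
        try:
            return replica()
        except ConnectionError as exc:
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} attempts failed") from errors[-1]


# Stand-in replicas (assumptions): one always fails, one always answers.
def unhealthy_replica() -> str:
    raise ConnectionError("replica unavailable")


def healthy_replica() -> str:
    return "ok"


# The first attempt fails; the retry reaches the healthier replica.
result = call_with_retry([unhealthy_replica, healthy_replica])
```

Rotating to a different replica on each attempt is the design choice that matters here: retrying the same unhealthy server would only repeat the failure.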

Our regular pathways are constantly in use. We know they work, and we notice when they fail. Many of our fallback plans are also well-traveled, running so frequently that we’ll find out if they have problems. What about the less-traveled paths? If we only use them during emergencies, we might not find out they don’t work until we really need them.
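One way to keep a less-traveled path honest is to exercise it on purpose. The sketch below assumes a simple primary-plus-fallback call and a drill that simulates a primary outage so the fallback actually runs. All names here (`fetch_with_fallback`, `run_fallback_drill`, and the two example fallbacks) are hypothetical and only illustrate the idea.

```python
from typing import Callable


def fetch_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> str:
    """Use the primary path; switch to the fallback only if the primary fails."""
    try:
        return primary()
    except Exception:
        return fallback()


def run_fallback_drill(fallback: Callable[[], str]) -> bool:
    """Simulate a primary outage so the rarely used fallback is exercised."""

    def broken_primary() -> str:
        raise RuntimeError("simulated primary outage (drill)")

    try:
        fetch_with_fallback(broken_primary, fallback)
    except Exception:
        return False  # the fallback itself is broken; better to learn this now
    return True


# Stand-in fallbacks (assumptions): one still works, one has quietly rotted.
def working_fallback() -> str:
    return "served from fallback"


def rotted_fallback() -> str:
    raise RuntimeError("stale credentials")


drill_results = {
    "working": run_fallback_drill(working_fallback),
    "rotted": run_fallback_drill(rotted_fallback),
}
```

Run on a schedule, a drill like this turns an emergency-only path into one that is traveled regularly, so a rotted fallback shows up as a failed drill rather than during a real outage.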

An extreme illustration of this problem is an industry classic: the gently rotting disaster recovery site. A team anticipates a massive failure of their primary site and builds a replica of their system in another region or another data center. ...

Publisher Resources

ISBN: 9781492081487
