Book description
Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
- "Test Your Disaster Plan"--Tanya Reilly
- "Integrating Empathy into SRE Tools"--Daniella Niyonkuru
- "The Best Advice I Can Give to Teams"--Nicole Forsgren
- "Where to SRE"--Fatema Boxwala
- "Facing That First Page"--Andrew Louis
- "I Have an Error Budget, Now What?"--Alex Hidalgo
- "Get Your Work Recognized: Write a Brag Document"--Julia Evans and Karla Burnett
Table of contents
- Preface
- I. New to SRE
- 1. Site Reliability Engineering in Six Words
- 2. Do We Know Why We Really Want Reliability?
- 3. Building Self-Regulating Processes
- 4. Four Engineers of an SRE Seder
- 5. The Reliability Stack
- 6. Infrastructure: It’s Where the Power Is
- 7. Thinking About Resilience
- 8. Observability in the Development Cycle
- 9. There Is No Magic
- 10. How Wikipedia Is Served to You
- 11. Why You Should Understand (a Little) About TCP
- 12. The Importance of a Management Interface
- 13. When It Comes to Storage, Think Distributed
- 14. The Role of Cardinality
- 15. Security Is like an Onion
- 16. Use Your Words
- 17. Where to SRE
- 18. Dear Future Team
- 19. Sustainability and Burnout
- 20. Don’t Take Advice from Graybeards
- 21. Facing That First Page
- II. Zero to One
- 22. SRE, at Any Size, Is Cultural
- 23. Everyone Is an SRE in a Small Organization
- 24. Auditing Your Environment for Improvements
- 25. With Incident Response, Start Small
- 26. Solo SRE: Effecting Large-Scale Change as a Single Individual
- 27. Design Goals for SLO Measurement
- 28. I Have an Error Budget—Now What?
- 29. How to Change Things
- 30. Methodological Debugging
- 31. How Startups Can Build an SRE Mindset
- 32. Bootstrapping SRE in Enterprises
- 33. It’s Okay Not to Know, and It’s Okay to Be Wrong
- 34. Storytelling Is a Superpower
- 35. Get Your Work Recognized: Write a Brag Document
- III. One to Ten
- 36. Making Work Visible
- 37. An Overlooked Engineering Skill
- 38. Unpacking the On-Call Divide
- 39. The Maestros of Incident Response
- 40. Effortless Incident Management
- 41. If You’re Doing Runbooks, Do Them Well
- 42. Why I Hate Our Playbooks
- 43. What Machines Do Well
- 44. Integrating Empathy into SRE Tools
- 45. Using ChatOps to Implement Empathy
- 46. Move Fast to Unbreak Things
- 47. You Don’t Know for Sure Until It Runs in Production
- 48. Sometimes the Fix Is the Problem
- 49. Legendary
- 50. Metrics Are Not SLIs (The Measure Everything Trap)
- 51. When SLOs Attack: Pathological SLOs and How to Fix Them
- 52. Holistic Approach to Product Reliability
- 53. In Search of the Lost Time
- 54. Unexpected Lessons from Office Hours
- 55. Building Tools for Internal Customers that They Actually Want to Use
- 56. It’s About the Individuals and Interactions
- 57. The Human Baseline in SRE
- 58. Remotely Productive or Productively Remote
- 59. Of Margins and Individuals
- 60. The Importance of Margins in Systems
- 61. Fewer Spreadsheets, More Napkins
- 62. Sneaking in Your DevOps Deliciously
- 63. Effecting SRE Cultural Changes in Enterprises
- 64. To All the SREs I’ve Loved
- 65. Complex: The Most Overloaded Word in Technology
- IV. Ten to Hundred
- 66. The Best Advice I Can Give to Teams
- 67. Create Your Supporting Artifacts
- 68. The Order of Operations for Getting SLO Buy-In
- 69. Heroes Are Necessary, but Hero Culture Is Not
- 70. On-Call Rotations that People Want to Join
- 71. Study of Human Factors and Team Culture to Improve Pager Fatigue
- 72. Optimize for MTTBTB (Mean Time to Back to Bed)
- 73. Mitigating and Preventing Cascading Failures
- 74. On-Call Health: The Metric You Could Be Measuring
- 75. Helping Leaders Prioritize On-Call Health
- 76. The SRE as a Diplomat
- 77. The Forward-Deployed SRE
- 78. Test Your Disaster Plan
- 79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
- 80. The Power of Uniformity
- 81. Bytes per User Value
- 82. Make Your Engineering Blog a Priority
- 83. Don’t Let Anyone Run Code in Your Context
- 84. Trading Places: SRE and Product
- 85. You See Teams, I See Product
- 86. The Performance Emergency Fund
- 87. Important but Not Urgent: Roadmaps for SREs
- V. The Future of SRE
- 88. That 50% Thing
- 89. Following the Path of Safety-Critical Systems
- 90. Applicable and Achievable Static Analysis
- 91. The Importance of Formal Specification
- 92. Risk and Rot in Sociotechnical Systems
- 93. SRE in Crisis
- 94. Expected Risk Limitations
- 95. Beyond Local Risk: Accounting for Angry Birds
- 96. A Word from Software Safety Nerds
- 97. Incidents: A Window into Gaps
- 98. The Third Age of SRE
- Contributors
- Index
- About the Editors
Product information
- Title: 97 Things Every SRE Should Know
- Author(s): Jaime Woo, Emil Stolarsky (editors)
- Release date: November 2020
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492081494