Book description
Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
- "Test Your Disaster Plan"--Tanya Reilly
- "Integrating Empathy into SRE Tools"--Daniella Niyonkuru
- "The Best Advice I Can Give to Teams"--Nicole Forsgren
- "Where to SRE"--Fatema Boxwala
- "Facing That First Page"--Andrew Louis
- "I Have an Error Budget, Now What?"--Alex Hidalgo
- "Get Your Work Recognized: Write a Brag Document"--Julia Evans and Karla Burnett
Table of contents
- Preface
- I. New to SRE
- 1. Site Reliability Engineering in Six Words
- 2. Do We Know Why We Really Want Reliability?
- 3. Building Self-Regulating Processes
- 4. Four Engineers of an SRE Seder
- 5. The Reliability Stack
- 6. Infrastructure: It’s Where the Power Is
- 7. Thinking About Resilience
- 8. Observability in the Development Cycle
- 9. There Is No Magic
- 10. How Wikipedia Is Served to You
- 11. Why You Should Understand (a Little) About TCP
- 12. The Importance of a Management Interface
- 13. When It Comes to Storage, Think Distributed
- 14. The Role of Cardinality
- 15. Security Is like an Onion
- 16. Use Your Words
- 17. Where to SRE
- 18. Dear Future Team
- 19. Sustainability and Burnout
- 20. Don’t Take Advice from Graybeards
- 21. Facing That First Page
- II. Zero to One
- 22. SRE, at Any Size, Is Cultural
- 23. Everyone Is an SRE in a Small Organization
- 24. Auditing Your Environment for Improvements
- 25. With Incident Response, Start Small
- 26. Solo SRE: Effecting Large-Scale Change as a Single Individual
- 27. Design Goals for SLO Measurement
- 28. I Have an Error Budget—Now What?
- 29. How to Change Things
- 30. Methodological Debugging
- 31. How Startups Can Build an SRE Mindset
- 32. Bootstrapping SRE in Enterprises
- 33. It’s Okay Not to Know, and It’s Okay to Be Wrong
- 34. Storytelling Is a Superpower
- 35. Get Your Work Recognized: Write a Brag Document
- III. One to Ten
- 36. Making Work Visible
- 37. An Overlooked Engineering Skill
- 38. Unpacking the On-Call Divide
- 39. The Maestros of Incident Response
- 40. Effortless Incident Management
- 41. If You’re Doing Runbooks, Do Them Well
- 42. Why I Hate Our Playbooks
- 43. What Machines Do Well
- 44. Integrating Empathy into SRE Tools
- 45. Using ChatOps to Implement Empathy
- 46. Move Fast to Unbreak Things
- 47. You Don’t Know for Sure Until It Runs in Production
- 48. Sometimes the Fix Is the Problem
- 49. Legendary
- 50. Metrics Are Not SLIs (The Measure Everything Trap)
- 51. When SLOs Attack: Pathological SLOs and How to Fix Them
- 52. Holistic Approach to Product Reliability
- 53. In Search of the Lost Time
- 54. Unexpected Lessons from Office Hours
- 55. Building Tools for Internal Customers that They Actually Want to Use
- 56. It’s About the Individuals and Interactions
- 57. The Human Baseline in SRE
- 58. Remotely Productive or Productively Remote
- 59. Of Margins and Individuals
- 60. The Importance of Margins in Systems
- 61. Fewer Spreadsheets, More Napkins
- 62. Sneaking in Your DevOps Deliciously
- 63. Effecting SRE Cultural Changes in Enterprises
- 64. To All the SREs I’ve Loved
- 65. Complex: The Most Overloaded Word in Technology
- IV. Ten to Hundred
- 66. The Best Advice I Can Give to Teams
- 67. Create Your Supporting Artifacts
- 68. The Order of Operations for Getting SLO Buy-In
- 69. Heroes Are Necessary, but Hero Culture Is Not
- 70. On-Call Rotations that People Want to Join
- 71. Study of Human Factors and Team Culture to Improve Pager Fatigue
- 72. Optimize for MTTBTB (Mean Time to Back to Bed)
- 73. Mitigating and Preventing Cascading Failures
- 74. On-Call Health: The Metric You Could Be Measuring
- 75. Helping Leaders Prioritize On-Call Health
- 76. The SRE as a Diplomat
- 77. The Forward-Deployed SRE
- 78. Test Your Disaster Plan
- 79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
- 80. The Power of Uniformity
- 81. Bytes per User Value
- 82. Make Your Engineering Blog a Priority
- 83. Don’t Let Anyone Run Code in Your Context
- 84. Trading Places: SRE and Product
- 85. You See Teams, I See Product
- 86. The Performance Emergency Fund
- 87. Important but Not Urgent: Roadmaps for SREs
- V. The Future of SRE
- 88. That 50% Thing
- 89. Following the Path of Safety-Critical Systems
- 90. Applicable and Achievable Static Analysis
- 91. The Importance of Formal Specification
- 92. Risk and Rot in Sociotechnical Systems
- 93. SRE in Crisis
- 94. Expected Risk Limitations
- 95. Beyond Local Risk: Accounting for Angry Birds
- 96. A Word from Software Safety Nerds
- 97. Incidents: A Window into Gaps
- 98. The Third Age of SRE
- Contributors
- Index
- About the Editors
Product information
- Title: 97 Things Every SRE Should Know
- Author(s): Jaime Woo, Emil Stolarsky (editors)
- Release date: November 2020
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492081494