97 Things Every SRE Should Know

Book description

Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.

Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.

Some of the 97 things you should know:

  • "Test Your Disaster Plan"--Tanya Reilly
  • "Integrating Empathy into SRE Tools"--Daniella Niyonkuru
  • "The Best Advice I Can Give to Teams"--Nicole Forsgren
  • "Where to SRE"--Fatema Boxwala
  • "Facing That First Page"--Andrew Louis
  • "I Have an Error Budget, Now What?"--Alex Hidalgo
  • "Get Your Work Recognized: Write a Brag Document"--Julia Evans and Karla Burnett

Table of contents

  1. Preface
    1. How We Structured the Book
    2. O’Reilly Online Learning
    3. How to Contact Us
    4. Acknowledgments
  2. I. New to SRE
  3. 1. Site Reliability Engineering in Six Words
    1. Alex Hidalgo
  4. 2. Do We Know Why We Really Want Reliability?
    1. Niall Murphy
  5. 3. Building Self-Regulating Processes
    1. Denise Yu
  6. 4. Four Engineers of an SRE Seder
    1. Jacob Scott
  7. 5. The Reliability Stack
    1. Alex Hidalgo
  8. 6. Infrastructure: It’s Where the Power Is
    1. Charity Majors
  9. 7. Thinking About Resilience
    1. Justin Li
  10. 8. Observability in the Development Cycle
    1. Charity Majors and Liz Fong-Jones
  11. 9. There Is No Magic
    1. Bouke van der Bijl
  12. 10. How Wikipedia Is Served to You
    1. Effie Mouzeli
  13. 11. Why You Should Understand (a Little) About TCP
    1. Julia Evans
  14. 12. The Importance of a Management Interface
    1. Salim Virji
  15. 13. When It Comes to Storage, Think Distributed
    1. Salim Virji
  16. 14. The Role of Cardinality
    1. Charity Majors and Liz Fong-Jones
  17. 15. Security Is like an Onion
    1. Lucas Fontes
  18. 16. Use Your Words
    1. Tanya Reilly
  19. 17. Where to SRE
    1. Fatema Boxwala
  20. 18. Dear Future Team
    1. Frances Rees
  21. 19. Sustainability and Burnout
    1. Denise Yu
  22. 20. Don’t Take Advice from Graybeards
    1. John Looney
  23. 21. Facing That First Page
    1. Andrew Louis
  24. II. Zero to One
  25. 22. SRE, at Any Size, Is Cultural
    1. Matthew Huxtable
  26. 23. Everyone Is an SRE in a Small Organization
    1. Matthew Huxtable
  27. 24. Auditing Your Environment for Improvements
    1. Joan O’Callaghan
  28. 25. With Incident Response, Start Small
    1. Thai Wood
  29. 26. Solo SRE: Effecting Large-Scale Change as a Single Individual
    1. Ashley Poole
  30. 27. Design Goals for SLO Measurement
    1. Ben Sigelman
  31. 28. I Have an Error Budget—Now What?
    1. Alex Hidalgo
  32. 29. How to Change Things
    1. Joan O’Callaghan
  33. 30. Methodological Debugging
    1. Avishai Ish-Shalom and Nati Cohen
  34. 31. How Startups Can Build an SRE Mindset
    1. Tamara Miner
  35. 32. Bootstrapping SRE in Enterprises
    1. Vanessa Yiu
  36. 33. It’s Okay Not to Know, and It’s Okay to Be Wrong
    1. Todd Palino
  37. 34. Storytelling Is a Superpower
    1. Anita Clarke
  38. 35. Get Your Work Recognized: Write a Brag Document
    1. Julia Evans and Karla Burnett
  39. III. One to Ten
  40. 36. Making Work Visible
    1. Lorin Hochstein
  41. 37. An Overlooked Engineering Skill
    1. Murali Suriar
  42. 38. Unpacking the On-Call Divide
    1. Jason Hand
  43. 39. The Maestros of Incident Response
    1. Andrew Louis
      1. Stop the Bleeding
      2. What’s Everyone Doing?
  44. 40. Effortless Incident Management
    1. Suhail Patel, Miles Bryant, and Chris Evans
  45. 41. If You’re Doing Runbooks, Do Them Well
    1. Spike Lindsey
  46. 42. Why I Hate Our Playbooks
    1. Frances Rees
  47. 43. What Machines Do Well
    1. Michelle Brush
  48. 44. Integrating Empathy into SRE Tools
    1. Daniella Niyonkuru
  49. 45. Using ChatOps to Implement Empathy
    1. Daniella Niyonkuru
  50. 46. Move Fast to Unbreak Things
    1. Michelle Brush
  51. 47. You Don’t Know for Sure Until It Runs in Production
    1. Ingrid Epure
  52. 48. Sometimes the Fix Is the Problem
    1. Jake Pittis
  53. 49. Legendary
    1. Elise Gale
  54. 50. Metrics Are Not SLIs (The Measure Everything Trap)
    1. Brian Murphy
  55. 51. When SLOs Attack: Pathological SLOs and How to Fix Them
    1. Narayan Desai
  56. 52. Holistic Approach to Product Reliability
    1. Kristine Chen and Bart Ponurkiewicz
  57. 53. In Search of the Lost Time
    1. Ingrid Epure
  58. 54. Unexpected Lessons from Office Hours
    1. Tamara Miner
  59. 55. Building Tools for Internal Customers that They Actually Want to Use
    1. Vinessa Wan
  60. 56. It’s About the Individuals and Interactions
    1. Vinessa Wan
  61. 57. The Human Baseline in SRE
    1. Effie Mouzeli
  62. 58. Remotely Productive or Productively Remote
    1. Avleen Vig
  63. 59. Of Margins and Individuals
    1. Kurt Andersen
  64. 60. The Importance of Margins in Systems
    1. Kurt Andersen
  65. 61. Fewer Spreadsheets, More Napkins
    1. Jacob Bednarz
  66. 62. Sneaking in Your DevOps Deliciously
    1. Vinessa Wan
  67. 63. Effecting SRE Cultural Changes in Enterprises
    1. Vanessa Yiu
  68. 64. To All the SREs I’ve Loved
    1. Felix Glaser
  69. 65. Complex: The Most Overloaded Word in Technology
    1. Laura Nolan
  70. IV. Ten to Hundred
  71. 66. The Best Advice I Can Give to Teams
    1. Nicole Forsgren
  72. 67. Create Your Supporting Artifacts
    1. Daria Barteneva and Eva Parish
  73. 68. The Order of Operations for Getting SLO Buy-In
    1. David K. Rensin
  74. 69. Heroes Are Necessary, but Hero Culture Is Not
    1. Lei Lopez
  75. 70. On-Call Rotations that People Want to Join
    1. Miles Bryant, Chris Evans, and Suhail Patel
  76. 71. Study of Human Factors and Team Culture to Improve Pager Fatigue
    1. Daria Barteneva
  77. 72. Optimize for MTTBTB (Mean Time to Back to Bed)
    1. Spike Lindsey
  78. 73. Mitigating and Preventing Cascading Failures
    1. Rita Lu
  79. 74. On-Call Health: The Metric You Could Be Measuring
    1. Caitie McCaffrey
  80. 75. Helping Leaders Prioritize On-Call Health
    1. Caitie McCaffrey
      1. Bring Quantitative Data
      2. Link SLAs to On-Call Health
      3. Treat On-Call Health like a Feature
      4. Measure Attrition
  81. 76. The SRE as a Diplomat
    1. Johnny Boursiquot
  82. 77. The Forward-Deployed SRE
    1. Johnny Boursiquot
  83. 78. Test Your Disaster Plan
    1. Tanya Reilly
  84. 79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
    1. Jennifer Petoff
  85. 80. The Power of Uniformity
    1. Chris Evans, Suhail Patel, and Miles Bryant
  86. 81. Bytes per User Value
    1. Arshia Mufti
  87. 82. Make Your Engineering Blog a Priority
    1. Anita Clarke
  88. 83. Don’t Let Anyone Run Code in Your Context
    1. John Looney
  89. 84. Trading Places: SRE and Product
    1. Shubheksha Jalan
  90. 85. You See Teams, I See Product
    1. Avleen Vig
  91. 86. The Performance Emergency Fund
    1. Dawn Parzych
  92. 87. Important but Not Urgent: Roadmaps for SREs
    1. Laura Nolan
  93. V. The Future of SRE
  94. 88. That 50% Thing
    1. Tanya Reilly
  95. 89. Following the Path of Safety-Critical Systems
    1. Heidy Khlaaf
  96. 90. Applicable and Achievable Static Analysis
    1. Heidy Khlaaf
  97. 91. The Importance of Formal Specification
    1. Hillel Wayne
  98. 92. Risk and Rot in Sociotechnical Systems
    1. Laura Nolan
  99. 93. SRE in Crisis
    1. Niall Murphy
  100. 94. Expected Risk Limitations
    1. Blake Bisset
  101. 95. Beyond Local Risk: Accounting for Angry Birds
    1. Blake Bisset
  102. 96. A Word from Software Safety Nerds
    1. J. Paul Reed
  103. 97. Incidents: A Window into Gaps
    1. Lorin Hochstein
  104. 98. The Third Age of SRE
    1. Björn “Beorn” Rabenstein
  105. Contributors
  106. Index
  107. About the Editors

Product information

  • Title: 97 Things Every SRE Should Know
  • Author(s): Emil Stolarsky, Jaime Woo
  • Release date: November 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492081494