Becoming SRE

Book description

Do you wish the existing books on site reliability engineering started at the beginning? Do you wish someone would walk you through how to become an SRE, how to think like an SRE, or how to build and grow a successful SRE function in your organization?

Becoming SRE addresses all of these needs and more with three interconnected sections: the essential groundwork for understanding SRE and SRE culture, advice for individuals on becoming an SRE, and guidance for organizations on creating and developing a thriving SRE practice.

Acting as your personal and personable guide, author David Blank-Edelman takes you through subjects like:

  • SRE mindset, SRE culture, and SRE advocacy
  • What you need to get started and hired in SRE and what the job will be like when you get there
  • What you need to bring SRE into an organization and what is required for a good organizational fit so it can thrive there
  • How to work with your business folks and management around SRE
  • How SRE can grow and mature in an organization over time

Ready to become an SRE or introduce SRE into your organization? This book is here to help.

Table of contents

  1. Preface
    1. Where Are You Right Now?
    2. Navigating This Book
    3. We Are Going to Need a Bigger Boat
    4. I’m Not the Lorax
    5. Ready?
    6. Conventions Used in This Book
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
    10. Coping
  2. I. Introduction to SRE
  3. 1. First Things First
    1. What Is SRE?
      1. Reliability
      2. Appropriate
      3. Sustainable
      4. (Other Words)
    2. Origin Story
    3. SRE and Its Relationship to DevOps
      1. Part 1: SRE Implements Class DevOps
      2. Part 2: SRE Is to Reliability as DevOps Is to Delivery
      3. Part 3: It’s All About the Direction of Attention
    4. Onward to SRE Fundamentals
  4. 2. SRE Mindset
    1. Zooming Out to Maintain a Systems Perspective
    2. Creating and Nurturing Feedback Loops
    3. Keeping the Focus on the Customer
    4. Relationships (to People and Things)
      1. SRE’s Relationship to (Other) People
      2. SRE’s Relationship to Failure and Errors
    5. The Mindset in Motion
  5. 3. SRE Culture
    1. Happy Fish, um, People
    2. How to Create a Supportive Culture for SRE
      1. Culture as a Vehicle or a Lever
      2. What Do You Want SRE to Be/Do?
      3. Thinking About Assembling the Culture You Want and Need
      4. I Still Don’t Know Where to Start
      5. Nurturing Your Nascent SRE Culture
      6. Keep On Keeping On
  6. 4. Talking About SRE (SRE Advocacy)
    1. Why It Matters, Even Early in Your Experience with SRE
    2. When It Matters
    3. Get Your Story (and Audience) Straight
      1. Some Story Ideas
      2. Other People’s Stories
      3. Secondary Stories
      4. The Challenges the Stories Present
    4. One Last Tip
  7. II. Becoming SRE for the Individual
  8. 5. Preparing to Become an SRE
    1. Do You Need to Know How to Code?
    2. Do You Need a Computer Science Degree?
    3. Fundamentals
      1. Single/Basic Systems (and Their Failure Modes)
      2. Distributed Systems (and Their Failure Modes)
    4. Statistics and Data Visualization
    5. Storytelling
      1. Be a Good Person
    6. Bonus Round
      1. Non-Abstract Large System Design (NALSD)
      2. Resilience Engineering
      3. Chaos Engineering and Performance Engineering
      4. Machine Learning and Artificial Intelligence
    7. What Else?
  9. 6. Getting to SRE from…
    1. Are You Already an SRE?
    2. From Student to SRE
    3. From Dev/SWE to SRE
    4. From Sysadmin/IT to SRE
    5. Generic Advice
      1. Technical Role X to SRE
      2. Nontechnical Role X to SRE
      3. Track Your Progress to Keep On Keeping On
  10. 7. Hints for Getting Hired as an SRE
    1. Scrutinizing the Job Posting
      1. Preparing for an SRE Interview
      2. What to Ask at the SRE Interview
      3. Win!
  11. 8. A Day in the Life of an SRE
    1. Modes of an SRE’s Day
      1. Incident/Outage Mode
      2. Postincident Learning Mode
      3. Builder/Project/Learn Mode
      4. Architecture Mode
      5. Management Mode
      6. Planning Mode
      7. Collaboration Mode
      8. Recovery and Self-Care Mode
    2. Balance
    3. Make a Day in the Life a Good Day
  12. 9. Establishing a Relationship to Toil
    1. Defining Toil with More Precision
    2. Whose Toil Are We Talking About?
    3. Why Do SREs Care About Toil?
    4. The Dynamics of Toil: Early Versus Established
    5. Dealing with Toil
      1. Intermediate to Advanced Toil Reduction
      2. What Are You Going to Do About It?
  13. 10. Learning from Failure
    1. Talking About Failure
    2. Postincident Reviews
      1. Postincident Reviews: The Basics
      2. Postincident Reviews: The Process
      3. Postincident Reviews: Common Traps
    3. Learning from Failure Through Resilience Engineering
    4. Learning from Failure via Chaos Engineering
    5. Learning from Failure: Next Steps
  14. III. Becoming SRE for the Organization
  15. 11. Organizational Factors for Success
    1. Contributing Factor 1: What’s the Problem?
    2. Contributing Factor 2: What Is the Org Willing to Do to Get There?
    3. Contributing Factor 3: Does the Org Have the Requisite Patience?
    4. Contributing Factor 4: Can We Collaborate?
    5. Contributing Factor 5: Does the Org Make Decisions Based on Data?
    6. Contributing Factor 6: Can the Org Learn and Act on What It Learns?
    7. Contributing Factor 7: Can You Make a Difference?
    8. Contributing Factor 8: Can You See (and Address) the Friction in the System?
    9. The Fine Print
    10. It’s All About Organizational Values
  16. 12. How SRE Can Fail
    1. Contributing Factor 1: Title Flipping to Create SREs
    2. Contributing Factor 2: Converting Tier 3 Support to SRE
    3. Contributing Factor 3: On Call and That’s All
    4. Contributing Factor 4: Wrong Org Chart
    5. Contributing Factor 5: SRE by Rote
    6. Contributing Factor 6: Gatekeeping
    7. Contributing Factor 7: Death Through Success
    8. Contributing Factor 8: A Collection of Smaller Factors
    9. How to “SRE” Your SRE Failure
  17. 13. SRE from a Business Perspective
    1. Communicating About SRE
      1. Talking to the Business About Reliability
      2. Selling SRE
      3. Communicating Success Back to the Business
      4. Proving the Success of an SRE Group to Others
    2. Budgeting for SRE
      1. First Budget Request
      2. Talking About Funding
      3. Re-Up Conversations
      4. Funding Models
    3. SRE Alignment
      1. Models for Engagement
      2. Why Not the Embedded Model? Why a Separate Org?
      3. Avoiding the Pager Monkey or Toil Bucket Traps
    4. SRE Teams
      1. Choosing Headcount Sizes
      2. How Do You Know When an SRE Team Might Be in Trouble?
      3. Alert Noise as a Signal of Team Health
      4. SRE Promotions
      5. Turning Teams Down
    5. From the Author: I Would Like to Hear from You
  18. 14. The Dickerson Hierarchy of Reliability (A Good Place to Start)
    1. The Dickerson Hierarchy of Reliability
      1. Level 1: Monitoring/Observability
      2. Level 2: Incident Response
      3. Level 3: Postincident Review
      4. Level 4: Testing/Release (Deployment)
      5. Level 5: Provisioning/Capacity Planning
      6. Levels 6 and 7: Development Process and Product Design
    2. Wrong Turns
      1. You Know You’ve Taken a Wrong Turn When…
    3. Positive Signs
  19. 15. Fitting SRE into Your Organization
    1. Pre-role and Pre-team Practices
    2. Integration Models
      1. Centralized/Partnered Model
      2. Distributed/Embedded Model
      3. Hybrid Model
      4. How to Choose Between These Models
    3. Creating and Nurturing the Right Feedback Loops
      1. Feedback Loops and Data
      2. Feedback Loops and Iteration
      3. Feedback Loops and Planning for Iteration
      4. How and Where to Insert These Feedback Loops into the Organization
    4. Signs of Success
  20. 16. SRE Organizational Evolutionary Stages
    1. Stage 1: The Firefighter
    2. Stage 2: The Gatekeeper
    3. Stage 3: The Advocate
    4. Stage 4: The Partner
    5. Stage 5: The Engineer
    6. Caveat Implementer
  21. 17. Growing SRE in Your Org
    1. How Do You Know When to Scale?
    2. Scaling 0 to 1
    3. Scaling 1 to 6
    4. Scaling 6 to 18
    5. Scaling 18 to 48
    6. Scaling 48 to 108 (and Beyond)
    7. Growing SRE’s Leadership Representation
  22. 18. Conclusion
  23. A. Letters to a Young SRE (Apologies to Rilke)
    1. John Amori
    2. Fred Hebert
    3. Aju Tamang
    4. Daniel Gentleman
    5. Joanna Wijntjes
    6. Fabrizio Waldner
    7. Graham Poulter
    8. Jamie Wilkinson
    9. Andrew Howden
    10. Pedro Alves
    11. Balasundaram N
    12. Eduardo Spotti
    13. Ian Bartholomew
    14. Olivier Duquesne
    15. Ralph Pritchard
    16. David Caudill
    17. Alex Hidalgo
    18. Effie Mouzeli
  24. B. Advice from Former SREs
    1. Dina Levitan
    2. Sara Smollett
    3. Andrew Fong
    4. Scott MacFiggen
  25. C. SRE Resources
    1. Core Books
    2. “SRE and…” Books
    3. Events
      1. SREcon
      2. Vendor SRE Single-Day Events
      3. DevOps Event Tracks/Sessions
      4. SRE-Adjacent Niche Events
    4. SRE Video Content
    5. SRE-Specific Podcasts
    6. SRE-Specific Email Newsletters
    7. Online Forums
    8. Historical Document
    9. Curated Link Collections
  26. Index
  27. About the Author

Product information

  • Title: Becoming SRE
  • Author(s): David N. Blank-Edelman
  • Release date: February 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492090557