Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations

Book description

Improve Your Service Scalability and Reliability with SRE

“The techniques and principles of SRE are not only clearly defined here, but also the rationale behind them is explained in a way that will stick. This is not some dry definition, this is practical, usable understanding. . . . I can whole-heartedly recommend this book without any reservation. This is a very good book on an important topic that helps to move the game forward for our discipline!”

From the Foreword by David Farley, Founder and CEO of Continuous Delivery Ltd.

Pioneered by Google to create more scalable and reliable large-scale systems, Site Reliability Engineering (SRE) has become one of today’s most valuable software innovation opportunities. Establishing SRE Foundations is a concise, practical guide that shows how to drive successful SRE adoption in your own organization. Dr. Vladyslav Ukis presents a step-by-step approach to establishing the right cultural, organizational, and technical process foundations, quickly achieving a “minimum viable SRE” and continually improving from there.

Dr. Ukis draws extensively on his own experiences leading an SRE transformation journey at a major healthcare company. Throughout, he answers specific questions that organizations ask about SRE, identifies pitfalls, and shows how to avoid or overcome them. Whatever your role in software development, engineering, or operations, this guide will help you apply SRE to improve what matters most: user and customer experience.

  • Understand how SRE works, its role in software operations, and the challenges of SRE transformation

  • Assess your organization’s current operations and readiness for SRE transformation

  • Achieve organizational buy-in and initiate foundational activities, including SLO definitions, alerting, on-call rotations, incident response, and error budget–based decision-making

  • Align organizational structures to support a full SRE transformation

  • Measure the progress and success of your SRE initiative

  • Sustain and advance your SRE transformation beyond the foundations

Table of contents

  1. Cover
  2. Title Page
  3. Contents
  4. Table of Contents
  5. Foreword
  6. Preface
  7. Acknowledgments
  8. About the Author
  9. Part I: Foundations
    1. Chapter 1. Introduction to SRE
      1. 1.1 Why SRE?
      2. 1.2 Alignment Using SRE
      3. 1.3 Why Does SRE Work?
      4. 1.4 Summary
    2. Chapter 2. The Challenge
      1. 2.1 Misalignment
      2. 2.2 Collective Ownership
      3. 2.3 Ownership Using SRE
      4. 2.4 The Challenge Statement
      5. 2.5 Coaching
      6. 2.6 Summary
    3. Chapter 3. SRE Basic Concepts
      1. 3.1 Service Level Indicators
      2. 3.2 Service Level Objectives
      3. 3.3 Error Budgets
      4. 3.4 Error Budget Policies
      5. 3.5 SRE Concept Pyramid
      6. 3.6 Alignment Using the SRE Concept Pyramid
      7. 3.7 Summary
    4. Chapter 4. Assessing the Status Quo
      1. 4.1 Where Is the Organization?
      2. 4.2 Where Are the People?
      3. 4.3 Where Is the Tech?
      4. 4.4 Where Is the Culture?
      5. 4.5 Where Is the Process?
      6. 4.6 SRE Maturity Model
      7. 4.7 Posing Hypotheses
      8. 4.8 Summary
  10. Part II: Running the Transformation
    1. Chapter 5. Achieving Organizational Buy-In
      1. 5.1 Getting People Behind SRE
      2. 5.2 SRE Marketing Funnel
      3. 5.3 SRE Coaches
      4. 5.4 Top-Down Buy-In
      5. 5.5 Bottom-Up Buy-In
      6. 5.6 Lateral Buy-In
      7. 5.7 Buy-In Staggering
      8. 5.8 Team Coaching
      9. 5.9 Traversing the Organization
      10. 5.10 Organizational Coaching
      11. 5.11 Summary
    2. Chapter 6. Laying Down the Foundations
      1. 6.1 Introductory Talks by Team
      2. 6.2 Conveying the Basics
      3. 6.3 SLI Standardization
      4. 6.4 Enabling Logging
      5. 6.5 Teaching the Log Query Language
      6. 6.6 Defining Initial SLOs
      7. 6.7 Default SLOs
      8. 6.8 Providing Basic Infrastructure
      9. 6.9 Engaging Champions
      10. 6.10 Dealing with Detractors
      11. 6.11 Creating Documentation
      12. 6.12 Broadcast Success
      13. 6.13 Summary
    3. Chapter 7. Reacting to Alerts on SLO Breaches
      1. 7.1 Environment Selection
      2. 7.2 Responsibilities
      3. 7.3 Ways of Working
      4. 7.4 Setting Up On-Call Rotations
      5. 7.5 On-Call Management Tools
      6. 7.6 Out-of-Hours On-Call
      7. 7.7 Systematic Knowledge Sharing
      8. 7.8 Broadcast Success
      9. 7.9 Summary
    4. Chapter 8. Implementing Alert Dispatching
      1. 8.1 Alert Escalation
      2. 8.2 Defining an Alert Escalation Policy
      3. 8.3 Defining Stakeholder Groups
      4. 8.4 Triggering Stakeholder Notifications
      5. 8.5 Defining Stakeholder Rings
      6. 8.6 Defining Effective Stakeholder Notifications
      7. 8.7 Getting the Stakeholders Subscribed
      8. 8.8 Broadcast Success
      9. 8.9 Summary
    5. Chapter 9. Implementing Incident Response
      1. 9.1 Incident Response Foundations
      2. 9.2 Incident Priorities
      3. 9.3 Complex Incident Coordination
      4. 9.4 Incident Postmortems
      5. 9.5 Effective Postmortem Criteria
      6. 9.6 Mashing Up the Tools
      7. 9.7 Service Status Broadcast
      8. 9.8 Documenting the Incident Response Process
      9. 9.9 Broadcast Success
      10. 9.10 Summary
    6. Chapter 10. Setting Up an Error Budget Policy
      1. 10.1 Motivation
      2. 10.2 Terminology
      3. 10.3 Error Budget Policy Structure
      4. 10.4 Error Budget Policy Conditions
      5. 10.5 Error Budget Policy Consequences
      6. 10.6 Error Budget Policy Governance
      7. 10.7 Extending the Error Budget Policy
      8. 10.8 Agreeing to the Error Budget Policy
      9. 10.9 Storing the Error Budget Policy
      10. 10.10 Enacting the Error Budget Policy
      11. 10.11 Reviewing the Error Budget Policy
      12. 10.12 Related Concepts
      13. 10.13 Summary
    7. Chapter 11. Enabling Error Budget–Based Decision-Making
      1. 11.1 Reliability Decision-Making Taxonomy
      2. 11.2 Implementing SRE Indicators
      3. 11.3 Dimensions of SRE Indicators
      4. 11.4 Process Indicators, Not People KPIs
      5. 11.5 Decisions Versus Indicators
      6. 11.6 Decision-Making Workflows
      7. 11.7 Summary
    8. Chapter 12. Implementing Organizational Structure
      1. 12.1 SRE Principles Versus Organizational Structure
      2. 12.2 Who Builds It, Who Runs It?
      3. 12.3 Cost Optimization
      4. 12.4 Team Topologies
      5. 12.5 Choosing a Model
      6. 12.6 A New Role: SRE
      7. 12.7 SRE Career Path
      8. 12.8 Communicating the Chosen Model
      9. 12.9 Introducing the Chosen Model
      10. 12.10 Summary
  11. Part III: Measuring and Sustaining the Transformation
    1. Chapter 13. Measuring the SRE Transformation
      1. 13.1 Testing Transformation Hypotheses
      2. 13.2 Outages Not Detected Internally
      3. 13.3 Services Exhausting Error Budgets Prematurely
      4. 13.4 Executives’ Perceptions
      5. 13.5 Reliability Perception by Users and Partners
      6. 13.6 Summary
    2. Chapter 14. Sustaining the SRE Movement
      1. 14.1 Maturing the SRE CoP
      2. 14.2 SRE Minutes
      3. 14.3 Availability Newsletter
      4. 14.4 SRE Column in the Engineering Blog
      5. 14.5 Promote Long-Form SRE Wiki Articles
      6. 14.6 SRE Broadcasting
      7. 14.7 Combining SRE and CD Indicators
      8. 14.8 SRE Feedback Loops
      9. 14.9 New Hypotheses
      10. 14.10 Providing Learning Opportunities
      11. 14.11 Supporting SRE Coaches
      12. 14.12 Summary
    3. Chapter 15. The Road Ahead
      1. 15.1 Service Catalog
      2. 15.2 SLAs
      3. 15.3 Regulatory Compliance
      4. 15.4 SRE Infrastructure
      5. 15.5 Game Days
  12. Appendix: Topics for Quick Reference
    1. SRE Wiki Content
    2. Runbook Template Content
    3. Incident Response Process Content
    4. Postmortem Lifecycle
    5. Operations Teams’ Responsibilities
    6. SRE Online Communities
    7. SRE Newsletters
    8. SRE Conferences
    9. SRE Indicators
    10. Decision-Making Workflows

Product information

  • Title: Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations
  • Author(s): Vladyslav Ukis
  • Release date: September 2022
  • Publisher(s): Addison-Wesley Professional
  • ISBN: 9780137424887