SLO Adoption and Usage in Site Reliability Engineering

Book description

Site Reliability Engineering (SRE), a framework for managing enterprise software systems first developed at Google, helps lower operational costs, enhance development productivity, and speed up feature releases. But if service-level objectives (SLOs) aren’t part of your SRE strategy, you’re leaving value on the table. This practical report details why and how to make SLOs, service-level indicators (SLIs), and error budgets critical components of your SRE practice.

Drawing on results from Google’s recent SLO Adoption and Usage Survey, along with real-world case studies, this report walks you through defining an acceptable level of reliability and using it to set expectations for stability and better manage system changes. Whether you’re an SRE, executive, developer, or architect, you’ll learn how to improve your SRE practices by using SLOs and error budgets to measure and manage your service.

  • Understand common service-level terminology, including objectives, indicators, agreements, and error budgets
  • Build SLOs and SLIs step by step
  • Use error budgets to align teams and make joint decisions about reliability and development velocity
  • See how Schlumberger and Evernote implemented SLOs and used the insights gained to manage their businesses

Table of contents

  1. Executive Summary
    1. Managing Change with SLOs and Error Budgets
      1. SLOs Solve the Dev/Ops Split
    2. Key Findings from the SLO Adoption and Usage Survey
  2. Preface
    1. What We Hope You Take Away from This Report
    2. Resources Available
    3. Acknowledgments
  3. 1. SLOs: The Magic Behind SRE
    1. Defining SRE Terms for Measuring and Managing Your System
      1. SLIs: How Do We Measure Performance Against Our Goals?
      2. SLOs: What Are Our Goals?
      3. SLAs: What Level of Service Are We Promising Our Customers?
    2. SLOs Are the Driving Force Behind SRE Teams
    3. SLOs Are Powerful Business Tools That Drive Financial and Operational Performance
      1. SLOs and Error Budgets Allow Maximum Change Velocity While Protecting Stability
      2. SLOs Keep Business Decisions Focused on Customer Happiness
      3. SLOs Set Customers’ Expectations
    4. Summary
  4. 2. Summary of the Data
    1. Who Took Our Survey
      1. Geographic Region
      2. Principal Industry
      3. Organization Size
      4. Titles of Survey Respondents
    2. Most Firms Have Had SRE Teams for Fewer Than Three Years
    3. Who Uses SLOs
      1. SLOs Are a New Practice for Many Organizations
      2. Large Companies Have More Experience Using SLOs
      3. Recent SLO Adoption in Europe Outpaces Other Regions
    4. How Organizations Use SLOs
      1. Most Firms Embrace SRE Practices but Fail to Engage in SLOs
      2. Critical Infrastructure Is the Most Common Service Measured by SLOs
      3. Majority of Respondents Measure “Some” of Their Services with SLOs
      4. SLOs Above 99% Are Most Common Among Respondents
      5. Internal Action Is the Most Common Response to Missing SLOs
      6. SLO Reviews Are Underutilized by the Majority of Respondents
    5. Availability Is the Top SLI Measurement
    6. Summary
  5. 3. Selecting SLOs
    1. Do Not Let Perfect Be the Enemy of Good
    2. SLOs: What They Are and Why We Have Them
      1. How Do We Prioritize Reliability Versus Other Features?
      2. Can We Release New Features and Risk Breaking the System Without Significantly Impacting the User’s Experience?
      3. How Do We Weigh Operational Versus Project Work?
    3. Characteristics of Meaningful SLOs
      1. User-Centric
      2. Challenging but Not Too Challenging
      3. Specific yet Simple
      4. Shared Sense of SLO Ownership
    4. Best Practices for SLO Selection
      1. Avoid 100% Targets and Absolutes
      2. Base SLOs on Current Performance if You Have Nothing Else
      3. Group SLOs by User Experience
      4. Develop More Than One Target for Some Services
      5. Give Yourself a Buffer
      6. Have a Plan to Iterate
    5. Summary
  6. 4. Constructing SLIs to Inform SLOs
    1. Defining SLIs
    2. SLIs Are Metrics to Deliver User Happiness
    3. Common SLI Types
      1. Requests and Response
      2. Data Processing
      3. Storage
    4. SLI Structure
      1. Standardize SLIs
      2. Aggregate Measurements
    5. Developing SLIs
      1. SLI Specifications and SLI Implementations
    6. Tracking Reliability with SLIs
      1. Availability
      2. Latency
      3. Quality
    7. Ways to Measure SLIs
    8. Use SLIs to Define SLOs
      1. Achievable SLOs
      2. Aspirational SLOs
    9. Determine a Time Window for Measuring SLOs
    10. SLO Examples for Availability and Latency
    11. Iterating and Improving SLOs
    12. Summary
  7. 5. Using Error Budgets to Manage a Service
    1. The Relationship Between SLOs and Error Budgets
    2. Negotiating Technical Work Versus Development Velocity
    3. Stakeholder Buy-in and Establishing an Error-Budget Policy
    4. Summary
  8. 6. SLO Implementation Case Studies
    1. Schlumberger’s SLO Journey
      1. Why Schlumberger Implemented SRE
      2. Implementing the First SLOs for a Nonnative Cloud Application
      3. Establishing and Evolving SLOs for Products That Are Not Yet Live
      4. Example: How Having SLOs Led to a Better Customer Experience
      5. Monitoring and Alerting
      6. Evangelizing SRE and SLOs
      7. What’s Next
    2. Evernote’s SLO Journey
      1. Start at the Beginning
      2. Evernote Today: Transitioning to a Shared Responsibility Model
      3. Reducing Inefficiencies for a Progressive Environment
      4. Taking the Mystery Out of Resource Allocation
      5. Quantifying the Impact of Outages on Users
      6. Where Evernote Is Today
    3. Summary
  9. 7. Conclusion

Product information

  • Title: SLO Adoption and Usage in Site Reliability Engineering
  • Author(s): Julie McCoy, Nicole Forsgren
  • Release date: April 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492075363