Implementing Service Level Objectives

Book description

Although service-level objectives (SLOs) continue to grow in importance, there’s a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up.

Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you’ll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

  • Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
  • Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
  • Use error budgets to help your team have better discussions and make better data-driven decisions
  • Build supportive tooling and resources required for an SLO-based approach
  • Use SLO data to present meaningful reports to leadership and your users

Table of contents

  1. Foreword
  2. Preface
    1. You Don’t Have to Be Perfect
    2. How to Read This Book
    3. Conventions Used in This Book
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. I. SLO Development
  4. 1. The Reliability Stack
    1. Service Truths
    2. The Reliability Stack
      1. Service Level Indicators
      2. Service Level Objectives
      3. Error Budgets
    3. What Is a Service?
      1. Example Services
    4. Things to Keep in Mind
      1. SLOs Are Just Data
      2. SLOs Are a Process, Not a Project
      3. Iterate Over Everything
      4. The World Will Change
      5. It’s All About Humans
    5. Summary
  5. 2. How to Think About Reliability
    1. Reliability Engineering
    2. Past Performance and Your Users
      1. Implied Agreements
      2. Making Agreements
      3. A Worked Example of Reliability
    3. How Reliable Should You Be?
      1. 100% Isn’t Necessary
      2. Reliability Is Expensive
      3. How to Think About Reliability
    4. Summary
  6. 3. Developing Meaningful Service Level Indicators
    1. What Meaningful SLIs Provide
      1. Happier Users
      2. Happier Engineers
      3. A Happier Business
    2. Caring About Many Things
      1. A Request and Response Service
      2. Measuring Many Things by Measuring Only a Few
      3. A Written Example
    3. Something More Complex
      1. Measuring Complex Service User Reliability
      2. Another Written Example
      3. Business Alignment and SLIs
    4. Summary
  7. 4. Choosing Good Service Level Objectives
    1. Reliability Targets
      1. User Happiness
      2. The Problem of Being Too Reliable
      3. The Problem with the Number Nine
      4. The Problem with Too Many SLOs
    2. Service Dependencies and Components
      1. Service Dependencies
      2. Service Components
    3. Reliability for Things You Don’t Own
      1. Open Source or Hosted Services
      2. Measuring Hardware
    4. Choosing Targets
      1. Past Performance
      2. Basic Statistics
      3. Metric Attributes
      4. Percentile Thresholds
      5. What to Do Without a History
    5. Summary
  8. 5. How to Use Error Budgets
    1. Error Budgets in Practice
      1. To Release New Features or Not?
      2. Project Focus
      3. Examining Risk Factors
      4. Experimentation and Chaos Engineering
      5. Load and Stress Tests
      6. Blackhole Exercises
      7. Purposely Burning Budget
      8. Error Budgets for Humans
    2. Error Budget Measurement
      1. Establishing Error Budgets
      2. Decision Making
      3. Error Budget Policies
    3. Summary
  9. II. SLO Implementation
  10. 6. Getting Buy-In
    1. Engineering Is More than Code
    2. Key Stakeholders
      1. Engineering
      2. Product
      3. Operations
      4. QA
      5. Legal
      6. Executive Leadership
    3. Making It So
      1. Order of Operation
      2. Common Objections and How to Overcome Them
      3. Your First Error Budget Policy (and Your First Critical Test)
    4. Lessons Learned the Hard Way
    5. Summary
  11. 7. Measuring SLIs and SLOs
    1. Design Goals
      1. Flexible Targets
      2. Testable Targets
      3. Freshness
      4. Cost
      5. Reliability
      6. Organizational Constraints
    2. Common Machinery
      1. Centralized Time Series Statistics (Metrics)
      2. Structured Event Databases (Logging)
    3. Common Cases
      1. Latency-Sensitive Request Processing
      2. Low-Lag, High-Throughput Batch Processing
      3. Mobile and Web Clients
    4. The General Case
    5. Other Considerations
      1. Integration with Distributed Tracing
      2. SLI and SLO Discoverability
    6. Summary
  12. 8. SLO Monitoring and Alerting
    1. Motivation: What Is SLO Alerting, and Why Should You Do It?
      1. The Shortcomings of Simple Threshold Alerting
      2. A Better Way
    2. How to Do SLO Alerting
      1. Choosing a Target
      2. Error Budgets and Response Time
      3. Error Budget Burn Rate
      4. Rolling Windows
      5. Putting It Together
      6. Troubleshooting with SLO Alerting
      7. Corner Cases
      8. SLO Alerting in a Brownfield Setup
    3. Parting Recommendations
    4. Summary
  13. 9. Probability and Statistics for SLIs and SLOs
    1. On Probability
      1. SLI Example: Availability
      2. SLI Example: Low QPS
    2. On Statistics
      1. Maximum Likelihood Estimation
      2. Maximum a Posteriori
      3. Bayesian Inference
      4. SLI Example: Queueing Latency
      5. Batch Latency
    3. SLI Example: Durability
    4. Further Reading
    5. Summary
  14. 10. Architecting for Reliability
    1. Example System: Image-Serving Service
      1. Architectural Considerations: Hardware
      2. Architectural Considerations: Monolith or Microservices
      3. Architectural Considerations: Anticipating Failure Modes
      4. Architectural Considerations: Three Types of Requests
      5. Systems and Building Blocks
      6. Quantitative Analysis of Systems
      7. Instrumentation! The System Also Needs Instrumentation!
    2. Architectural Considerations: Hardware, Revisited
    3. SLOs as a Result of System SLIs
    4. The Importance of Identifying and Understanding Dependencies
    5. Summary
  15. 11. Data Reliability
    1. Data Services
      1. Designing Data Applications
    2. Users of Data Services
    3. Setting Measurable Data Objectives
      1. Data and Data Application Reliability
      2. Data Properties
      3. Data Application Properties
    4. System Design Concerns
      1. Data Application Failures
      2. Other Qualities
    5. Data Lineage
    6. Summary
  16. 12. A Worked Example
    1. Dogs Deserve Clothes
      1. How a Service Grows
      2. The Design of a Service
    2. SLIs and SLOs as User Journeys
      1. Customers: Finding and Browsing Products
      2. Other Services as Users: Buying Products
      3. Internal Users
      4. Platforms as Services
    3. Summary
  17. III. SLO Culture
  18. 13. Building an SLO Culture
    1. A Culture of No SLOs
    2. Strategies for Shifting Culture
    3. Path to a Culture of SLOs
      1. Getting Buy-in
      2. Prioritizing SLO Work
      3. Implementing Your SLO
      4. What Will Your SLIs Be?
      5. What Will Your SLOs Be?
      6. Using Your SLO
      7. Iterating on Your SLO
      8. Determining When Your SLOs Are Good Enough
      9. Advocating for Others to Use SLOs
    4. Summary
  19. 14. SLO Evolution
    1. SLO Genesis
      1. The First Pass
      2. Listening to Users
      3. Periodic Revisits
    2. Usage Changes
      1. Increased Utilization Changes
      2. Decreased Utilization Changes
      3. Functional Utilization Changes
    3. Dependency Changes
      1. Service Dependency Changes
      2. Platform Changes
      3. Dependency Introduction or Retirement
    4. Failure-Induced Changes
    5. User Expectation and Requirement Changes
      1. User Expectation Changes
      2. User Requirement Changes
    6. Tooling Changes
      1. Measurement Changes
      2. Calculation Changes
    7. Intuition-Based Changes
    8. Setting Aspirational SLOs
    9. Identifying Incorrect SLOs
      1. Listening to Users (Redux)
      2. Paying Attention to Failures
    10. How to Change SLOs
      1. Revisit Schedules
    11. Summary
  20. 15. Discoverable and Understandable SLOs
    1. Understandability
      1. SLO Definition Documents
      2. Phraseology
    2. Discoverability
      1. Document Repositories
      2. Discoverability Tooling
      3. SLO Reports
      4. Dashboards
    3. Summary
  21. 16. SLO Advocacy
    1. Crawl
      1. Do Your Research
      2. Prepare Your Sales Pitch
      3. Create Your Supporting Artifacts
      4. Run Your First Training and Workshop
      5. Implement an SLO Pilot with a Single Service
      6. Spread Your Message
      7. Learn How to Handle Challenges
    2. Walk
      1. Work with Early Adopters to Implement SLOs for More Services
      2. Celebrate Achievements and Build Confidence
      3. Create a Library of Case Studies
      4. Scale Your Training Program by Adding More Trainers
      5. Scale Your Communications
    3. Run
      1. Share Your Library of SLO Case Studies
      2. Create a Community of SLO Experts
      3. Continuously Improve
    4. Summary
  22. 17. Reliability Reporting
    1. Basic Reporting
      1. Counting Incidents
      2. Severity Levels
      3. The Problem with Mean Time to X
      4. SLOs for Basic Reporting
    2. Advanced Reporting
      1. SLO Status
      2. Error Budget Status
    3. Summary
  23. A. SLO Definition Template
    1. SLO Definition: Service Name
    2. Service Overview
    3. SLIs and SLOs
    4. Rationale
    5. Revisit Schedule
    6. Error Budget Policy
    7. External Links
  24. B. Proofs for Chapter 9
    1. Theorem 1
      1. Proof
    2. Theorem 2
      1. Proof
    3. Theorem 3
      1. Proof
    4. Theorem 4
      1. Proof
    5. Theorem 5
      1. Proof
    6. Theorem 6
      1. Proof
    7. Theorem 7
      1. Proof
  25. Index

Product information

  • Title: Implementing Service Level Objectives
  • Author(s): Alex Hidalgo
  • Release date: August 2020
  • Publisher(s): O’Reilly Media, Inc.
  • ISBN: 9781492076810