The Site Reliability Workbook

Book Description

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You’ll learn:

  • How to run reliable services in environments you don’t completely control, such as the cloud
  • Practical applications of how to create, monitor, and run your services via Service Level Objectives
  • How to convert existing ops teams to SRE—including how to dig out of operational overload
  • Methods for starting SRE from either greenfield or brownfield

Table of Contents

  1. Foreword I
  2. Foreword II
  3. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Safari
    4. How to Contact Us
    5. Acknowledgments
  4. 1. How SRE Relates to DevOps
    1. Background on DevOps
      1. No More Silos
      2. Accidents Are Normal
      3. Change Should Be Gradual
      4. Tooling and Culture Are Interrelated
      5. Measurement Is Crucial
    2. Background on SRE
      1. Operations Is a Software Problem
      2. Manage by Service Level Objectives (SLOs)
      3. Work to Minimize Toil
      4. Automate This Year’s Job Away
      5. Move Fast by Reducing the Cost of Failure
      6. Share Ownership with Developers
      7. Use the Same Tooling, Regardless of Function or Job Title
    3. Compare and Contrast
    4. Organizational Context and Fostering Successful Adoption
      1. Narrow, Rigid Incentives Narrow Your Success
      2. It’s Better to Fix It Yourself; Don’t Blame Someone Else
      3. Consider Reliability Work as a Specialized Role
      4. When Can Substitute for Whether
      5. Strive for Parity of Esteem: Career and Financial
    5. Conclusion
  5. I. Foundations
  6. 2. Implementing SLOs
    1. Why SREs Need SLOs
    2. Getting Started
      1. Reliability Targets and Error Budgets
      2. What to Measure: Using SLIs
    3. A Worked Example
      1. Moving from SLI Specification to SLI Implementation
      2. Measuring the SLIs
      3. Using the SLIs to Calculate Starter SLOs
    4. Choosing an Appropriate Time Window
    5. Getting Stakeholder Agreement
      1. Establishing an Error Budget Policy
      2. Documenting the SLO and Error Budget Policy
      3. Dashboards and Reports
    6. Continuous Improvement of SLO Targets
      1. Improving the Quality of Your SLO
    7. Decision Making Using SLOs and Error Budgets
    8. Advanced Topics
      1. Modeling User Journeys
      2. Grading Interaction Importance
      3. Modeling Dependencies
      4. Experimenting with Relaxing Your SLOs
    9. Conclusion
  7. 3. SLO Engineering Case Studies
    1. Evernote’s SLO Story
      1. Why Did Evernote Adopt the SRE Model?
      2. Introduction of SLOs: A Journey in Progress
      3. Breaking Down the SLO Wall Between Customer and Cloud Provider
      4. Current State
    2. The Home Depot’s SLO Story
      1. The SLO Culture Project
      2. Our First Set of SLOs
      3. Evangelizing SLOs
      4. Automating VALET Data Collection
      5. The Proliferation of SLOs
      6. Applying VALET to Batch Applications
      7. Using VALET in Testing
      8. Future Aspirations
      9. Summary
    3. Conclusion
  8. 4. Monitoring
    1. Desirable Features of a Monitoring Strategy
      1. Speed
      2. Calculations
      3. Interfaces
      4. Alerts
    2. Sources of Monitoring Data
      1. Examples
    3. Managing Your Monitoring System
      1. Treat Your Configuration as Code
      2. Encourage Consistency
      3. Prefer Loose Coupling
    4. Metrics with Purpose
      1. Intended Changes
      2. Dependencies
      3. Saturation
      4. Status of Served Traffic
      5. Implementing Purposeful Metrics
    5. Testing Alerting Logic
    6. Conclusion
  9. 5. Alerting on SLOs
    1. Alerting Considerations
    2. Ways to Alert on Significant Events
      1. 1: Target Error Rate ≥ SLO Threshold
      2. 2: Increased Alert Window
      3. 3: Incrementing Alert Duration
      4. 4: Alert on Burn Rate
      5. 5: Multiple Burn Rate Alerts
      6. 6: Multiwindow, Multi-Burn-Rate Alerts
    3. Low-Traffic Services and Error Budget Alerting
      1. Generating Artificial Traffic
      2. Combining Services
      3. Making Service and Infrastructure Changes
      4. Lowering the SLO or Increasing the Window
    4. Extreme Availability Goals
    5. Alerting at Scale
    6. Conclusion
  10. 6. Eliminating Toil
    1. What Is Toil?
    2. Measuring Toil
    3. Toil Taxonomy
      1. Business Processes
      2. Production Interrupts
      3. Release Shepherding
      4. Migrations
      5. Cost Engineering and Capacity Planning
      6. Troubleshooting for Opaque Architectures
    4. Toil Management Strategies
      1. Identify and Measure Toil
      2. Engineer Toil Out of the System
      3. Reject the Toil
      4. Use SLOs to Reduce Toil
      5. Start with Human-Backed Interfaces
      6. Provide Self-Service Methods
      7. Get Support from Management and Colleagues
      8. Promote Toil Reduction as a Feature
      9. Start Small and Then Improve
      10. Increase Uniformity
      11. Assess Risk Within Automation
      12. Automate Toil Response
      13. Use Open Source and Third-Party Tools
      14. Use Feedback to Improve
    5. Case Studies
    6. Case Study 1: Reducing Toil in the Datacenter with Automation
      1. Background
      2. Problem Statement
      3. What We Decided to Do
      4. Design First Effort: Saturn Line-Card Repair
      5. Implementation
      6. Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
      7. Implementation
      8. Lessons Learned
    7. Case Study 2: Decommissioning Filer-Backed Home Directories
      1. Background
      2. Problem Statement
      3. What We Decided to Do
      4. Design and Implementation
      5. Key Components
      6. Lessons Learned
    8. Conclusion
  11. 7. Simplicity
    1. Measuring Complexity
    2. Simplicity Is End-to-End, and SREs Are Good for That
      1. Case Study 1: End-to-End API Simplicity
      2. Case Study 2: Project Lifecycle Complexity
    3. Regaining Simplicity
      1. Case Study 3: Simplification of the Display Ads Spiderweb
      2. Case Study 4: Running Hundreds of Microservices on a Shared Platform
      3. Case Study 5: pDNS No Longer Depends on Itself
    4. Conclusion
  12. II. Practices
  13. 8. On-Call
    1. Recap of “Being On-Call” Chapter of First SRE Book
    2. Example On-Call Setups Within Google and Outside Google
      1. Google: Forming a New Team
      2. Evernote: Finding Our Feet in the Cloud
    3. Practical Implementation Details
      1. Anatomy of Pager Load
      2. On-Call Flexibility
      3. On-Call Team Dynamics
    4. Conclusion
  14. 9. Incident Response
    1. Incident Management at Google
      1. Incident Command System
      2. Main Roles in Incident Response
    2. Case Studies
      1. Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
      2. Case Study 2: Service Fault—Cache Me If You Can
      3. Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does
      4. Case Study 4: Incident Response at PagerDuty
    3. Putting Best Practices into Practice
      1. Incident Response Training
      2. Prepare Beforehand
      3. Drills
    4. Conclusion
  15. 10. Postmortem Culture: Learning from Failure
    1. Case Study
    2. Bad Postmortem
      1. Why Is This Postmortem Bad?
    3. Good Postmortem
      1. Why Is This Postmortem Better?
    4. Organizational Incentives
      1. Model and Enforce Blameless Behavior
      2. Reward Postmortem Outcomes
      3. Share Postmortems Openly
      4. Respond to Postmortem Culture Failures
    5. Tools and Templates
      1. Postmortem Templates
      2. Postmortem Tooling
    6. Conclusion
  16. 11. Managing Load
    1. Google Cloud Load Balancing
      1. Anycast
      2. Maglev
      3. Global Software Load Balancer
      4. Google Front End
      5. GCLB: Low Latency
      6. GCLB: High Availability
      7. Case Study 1: Pokémon GO on GCLB
    2. Autoscaling
      1. Handling Unhealthy Machines
      2. Working with Stateful Systems
      3. Configuring Conservatively
      4. Setting Constraints
      5. Including Kill Switches and Manual Overrides
      6. Avoiding Overloading Backends
      7. Avoiding Traffic Imbalance
    3. Combining Strategies to Manage Load
      1. Case Study 2: When Load Shedding Attacks
    4. Conclusion
  17. 12. Introducing Non-Abstract Large System Design
    1. What Is NALSD?
    2. Why “Non-Abstract”?
    3. AdWords Example
      1. Design Process
      2. Initial Requirements
      3. One Machine
      4. Distributed System
    4. Conclusion
  18. 13. Data Processing Pipelines
    1. Pipeline Applications
      1. Event Processing/Data Transformation to Order or Structure Data
      2. Data Analytics
      3. Machine Learning
    2. Pipeline Best Practices
      1. Define and Measure Service Level Objectives
      2. Plan for Dependency Failure
      3. Create and Maintain Pipeline Documentation
      4. Map Your Development Lifecycle
      5. Reduce Hotspotting and Workload Patterns
      6. Implement Autoscaling and Resource Planning
      7. Adhere to Access Control and Security Policies
      8. Plan Escalation Paths
    3. Pipeline Requirements and Design
      1. What Features Do You Need?
      2. Idempotent and Two-Phase Mutations
      3. Checkpointing
      4. Code Patterns
      5. Pipeline Production Readiness
    4. Pipeline Failures: Prevention and Response
      1. Potential Failure Modes
      2. Potential Causes
    5. Case Study: Spotify
      1. Event Delivery
      2. Event Delivery System Design and Architecture
      3. Event Delivery System Operation
      4. Customer Integration and Support
      5. Summary
    6. Conclusion
  19. 14. Configuration Design and Best Practices
    1. What Is Configuration?
      1. Configuration and Reliability
      2. Separating Philosophy and Mechanics
    2. Configuration Philosophy
      1. Configuration Asks Users Questions
      2. Questions Should Be Close to User Goals
      3. Mandatory and Optional Questions
      4. Escaping Simplicity
    3. Mechanics of Configuration
      1. Separate Configuration and Resulting Data
      2. Importance of Tooling
      3. Ownership and Change Tracking
      4. Safe Configuration Change Application
    4. Conclusion
  20. 15. Configuration Specifics
    1. Configuration-Induced Toil
    2. Reducing Configuration-Induced Toil
    3. Critical Properties and Pitfalls of Configuration Systems
      1. Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
      2. Pitfall 2: Designing Accidental or Ad Hoc Language Features
      3. Pitfall 3: Building Too Much Domain-Specific Optimization
      4. Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
      5. Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
    4. Integrating a Configuration Language
      1. Generating Config in Specific Formats
      2. Driving Multiple Applications
    5. Integrating an Existing Application: Kubernetes
      1. What Kubernetes Provides
      2. Example Kubernetes Config
      3. Integrating the Configuration Language
    6. Integrating Custom Applications (In-House Software)
    7. Effectively Operating a Configuration System
      1. Versioning
      2. Source Control
      3. Tooling
      4. Testing
    8. When to Evaluate Configuration
      1. Very Early: Checking in the JSON
      2. Middle of the Road: Evaluate at Build Time
      3. Late: Evaluate at Runtime
    9. Guarding Against Abusive Configuration
    10. Conclusion
  21. 16. Canarying Releases
    1. Release Engineering Principles
    2. Balancing Release Velocity and Reliability
    3. What Is Canarying?
    4. Release Engineering and Canarying
      1. Requirements of a Canary Process
      2. Our Example Setup
    5. A Roll Forward Deployment Versus a Simple Canary Deployment
    6. Canary Implementation
      1. Minimizing Risk to SLOs and the Error Budget
      2. Choosing a Canary Population and Duration
    7. Selecting and Evaluating Metrics
      1. Metrics Should Indicate Problems
      2. Metrics Should Be Representative and Attributable
      3. Before/After Evaluation Is Risky
      4. Use a Gradual Canary for Better Metric Selection
    8. Dependencies and Isolation
    9. Canarying in Noninteractive Systems
    10. Requirements on Monitoring Data
    11. Related Concepts
      1. Blue/Green Deployment
      2. Artificial Load Generation
      3. Traffic Teeing
    12. Conclusion
  22. III. Processes
  23. 17. Identifying and Recovering from Overload
    1. From Load to Overload
    2. Case Study 1: Work Overload When Half a Team Leaves
      1. Background
      2. Problem Statement
      3. What We Decided to Do
      4. Implementation
      5. Lessons Learned
    3. Case Study 2: Perceived Overload After Organizational and Workload Changes
      1. Background
      2. Problem Statement
      3. What We Decided to Do
      4. Implementation
      5. Effects
      6. Lessons Learned
    4. Strategies for Mitigating Overload
      1. Recognizing the Symptoms of Overload
      2. Reducing Overload and Restoring Team Health
    5. Conclusion
  24. 18. SRE Engagement Model
    1. The Service Lifecycle
      1. Phase 1: Architecture and Design
      2. Phase 2: Active Development
      3. Phase 3: Limited Availability
      4. Phase 4: General Availability
      5. Phase 5: Deprecation
      6. Phase 6: Abandoned
      7. Phase 7: Unsupported
    2. Setting Up the Relationship
      1. Communicating Business and Production Priorities
      2. Identifying Risks
      3. Aligning Goals
      4. Setting Ground Rules
      5. Planning and Executing
    3. Sustaining an Effective Ongoing Relationship
      1. Investing Time in Working Better Together
      2. Maintaining an Open Line of Communication
      3. Performing Regular Service Reviews
      4. Reassessing When Ground Rules Start to Slip
      5. Adjusting Priorities According to Your SLOs and Error Budget
      6. Handling Mistakes Appropriately
    4. Scaling SRE to Larger Environments
      1. Supporting Multiple Services with a Single SRE Team
      2. Structuring a Multiple SRE Team Environment
      3. Adapting SRE Team Structures to Changing Circumstances
      4. Running Cohesive Distributed SRE Teams
    5. Ending the Relationship
      1. Case Study 1: Ares
      2. Case Study 2: Data Analysis Pipeline
    6. Conclusion
  25. 19. SRE: Reaching Beyond Your Walls
    1. Truths We Hold to Be Self-Evident
      1. Reliability Is the Most Important Feature
      2. Your Users, Not Your Monitoring, Decide Your Reliability
      3. If You Run a Platform, Then Reliability Is a Partnership
      4. Everything Important Eventually Becomes a Platform
      5. When Your Customers Have a Hard Time, You Have to Slow Down
      6. You Will Need to Practice SRE with Your Customers
    2. How to: SRE with Your Customers
      1. Step 1: SLOs and SLIs Are How You Speak
      2. Step 2: Audit the Monitoring and Build Shared Dashboards
      3. Step 3: Measure and Renegotiate
      4. Step 4: Design Reviews and Risk Analysis
      5. Step 5: Practice, Practice, Practice
      6. Be Thoughtful and Disciplined
    3. Conclusion
  26. 20. SRE Team Lifecycles
    1. SRE Practices Without SREs
    2. Starting an SRE Role
      1. Finding Your First SRE
      2. Placing Your First SRE
      3. Bootstrapping Your First SRE
      4. Distributed SREs
    3. Your First SRE Team
      1. Forming
      2. Storming
      3. Norming
      4. Performing
    4. Making More SRE Teams
      1. Service Complexity
      2. SRE Rollout
      3. Geographical Splits
    5. Suggested Practices for Running Many Teams
      1. Mission Control
      2. SRE Exchange
      3. Training
      4. Horizontal Projects
      5. SRE Mobility
      6. Travel
      7. Launch Coordination Engineering Teams
      8. Production Excellence
      9. SRE Funding and Hiring
    6. Conclusion
  27. 21. Organizational Change Management in SRE
    1. SRE Embraces Change
    2. Introduction to Change Management
      1. Lewin’s Three-Stage Model
      2. McKinsey’s 7-S Model
      3. Kotter’s Eight-Step Process for Leading Change
      4. The Prosci ADKAR Model
      5. Emotion-Based Models
      6. The Deming Cycle
      7. How These Theories Apply to SRE
    3. Case Study 1: Scaling Waze—From Ad Hoc to Planned Change
      1. Background
      2. The Messaging Queue: Replacing a System While Maintaining Reliability
      3. The Next Cycle of Change: Improving the Deployment Process
      4. Lessons Learned
    4. Case Study 2: Common Tooling Adoption in SRE
      1. Background
      2. Problem Statement
      3. What We Decided to Do
      4. Design
      5. Implementation: Monitoring
      6. Lessons Learned
    5. Conclusion
  28. Conclusion
    1. Onward…
    2. The Future Belongs to the Past
    3. SRE + <Insert Other Discipline>
    4. Trickles, Streams, and Floods
    5. SRE Belongs to All of Us
    6. On Gratitude
  29. A. Example SLO Document
    1. Service Overview
    2. SLIs and SLOs
    3. Rationale
    4. Error Budget
    5. Clarifications and Caveats
  30. B. Example Error Budget Policy
    1. Service Overview
    2. Goals
    3. Non-Goals
    4. SLO Miss Policy
    5. Outage Policy
    6. Escalation Policy
    7. Background
  31. C. Results of Postmortem Analysis
  32. Index