The Site Reliability Workbook

by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne

July 2018

Intermediate to advanced

512 pages

13h 58m

English

O'Reilly Media, Inc.

Foreword I
Foreword II
Preface
Conventions Used in This Book · Using Code Examples · O’Reilly Safari · How to Contact Us · Acknowledgments
1. How SRE Relates to DevOps
Background on DevOps · No More Silos · Accidents Are Normal · Change Should Be Gradual · Tooling and Culture Are Interrelated · Measurement Is Crucial · Background on SRE · Operations Is a Software Problem · Manage by Service Level Objectives (SLOs) · Work to Minimize Toil · Automate This Year’s Job Away · Move Fast by Reducing the Cost of Failure · Share Ownership with Developers · Use the Same Tooling, Regardless of Function or Job Title · Compare and Contrast · Organizational Context and Fostering Successful Adoption · Narrow, Rigid Incentives Narrow Your Success · It’s Better to Fix It Yourself; Don’t Blame Someone Else · Consider Reliability Work as a Specialized Role · When Can Substitute for Whether · Strive for Parity of Esteem: Career and Financial · Conclusion
I. Foundations
2. Implementing SLOs
Why SREs Need SLOs · Getting Started · Reliability Targets and Error Budgets · What to Measure: Using SLIs · A Worked Example · Moving from SLI Specification to SLI Implementation · Measuring the SLIs · Using the SLIs to Calculate Starter SLOs · Choosing an Appropriate Time Window · Getting Stakeholder Agreement · Establishing an Error Budget Policy · Documenting the SLO and Error Budget Policy · Dashboards and Reports · Continuous Improvement of SLO Targets · Improving the Quality of Your SLO · Decision Making Using SLOs and Error Budgets · Advanced Topics · Modeling User Journeys · Grading Interaction Importance · Modeling Dependencies · Experimenting with Relaxing Your SLOs · Conclusion
3. SLO Engineering Case Studies
Evernote’s SLO Story · Why Did Evernote Adopt the SRE Model? · Introduction of SLOs: A Journey in Progress · Breaking Down the SLO Wall Between Customer and Cloud Provider · Current State · The Home Depot’s SLO Story · The SLO Culture Project · Our First Set of SLOs · Evangelizing SLOs · Automating VALET Data Collection · The Proliferation of SLOs · Applying VALET to Batch Applications · Using VALET in Testing · Future Aspirations · Summary · Conclusion
4. Monitoring
Desirable Features of a Monitoring Strategy · Speed · Calculations · Interfaces · Alerts · Sources of Monitoring Data · Examples · Managing Your Monitoring System · Treat Your Configuration as Code · Encourage Consistency · Prefer Loose Coupling · Metrics with Purpose · Intended Changes · Dependencies · Saturation · Status of Served Traffic · Implementing Purposeful Metrics · Testing Alerting Logic · Conclusion
5. Alerting on SLOs
Alerting Considerations · Ways to Alert on Significant Events · 1: Target Error Rate ≥ SLO Threshold · 2: Increased Alert Window · 3: Incrementing Alert Duration · 4: Alert on Burn Rate · 5: Multiple Burn Rate Alerts · 6: Multiwindow, Multi-Burn-Rate Alerts · Low-Traffic Services and Error Budget Alerting · Generating Artificial Traffic · Combining Services · Making Service and Infrastructure Changes · Lowering the SLO or Increasing the Window · Extreme Availability Goals · Alerting at Scale · Conclusion
6. Eliminating Toil
What Is Toil? · Measuring Toil · Toil Taxonomy · Business Processes · Production Interrupts · Release Shepherding · Migrations · Cost Engineering and Capacity Planning · Troubleshooting for Opaque Architectures · Toil Management Strategies · Identify and Measure Toil · Engineer Toil Out of the System · Reject the Toil · Use SLOs to Reduce Toil · Start with Human-Backed Interfaces · Provide Self-Service Methods · Get Support from Management and Colleagues · Promote Toil Reduction as a Feature · Start Small and Then Improve · Increase Uniformity · Assess Risk Within Automation · Automate Toil Response · Use Open Source and Third-Party Tools · Use Feedback to Improve · Case Studies · Case Study 1: Reducing Toil in the Datacenter with Automation · Background · Problem Statement · What We Decided to Do · Design First Effort: Saturn Line-Card Repair · Implementation · Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair · Implementation · Lessons Learned · Case Study 2: Decommissioning Filer-Backed Home Directories · Background · Problem Statement · What We Decided to Do · Design and Implementation · Key Components · Lessons Learned · Conclusion

7. Simplicity
Measuring Complexity · Simplicity Is End-to-End, and SREs Are Good for That · Case Study 1: End-to-End API Simplicity · Case Study 2: Project Lifecycle Complexity · Regaining Simplicity · Case Study 3: Simplification of the Display Ads Spiderweb · Case Study 4: Running Hundreds of Microservices on a Shared Platform · Case Study 5: pDNS No Longer Depends on Itself · Conclusion
II. Practices
8. On-Call
Recap of “Being On-Call” Chapter of First SRE Book · Example On-Call Setups Within Google and Outside Google · Google: Forming a New Team · Evernote: Finding Our Feet in the Cloud · Practical Implementation Details · Anatomy of Pager Load · On-Call Flexibility · On-Call Team Dynamics · Conclusion
9. Incident Response
Incident Management at Google · Incident Command System · Main Roles in Incident Response · Case Studies · Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home · Case Study 2: Service Fault—Cache Me If You Can · Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does · Case Study 4: Incident Response at PagerDuty · Putting Best Practices into Practice · Incident Response Training · Prepare Beforehand · Drills · Conclusion
10. Postmortem Culture: Learning from Failure
Case Study · Bad Postmortem · Why Is This Postmortem Bad? · Good Postmortem · Why Is This Postmortem Better? · Organizational Incentives · Model and Enforce Blameless Behavior · Reward Postmortem Outcomes · Share Postmortems Openly · Respond to Postmortem Culture Failures · Tools and Templates · Postmortem Templates · Postmortem Tooling · Conclusion
11. Managing Load
Google Cloud Load Balancing · Anycast · Maglev · Global Software Load Balancer · Google Front End · GCLB: Low Latency · GCLB: High Availability · Case Study 1: Pokémon GO on GCLB · Autoscaling · Handling Unhealthy Machines · Working with Stateful Systems · Configuring Conservatively · Setting Constraints · Including Kill Switches and Manual Overrides · Avoiding Overloading Backends · Avoiding Traffic Imbalance · Combining Strategies to Manage Load · Case Study 2: When Load Shedding Attacks · Conclusion
12. Introducing Non-Abstract Large System Design
What Is NALSD? · Why “Non-Abstract”? · AdWords Example · Design Process · Initial Requirements · One Machine · Distributed System · Conclusion
13. Data Processing Pipelines
Pipeline Applications · Event Processing/Data Transformation to Order or Structure Data · Data Analytics · Machine Learning · Pipeline Best Practices · Define and Measure Service Level Objectives · Plan for Dependency Failure · Create and Maintain Pipeline Documentation · Map Your Development Lifecycle · Reduce Hotspotting and Workload Patterns · Implement Autoscaling and Resource Planning · Adhere to Access Control and Security Policies · Plan Escalation Paths · Pipeline Requirements and Design · What Features Do You Need? · Idempotent and Two-Phase Mutations · Checkpointing · Code Patterns · Pipeline Production Readiness · Pipeline Failures: Prevention and Response · Potential Failure Modes · Potential Causes · Case Study: Spotify · Event Delivery · Event Delivery System Design and Architecture · Event Delivery System Operation · Customer Integration and Support · Summary · Conclusion
14. Configuration Design and Best Practices
What Is Configuration? · Configuration and Reliability · Separating Philosophy and Mechanics · Configuration Philosophy · Configuration Asks Users Questions · Questions Should Be Close to User Goals · Mandatory and Optional Questions · Escaping Simplicity · Mechanics of Configuration · Separate Configuration and Resulting Data · Importance of Tooling · Ownership and Change Tracking · Safe Configuration Change Application · Conclusion
15. Configuration Specifics
Configuration-Induced Toil · Reducing Configuration-Induced Toil · Critical Properties and Pitfalls of Configuration Systems · Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem · Pitfall 2: Designing Accidental or Ad Hoc Language Features · Pitfall 3: Building Too Much Domain-Specific Optimization · Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects” · Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua · Integrating a Configuration Language · Generating Config in Specific Formats · Driving Multiple Applications · Integrating an Existing Application: Kubernetes · What Kubernetes Provides · Example Kubernetes Config · Integrating the Configuration Language · Integrating Custom Applications (In-House Software) · Effectively Operating a Configuration System · Versioning · Source Control · Tooling · Testing · When to Evaluate Configuration · Very Early: Checking in the JSON · Middle of the Road: Evaluate at Build Time · Late: Evaluate at Runtime · Guarding Against Abusive Configuration · Conclusion
16. Canarying Releases
Release Engineering Principles · Balancing Release Velocity and Reliability · What Is Canarying? · Release Engineering and Canarying · Requirements of a Canary Process · Our Example Setup · A Roll Forward Deployment Versus a Simple Canary Deployment · Canary Implementation · Minimizing Risk to SLOs and the Error Budget · Choosing a Canary Population and Duration · Selecting and Evaluating Metrics · Metrics Should Indicate Problems · Metrics Should Be Representative and Attributable · Before/After Evaluation Is Risky · Use a Gradual Canary for Better Metric Selection · Dependencies and Isolation · Canarying in Noninteractive Systems · Requirements on Monitoring Data · Related Concepts · Blue/Green Deployment · Artificial Load Generation · Traffic Teeing · Conclusion
III. Processes
17. Identifying and Recovering from Overload
From Load to Overload · Case Study 1: Work Overload When Half a Team Leaves · Background · Problem Statement · What We Decided to Do · Implementation · Lessons Learned · Case Study 2: Perceived Overload After Organizational and Workload Changes · Background · Problem Statement · What We Decided to Do · Implementation · Effects · Lessons Learned · Strategies for Mitigating Overload · Recognizing the Symptoms of Overload · Reducing Overload and Restoring Team Health · Conclusion
18. SRE Engagement Model
The Service Lifecycle · Phase 1: Architecture and Design · Phase 2: Active Development · Phase 3: Limited Availability · Phase 4: General Availability · Phase 5: Deprecation · Phase 6: Abandoned · Phase 7: Unsupported · Setting Up the Relationship · Communicating Business and Production Priorities · Identifying Risks · Aligning Goals · Setting Ground Rules · Planning and Executing · Sustaining an Effective Ongoing Relationship · Investing Time in Working Better Together · Maintaining an Open Line of Communication · Performing Regular Service Reviews · Reassessing When Ground Rules Start to Slip · Adjusting Priorities According to Your SLOs and Error Budget · Handling Mistakes Appropriately · Scaling SRE to Larger Environments · Supporting Multiple Services with a Single SRE Team · Structuring a Multiple SRE Team Environment · Adapting SRE Team Structures to Changing Circumstances · Running Cohesive Distributed SRE Teams · Ending the Relationship · Case Study 1: Ares · Case Study 2: Data Analysis Pipeline · Conclusion
19. SRE: Reaching Beyond Your Walls
Truths We Hold to Be Self-Evident · Reliability Is the Most Important Feature · Your Users, Not Your Monitoring, Decide Your Reliability · If You Run a Platform, Then Reliability Is a Partnership · Everything Important Eventually Becomes a Platform · When Your Customers Have a Hard Time, You Have to Slow Down · You Will Need to Practice SRE with Your Customers · How to: SRE with Your Customers · Step 1: SLOs and SLIs Are How You Speak · Step 2: Audit the Monitoring and Build Shared Dashboards · Step 3: Measure and Renegotiate · Step 4: Design Reviews and Risk Analysis · Step 5: Practice, Practice, Practice · Be Thoughtful and Disciplined · Conclusion
20. SRE Team Lifecycles
SRE Practices Without SREs · Starting an SRE Role · Finding Your First SRE · Placing Your First SRE · Bootstrapping Your First SRE · Distributed SREs · Your First SRE Team · Forming · Storming · Norming · Performing · Making More SRE Teams · Service Complexity · SRE Rollout · Geographical Splits · Suggested Practices for Running Many Teams · Mission Control · SRE Exchange · Training · Horizontal Projects · SRE Mobility · Travel · Launch Coordination Engineering Teams · Production Excellence · SRE Funding and Hiring · Conclusion
21. Organizational Change Management in SRE
SRE Embraces Change · Introduction to Change Management · Lewin’s Three-Stage Model · McKinsey’s 7-S Model · Kotter’s Eight-Step Process for Leading Change · The Prosci ADKAR Model · Emotion-Based Models · The Deming Cycle · How These Theories Apply to SRE · Case Study 1: Scaling Waze—From Ad Hoc to Planned Change · Background · The Messaging Queue: Replacing a System While Maintaining Reliability · The Next Cycle of Change: Improving the Deployment Process · Lessons Learned · Case Study 2: Common Tooling Adoption in SRE · Background · Problem Statement · What We Decided to Do · Design · Implementation: Monitoring · Lessons Learned · Conclusion
Conclusion
Onward… · The Future Belongs to the Past · SRE + <Insert Other Discipline> · Trickles, Streams, and Floods · SRE Belongs to All of Us · On Gratitude
A. Example SLO Document
Service Overview · SLIs and SLOs · Rationale · Error Budget · Clarifications and Caveats
B. Example Error Budget Policy
Service Overview · Goals · Non-Goals · SLO Miss Policy · Outage Policy · Escalation Policy · Background
C. Results of Postmortem Analysis
Index

Content preview from The Site Reliability Workbook

Part II. Practices

Building upon the solid foundation of SRE principles covered in Part I, Part II dives deep into how to conduct SRE-related activities that Google has found important for operating at scale.

Some of these topics, such as data processing pipelines and managing load, won’t apply to all organizations. Other topics, such as safely handling changes with configuration and canarying, on-call practices, and what to do when things go wrong, contain valuable lessons for any SRE team.

This part also introduces an important SRE skill—Non-Abstract Large System Design (NALSD)—and presents a detailed example of how to practice this design process.

As we move from SRE foundations to practices, we want to provide more context on the relationship between operational duties and project work, and on the engineering it takes to accomplish both strategically.

Publisher Resources

ISBN: 9781492029496