Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

April 2016

Intermediate to advanced

552 pages

15h 44m

English

O'Reilly Media, Inc.

Audiobook available

Foreword
Preface
Conventions Used in This Book | Using Code Examples | O’Reilly Safari | How to Contact Us | Acknowledgments
I. Introduction
1. Introduction
The Sysadmin Approach to Service Management | Google’s Approach to Service Management: Site Reliability Engineering | Tenets of SRE | Ensuring a Durable Focus on Engineering | Pursuing Maximum Change Velocity Without Violating a Service’s SLO | Monitoring | Emergency Response | Change Management | Demand Forecasting and Capacity Planning | Provisioning | Efficiency and Performance | The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware | System Software That “Organizes” the Hardware | Managing Machines | Storage | Networking | Other System Software | Lock Service | Monitoring and Alerting | Our Software Infrastructure | Our Development Environment | Shakespeare: A Sample Service | Life of a Request | Job and Data Organization
II. Principles
3. Embracing Risk
Managing Risk | Measuring Service Risk | Risk Tolerance of Services | Identifying the Risk Tolerance of Consumer Services | Identifying the Risk Tolerance of Infrastructure Services | Motivation for Error Budgets | Forming Your Error Budget | Benefits
4. Service Level Objectives
Service Level Terminology | Indicators | Objectives | Agreements | Indicators in Practice | What Do You and Your Users Care About? | Collecting Indicators | Aggregation | Standardize Indicators | Objectives in Practice | Defining Objectives | Choosing Targets | Control Measures | SLOs Set Expectations | Agreements in Practice
5. Eliminating Toil
Toil Defined | Why Less Toil Is Better | What Qualifies as Engineering? | Is Toil Always Bad? | Conclusion
6. Monitoring Distributed Systems
Definitions | Why Monitor? | Setting Reasonable Expectations for Monitoring | Symptoms Versus Causes | Black-Box Versus White-Box | The Four Golden Signals | Worrying About Your Tail (or, Instrumentation and Performance) | Choosing an Appropriate Resolution for Measurements | As Simple as Possible, No Simpler | Tying These Principles Together | Monitoring for the Long Term | Bigtable SRE: A Tale of Over-Alerting | Gmail: Predictable, Scriptable Responses from Humans | The Long Run | Conclusion

7. The Evolution of Automation at Google
The Value of Automation | Consistency | A Platform | Faster Repairs | Faster Action | Time Saving | The Value for Google SRE | The Use Cases for Automation | Google SRE’s Use Cases for Automation | A Hierarchy of Automation Classes | Automate Yourself Out of a Job: Automate ALL the Things! | Soothing the Pain: Applying Automation to Cluster Turnups | Detecting Inconsistencies with Prodtest | Resolving Inconsistencies Idempotently | The Inclination to Specialize | Service-Oriented Cluster-Turnup | Borg: Birth of the Warehouse-Scale Computer | Reliability Is the Fundamental Feature | Recommendations
8. Release Engineering
The Role of a Release Engineer | Philosophy | Self-Service Model | High Velocity | Hermetic Builds | Enforcement of Policies and Procedures | Continuous Build and Deployment | Building | Branching | Testing | Packaging | Rapid | Deployment | Configuration Management | Conclusions | It’s Not Just for Googlers | Start Release Engineering at the Beginning
9. Simplicity
System Stability Versus Agility | The Virtue of Boring | I Won’t Give Up My Code! | The “Negative Lines of Code” Metric | Minimal APIs | Modularity | Release Simplicity | A Simple Conclusion
III. Practices
10. Practical Alerting from Time-Series Data
The Rise of Borgmon | Instrumentation of Applications | Collection of Exported Data | Storage in the Time-Series Arena | Labels and Vectors | Rule Evaluation | Alerting | Sharding the Monitoring Topology | Black-Box Monitoring | Maintaining the Configuration | Ten Years On…
11. Being On-Call
Introduction | Life of an On-Call Engineer | Balanced On-Call | Balance in Quantity | Balance in Quality | Compensation | Feeling Safe | Avoiding Inappropriate Operational Load | Operational Overload | A Treacherous Enemy: Operational Underload | Conclusions
12. Effective Troubleshooting
Theory | In Practice | Problem Report | Triage | Examine | Diagnose | Test and Treat | Negative Results Are Magic | Cure | Case Study | Making Troubleshooting Easier | Conclusion
13. Emergency Response
What to Do When Systems Break | Test-Induced Emergency | Details | Response | Findings | Change-Induced Emergency | Details | Response | Findings | Process-Induced Emergency | Details | Response | Findings | All Problems Have Solutions | Learn from the Past. Don’t Repeat It. | Keep a History of Outages | Ask the Big, Even Improbable, Questions: What If…? | Encourage Proactive Testing | Conclusion
14. Managing Incidents
Unmanaged Incidents | The Anatomy of an Unmanaged Incident | Sharp Focus on the Technical Problem | Poor Communication | Freelancing | Elements of Incident Management Process | Recursive Separation of Responsibilities | A Recognized Command Post | Live Incident State Document | Clear, Live Handoff | A Managed Incident | When to Declare an Incident | In Summary
15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy | Collaborate and Share Knowledge | Introducing a Postmortem Culture | Conclusion and Ongoing Improvements
16. Tracking Outages
Escalator | Outalator | Aggregation | Tagging | Analysis | Unexpected Benefits
17. Testing for Reliability
Types of Software Testing | Traditional Tests | Production Tests | Creating a Test and Build Environment | Testing at Scale | Testing Scalable Tools | Testing Disaster | The Need for Speed | Pushing to Production | Expect Testing Fail | Integration | Production Probes | Conclusion
18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important? | Auxon Case Study: Project Background and Problem Space | Traditional Capacity Planning | Our Solution: Intent-Based Capacity Planning | Intent-Based Capacity Planning | Precursors to Intent | Introduction to Auxon | Requirements and Implementation: Successes and Lessons Learned | Raising Awareness and Driving Adoption | Team Dynamics | Fostering Software Engineering in SRE | Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time | Getting There | Conclusions
19. Load Balancing at the Frontend
Power Isn’t the Answer | Load Balancing Using DNS | Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
The Ideal Case | Identifying Bad Tasks: Flow Control and Lame Ducks | A Simple Approach to Unhealthy Tasks: Flow Control | A Robust Approach to Unhealthy Tasks: Lame Duck State | Limiting the Connections Pool with Subsetting | Picking the Right Subset | A Subset Selection Algorithm: Random Subsetting | A Subset Selection Algorithm: Deterministic Subsetting | Load Balancing Policies | Simple Round Robin | Least-Loaded Round Robin | Weighted Round Robin
21. Handling Overload
The Pitfalls of “Queries per Second” | Per-Customer Limits | Client-Side Throttling | Criticality | Utilization Signals | Handling Overload Errors | Deciding to Retry | Load from Connections | Conclusions
22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them | Server Overload | Resource Exhaustion | Service Unavailability | Preventing Server Overload | Queue Management | Load Shedding and Graceful Degradation | Retries | Latency and Deadlines | Slow Startup and Cold Caching | Always Go Downward in the Stack | Triggering Conditions for Cascading Failures | Process Death | Process Updates | New Rollouts | Organic Growth | Planned Changes, Drains, or Turndowns | Testing for Cascading Failures | Test Until Failure and Beyond | Test Popular Clients | Test Noncritical Backends | Immediate Steps to Address Cascading Failures | Increase Resources | Stop Health Check Failures/Deaths | Restart Servers | Drop Traffic | Enter Degraded Modes | Eliminate Batch Load | Eliminate Bad Traffic | Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure | Case Study 1: The Split-Brain Problem | Case Study 2: Failover Requires Human Intervention | Case Study 3: Faulty Group-Membership Algorithms | How Distributed Consensus Works | Paxos Overview: An Example Protocol | System Architecture Patterns for Distributed Consensus | Reliable Replicated State Machines | Reliable Replicated Datastores and Configuration Stores | Highly Available Processing Using Leader Election | Distributed Coordination and Locking Services | Reliable Distributed Queuing and Messaging | Distributed Consensus Performance | Multi-Paxos: Detailed Message Flow | Scaling Read-Heavy Workloads | Quorum Leases | Distributed Consensus Performance and Network Latency | Reasoning About Performance: Fast Paxos | Stable Leaders | Batching | Disk Access | Deploying Distributed Consensus-Based Systems | Number of Replicas | Location of Replicas | Capacity and Load Balancing | Monitoring Distributed Consensus Systems | Conclusion
24. Distributed Periodic Scheduling with Cron
Cron | Introduction | Reliability Perspective | Cron Jobs and Idempotency | Cron at Large Scale | Extended Infrastructure | Extended Requirements | Building Cron at Google | Tracking the State of Cron Jobs | The Use of Paxos | The Roles of the Leader and the Follower | Storing the State | Running Large Cron | Summary
25. Data Processing Pipelines
Origin of the Pipeline Design Pattern | Initial Effect of Big Data on the Simple Pipeline Pattern | Challenges with the Periodic Pipeline Pattern | Trouble Caused By Uneven Work Distribution | Drawbacks of Periodic Pipelines in Distributed Environments | Monitoring Problems in Periodic Pipelines | “Thundering Herd” Problems | Moiré Load Pattern | Introduction to Google Workflow | Workflow as Model-View-Controller Pattern | Stages of Execution in Workflow | Workflow Correctness Guarantees | Ensuring Business Continuity | Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements | Choosing a Strategy for Superior Data Integrity | Backups Versus Archives | Requirements of the Cloud Environment in Perspective | Google SRE Objectives in Maintaining Data Integrity and Availability | Data Integrity Is the Means; Data Availability Is the Goal | Delivering a Recovery System, Rather Than a Backup System | Types of Failures That Lead to Data Loss | Challenges of Maintaining Data Integrity Deep and Wide | How Google SRE Faces the Challenges of Data Integrity | The 24 Combinations of Data Integrity Failure Modes | First Layer: Soft Deletion | Second Layer: Backups and Their Related Recovery Methods | Overarching Layer: Replication | 1T Versus 1E: Not “Just” a Bigger Backup | Third Layer: Early Detection | Knowing That Data Recovery Will Work | Case Studies | Gmail—February, 2011: Restore from GTape | Google Music—March 2012: Runaway Deletion Detection | General Principles of SRE as Applied to Data Integrity | Beginner’s Mind | Trust but Verify | Hope Is Not a Strategy | Defense in Depth | Conclusion
27. Reliable Product Launches at Scale
Launch Coordination Engineering | The Role of the Launch Coordination Engineer | Setting Up a Launch Process | The Launch Checklist | Driving Convergence and Simplification | Launching the Unexpected | Developing a Launch Checklist | Architecture and Dependencies | Integration | Capacity Planning | Failure Modes | Client Behavior | Processes and Automation | Development Process | External Dependencies | Rollout Planning | Selected Techniques for Reliable Launches | Gradual and Staged Rollouts | Feature Flag Frameworks | Dealing with Abusive Client Behavior | Overload Behavior and Load Tests | Development of LCE | Evolution of the LCE Checklist | Problems LCE Didn’t Solve | Conclusion
IV. Management
28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What? | Initial Learning Experiences: The Case for Structure Over Chaos | Learning Paths That Are Cumulative and Orderly | Targeted Project Work, Not Menial Work | Creating Stellar Reverse Engineers and Improvisational Thinkers | Reverse Engineers: Figuring Out How Things Work | Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure | Improv Artists: When the Unexpected Happens | Tying This Together: Reverse Engineering a Production Service | Five Practices for Aspiring On-Callers | A Hunger for Failure: Reading and Sharing Postmortems | Disaster Role Playing | Break Real Things, Fix Real Things | Documentation as Apprenticeship | Shadow On-Call Early and Often | On-Call and Beyond: Rites of Passage, and Practicing Continuing Education | Closing Thoughts
29. Dealing with Interrupts
Managing Operational Load | Factors in Determining How Interrupts Are Handled | Imperfect Machines | Cognitive Flow State | Do One Thing Well | Seriously, Tell Me What to Do | Reducing Interrupts
30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context | Identify the Largest Sources of Stress | Identify Kindling | Phase 2: Sharing Context | Write a Good Postmortem for the Team | Sort Fires According to Type | Phase 3: Driving Change | Start with the Basics | Get Help Clearing Kindling | Explain Your Reasoning | Ask Leading Questions | Conclusion
31. Communication and Collaboration in SRE
Communications: Production Meetings | Agenda | Attendance | Collaboration within SRE | Team Composition | Techniques for Working Effectively | Case Study of Collaboration in SRE: Viceroy | The Coming of the Viceroy | Challenges | Recommendations | Collaboration Outside SRE | Case Study: Migrating DFP to F1 | Conclusion
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why | The PRR Model | The SRE Engagement Model | Alternative Support | Production Readiness Reviews: Simple PRR Model | Engagement | Analysis | Improvements and Refactoring | Training | Onboarding | Continuous Improvement | Evolving the Simple PRR Model: Early Engagement | Candidates for Early Engagement | Benefits of the Early Engagement Model | Evolving Services Development: Frameworks and SRE Platform | Lessons Learned | External Factors Affecting SRE | Toward a Structural Solution: Frameworks | New Service and Management Benefits | Conclusion
V. Conclusions
33. Lessons Learned from Other Industries
Meet Our Industry Veterans | Preparedness and Disaster Testing | Relentless Organizational Focus on Safety | Attention to Detail | Swing Capacity | Simulations and Live Drills | Training and Certification | Focus on Detailed Requirements Gathering and Design | Defense in Depth and Breadth | Postmortem Culture | Automating Away Repetitive Work and Operational Overhead | Structured and Rational Decision Making | Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
Fail Sanely | Progressive Rollouts | Define SLOs Like a User | Error Budgets | Monitoring | Postmortems | Capacity Planning | Overloads and Failure | SRE Teams
C. Example Incident State Document
D. Example Postmortem
Lessons Learned | Timeline | Supporting information:
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index

Content preview from Site Reliability Engineering

Part II. Principles

This section examines the principles underlying how SRE teams typically work—the patterns, behaviors, and areas of concern that influence the general domain of SRE operations.

The first chapter in this section, and the most important piece to read if you want to attain the widest-angle picture of what exactly SRE does, and how we reason about it, is Chapter 3, Embracing Risk. It looks at SRE through the lens of risk—its assessment, management, and the use of error budgets to provide usefully neutral approaches to service management.

Service level objectives are another foundational conceptual unit for SRE. The industry commonly lumps disparate concepts under the general banner of service level agreements, a tendency that makes it harder to think about these concepts clearly. Chapter 4, Service Level Objectives, attempts to disentangle indicators from objectives from agreements, examines how SRE uses each of these terms, and provides some recommendations on how to find useful metrics for your own applications.

Eliminating toil is one of SRE’s most important tasks, and is the subject of Chapter 5, Eliminating Toil. We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.

Whether it is at Google or elsewhere, monitoring is an absolutely essential component of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, ...

Publisher Resources

ISBN: 9781491929117