Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

April 2016

Intermediate to advanced

552 pages

15h 44m

English

O'Reilly Media, Inc.

Audiobook available

Foreword
Preface
Conventions Used in This Book · Using Code Examples · O’Reilly Safari · How to Contact Us · Acknowledgments
I. Introduction
1. Introduction
The Sysadmin Approach to Service Management · Google’s Approach to Service Management: Site Reliability Engineering · Tenets of SRE · Ensuring a Durable Focus on Engineering · Pursuing Maximum Change Velocity Without Violating a Service’s SLO · Monitoring · Emergency Response · Change Management · Demand Forecasting and Capacity Planning · Provisioning · Efficiency and Performance · The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware · System Software That “Organizes” the Hardware · Managing Machines · Storage · Networking · Other System Software · Lock Service · Monitoring and Alerting · Our Software Infrastructure · Our Development Environment · Shakespeare: A Sample Service · Life of a Request · Job and Data Organization
II. Principles
3. Embracing Risk
Managing Risk · Measuring Service Risk · Risk Tolerance of Services · Identifying the Risk Tolerance of Consumer Services · Identifying the Risk Tolerance of Infrastructure Services · Motivation for Error Budgets · Forming Your Error Budget · Benefits
4. Service Level Objectives
Service Level Terminology · Indicators · Objectives · Agreements · Indicators in Practice · What Do You and Your Users Care About? · Collecting Indicators · Aggregation · Standardize Indicators · Objectives in Practice · Defining Objectives · Choosing Targets · Control Measures · SLOs Set Expectations · Agreements in Practice
5. Eliminating Toil
Toil Defined · Why Less Toil Is Better · What Qualifies as Engineering? · Is Toil Always Bad? · Conclusion
6. Monitoring Distributed Systems
Definitions · Why Monitor? · Setting Reasonable Expectations for Monitoring · Symptoms Versus Causes · Black-Box Versus White-Box · The Four Golden Signals · Worrying About Your Tail (or, Instrumentation and Performance) · Choosing an Appropriate Resolution for Measurements · As Simple as Possible, No Simpler · Tying These Principles Together · Monitoring for the Long Term · Bigtable SRE: A Tale of Over-Alerting · Gmail: Predictable, Scriptable Responses from Humans · The Long Run · Conclusion

7. The Evolution of Automation at Google
The Value of Automation · Consistency · A Platform · Faster Repairs · Faster Action · Time Saving · The Value for Google SRE · The Use Cases for Automation · Google SRE’s Use Cases for Automation · A Hierarchy of Automation Classes · Automate Yourself Out of a Job: Automate ALL the Things! · Soothing the Pain: Applying Automation to Cluster Turnups · Detecting Inconsistencies with Prodtest · Resolving Inconsistencies Idempotently · The Inclination to Specialize · Service-Oriented Cluster-Turnup · Borg: Birth of the Warehouse-Scale Computer · Reliability Is the Fundamental Feature · Recommendations
8. Release Engineering
The Role of a Release Engineer · Philosophy · Self-Service Model · High Velocity · Hermetic Builds · Enforcement of Policies and Procedures · Continuous Build and Deployment · Building · Branching · Testing · Packaging · Rapid · Deployment · Configuration Management · Conclusions · It’s Not Just for Googlers · Start Release Engineering at the Beginning
9. Simplicity
System Stability Versus Agility · The Virtue of Boring · I Won’t Give Up My Code! · The “Negative Lines of Code” Metric · Minimal APIs · Modularity · Release Simplicity · A Simple Conclusion
III. Practices
10. Practical Alerting from Time-Series Data
The Rise of Borgmon · Instrumentation of Applications · Collection of Exported Data · Storage in the Time-Series Arena · Labels and Vectors · Rule Evaluation · Alerting · Sharding the Monitoring Topology · Black-Box Monitoring · Maintaining the Configuration · Ten Years On…
11. Being On-Call
Introduction · Life of an On-Call Engineer · Balanced On-Call · Balance in Quantity · Balance in Quality · Compensation · Feeling Safe · Avoiding Inappropriate Operational Load · Operational Overload · A Treacherous Enemy: Operational Underload · Conclusions
12. Effective Troubleshooting
Theory · In Practice · Problem Report · Triage · Examine · Diagnose · Test and Treat · Negative Results Are Magic · Cure · Case Study · Making Troubleshooting Easier · Conclusion
13. Emergency Response
What to Do When Systems Break · Test-Induced Emergency · Details · Response · Findings · Change-Induced Emergency · Details · Response · Findings · Process-Induced Emergency · Details · Response · Findings · All Problems Have Solutions · Learn from the Past. Don’t Repeat It. · Keep a History of Outages · Ask the Big, Even Improbable, Questions: What If…? · Encourage Proactive Testing · Conclusion
14. Managing Incidents
Unmanaged Incidents · The Anatomy of an Unmanaged Incident · Sharp Focus on the Technical Problem · Poor Communication · Freelancing · Elements of Incident Management Process · Recursive Separation of Responsibilities · A Recognized Command Post · Live Incident State Document · Clear, Live Handoff · A Managed Incident · When to Declare an Incident · In Summary
15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy · Collaborate and Share Knowledge · Introducing a Postmortem Culture · Conclusion and Ongoing Improvements
16. Tracking Outages
Escalator · Outalator · Aggregation · Tagging · Analysis · Unexpected Benefits
17. Testing for Reliability
Types of Software Testing · Traditional Tests · Production Tests · Creating a Test and Build Environment · Testing at Scale · Testing Scalable Tools · Testing Disaster · The Need for Speed · Pushing to Production · Expect Testing Fail · Integration · Production Probes · Conclusion
18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important? · Auxon Case Study: Project Background and Problem Space · Traditional Capacity Planning · Our Solution: Intent-Based Capacity Planning · Intent-Based Capacity Planning · Precursors to Intent · Introduction to Auxon · Requirements and Implementation: Successes and Lessons Learned · Raising Awareness and Driving Adoption · Team Dynamics · Fostering Software Engineering in SRE · Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time · Getting There · Conclusions
19. Load Balancing at the Frontend
Power Isn’t the Answer · Load Balancing Using DNS · Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
The Ideal Case · Identifying Bad Tasks: Flow Control and Lame Ducks · A Simple Approach to Unhealthy Tasks: Flow Control · A Robust Approach to Unhealthy Tasks: Lame Duck State · Limiting the Connections Pool with Subsetting · Picking the Right Subset · A Subset Selection Algorithm: Random Subsetting · A Subset Selection Algorithm: Deterministic Subsetting · Load Balancing Policies · Simple Round Robin · Least-Loaded Round Robin · Weighted Round Robin
21. Handling Overload
The Pitfalls of “Queries per Second” · Per-Customer Limits · Client-Side Throttling · Criticality · Utilization Signals · Handling Overload Errors · Deciding to Retry · Load from Connections · Conclusions
22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them · Server Overload · Resource Exhaustion · Service Unavailability · Preventing Server Overload · Queue Management · Load Shedding and Graceful Degradation · Retries · Latency and Deadlines · Slow Startup and Cold Caching · Always Go Downward in the Stack · Triggering Conditions for Cascading Failures · Process Death · Process Updates · New Rollouts · Organic Growth · Planned Changes, Drains, or Turndowns · Testing for Cascading Failures · Test Until Failure and Beyond · Test Popular Clients · Test Noncritical Backends · Immediate Steps to Address Cascading Failures · Increase Resources · Stop Health Check Failures/Deaths · Restart Servers · Drop Traffic · Enter Degraded Modes · Eliminate Batch Load · Eliminate Bad Traffic · Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure · Case Study 1: The Split-Brain Problem · Case Study 2: Failover Requires Human Intervention · Case Study 3: Faulty Group-Membership Algorithms · How Distributed Consensus Works · Paxos Overview: An Example Protocol · System Architecture Patterns for Distributed Consensus · Reliable Replicated State Machines · Reliable Replicated Datastores and Configuration Stores · Highly Available Processing Using Leader Election · Distributed Coordination and Locking Services · Reliable Distributed Queuing and Messaging · Distributed Consensus Performance · Multi-Paxos: Detailed Message Flow · Scaling Read-Heavy Workloads · Quorum Leases · Distributed Consensus Performance and Network Latency · Reasoning About Performance: Fast Paxos · Stable Leaders · Batching · Disk Access · Deploying Distributed Consensus-Based Systems · Number of Replicas · Location of Replicas · Capacity and Load Balancing · Monitoring Distributed Consensus Systems · Conclusion
24. Distributed Periodic Scheduling with Cron
Cron · Introduction · Reliability Perspective · Cron Jobs and Idempotency · Cron at Large Scale · Extended Infrastructure · Extended Requirements · Building Cron at Google · Tracking the State of Cron Jobs · The Use of Paxos · The Roles of the Leader and the Follower · Storing the State · Running Large Cron · Summary
25. Data Processing Pipelines
Origin of the Pipeline Design Pattern · Initial Effect of Big Data on the Simple Pipeline Pattern · Challenges with the Periodic Pipeline Pattern · Trouble Caused By Uneven Work Distribution · Drawbacks of Periodic Pipelines in Distributed Environments · Monitoring Problems in Periodic Pipelines · “Thundering Herd” Problems · Moiré Load Pattern · Introduction to Google Workflow · Workflow as Model-View-Controller Pattern · Stages of Execution in Workflow · Workflow Correctness Guarantees · Ensuring Business Continuity · Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements · Choosing a Strategy for Superior Data Integrity · Backups Versus Archives · Requirements of the Cloud Environment in Perspective · Google SRE Objectives in Maintaining Data Integrity and Availability · Data Integrity Is the Means; Data Availability Is the Goal · Delivering a Recovery System, Rather Than a Backup System · Types of Failures That Lead to Data Loss · Challenges of Maintaining Data Integrity Deep and Wide · How Google SRE Faces the Challenges of Data Integrity · The 24 Combinations of Data Integrity Failure Modes · First Layer: Soft Deletion · Second Layer: Backups and Their Related Recovery Methods · Overarching Layer: Replication · 1T Versus 1E: Not “Just” a Bigger Backup · Third Layer: Early Detection · Knowing That Data Recovery Will Work · Case Studies · Gmail—February, 2011: Restore from GTape · Google Music—March 2012: Runaway Deletion Detection · General Principles of SRE as Applied to Data Integrity · Beginner’s Mind · Trust but Verify · Hope Is Not a Strategy · Defense in Depth · Conclusion
27. Reliable Product Launches at Scale
Launch Coordination Engineering · The Role of the Launch Coordination Engineer · Setting Up a Launch Process · The Launch Checklist · Driving Convergence and Simplification · Launching the Unexpected · Developing a Launch Checklist · Architecture and Dependencies · Integration · Capacity Planning · Failure Modes · Client Behavior · Processes and Automation · Development Process · External Dependencies · Rollout Planning · Selected Techniques for Reliable Launches · Gradual and Staged Rollouts · Feature Flag Frameworks · Dealing with Abusive Client Behavior · Overload Behavior and Load Tests · Development of LCE · Evolution of the LCE Checklist · Problems LCE Didn’t Solve · Conclusion
IV. Management
28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What? · Initial Learning Experiences: The Case for Structure Over Chaos · Learning Paths That Are Cumulative and Orderly · Targeted Project Work, Not Menial Work · Creating Stellar Reverse Engineers and Improvisational Thinkers · Reverse Engineers: Figuring Out How Things Work · Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure · Improv Artists: When the Unexpected Happens · Tying This Together: Reverse Engineering a Production Service · Five Practices for Aspiring On-Callers · A Hunger for Failure: Reading and Sharing Postmortems · Disaster Role Playing · Break Real Things, Fix Real Things · Documentation as Apprenticeship · Shadow On-Call Early and Often · On-Call and Beyond: Rites of Passage, and Practicing Continuing Education · Closing Thoughts
29. Dealing with Interrupts
Managing Operational Load · Factors in Determining How Interrupts Are Handled · Imperfect Machines · Cognitive Flow State · Do One Thing Well · Seriously, Tell Me What to Do · Reducing Interrupts
30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context · Identify the Largest Sources of Stress · Identify Kindling · Phase 2: Sharing Context · Write a Good Postmortem for the Team · Sort Fires According to Type · Phase 3: Driving Change · Start with the Basics · Get Help Clearing Kindling · Explain Your Reasoning · Ask Leading Questions · Conclusion
31. Communication and Collaboration in SRE
Communications: Production Meetings · Agenda · Attendance · Collaboration within SRE · Team Composition · Techniques for Working Effectively · Case Study of Collaboration in SRE: Viceroy · The Coming of the Viceroy · Challenges · Recommendations · Collaboration Outside SRE · Case Study: Migrating DFP to F1 · Conclusion
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why · The PRR Model · The SRE Engagement Model · Alternative Support · Production Readiness Reviews: Simple PRR Model · Engagement · Analysis · Improvements and Refactoring · Training · Onboarding · Continuous Improvement · Evolving the Simple PRR Model: Early Engagement · Candidates for Early Engagement · Benefits of the Early Engagement Model · Evolving Services Development: Frameworks and SRE Platform · Lessons Learned · External Factors Affecting SRE · Toward a Structural Solution: Frameworks · New Service and Management Benefits · Conclusion
V. Conclusions
33. Lessons Learned from Other Industries
Meet Our Industry Veterans · Preparedness and Disaster Testing · Relentless Organizational Focus on Safety · Attention to Detail · Swing Capacity · Simulations and Live Drills · Training and Certification · Focus on Detailed Requirements Gathering and Design · Defense in Depth and Breadth · Postmortem Culture · Automating Away Repetitive Work and Operational Overhead · Structured and Rational Decision Making · Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
Fail Sanely · Progressive Rollouts · Define SLOs Like a User · Error Budgets · Monitoring · Postmortems · Capacity Planning · Overloads and Failure · SRE Teams
C. Example Incident State Document
D. Example Postmortem
Lessons Learned · Timeline · Supporting information:
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index

Content preview from Site Reliability Engineering

Appendix E. Launch Coordination Checklist

This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

Architecture

Architecture sketch, types of servers, types of requests from clients
Programmatic client requests

Machines and datacenters

Machines and bandwidth, datacenters, N+2 redundancy, network QoS
New domain names, DNS load balancing
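
The N+2 redundancy item can be made concrete with a small capacity check. The sketch below is illustrative and not part of the original checklist. It assumes N+2 means that peak traffic must still fit once the two largest datacenters are unavailable (for example, one planned and one unplanned outage). The function name, the QPS unit, and the sample figures are all hypothetical.

```python
def meets_n_plus_2(capacities_qps, peak_qps):
    """Check whether peak load still fits after losing the two largest sites.

    capacities_qps: serving capacity of each datacenter (hypothetical unit: QPS).
    peak_qps: expected peak load, in the same unit.
    """
    # Drop the two largest capacities: the worst-case pair of outages.
    remaining = sorted(capacities_qps)[:-2]
    return sum(remaining) >= peak_qps


# Hypothetical figures: sites of 100 QPS each, peak load of 200 QPS.
print(meets_n_plus_2([100, 100, 100, 100], peak_qps=200))  # True
print(meets_n_plus_2([100, 100, 100], peak_qps=200))       # False
```

With four equal sites, the two survivors carry exactly the peak. With three, the single survivor cannot, so the launch would need more capacity or another site.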

Volume estimates, capacity, and performance

HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out
Load test, end-to-end test, capacity per datacenter at max latency
Impact on other services we care most about
Storage capacity

System reliability and failover

What happens when:
- Machine dies, rack fails, or cluster goes offline
- Network fails between two datacenters

For each type of server that talks to other servers (its backends):
- How to detect when backends die, and what to do when they die
- How to terminate or restart without affecting clients or users
- Load balancing, rate-limiting, timeout, retry and error handling behavior
Data backup/restore, disaster recovery

Monitoring and server management

Monitoring internal state, monitoring end-to-end behavior, managing alerts
Monitoring the monitoring
Financially important alerts and logs
Tips for running servers within cluster environment
Don’t crash mail servers by sending yourself email alerts in your own server code

Security

Security design review, security code audit, spam ...

Publisher Resources

ISBN: 9781491929117
