Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

April 2016

Intermediate to advanced

552 pages

15h 44m

English

O'Reilly Media, Inc.

Content preview from Site Reliability Engineering

Chapter 15. Postmortem Culture: Learning from Failure

Written by John Lunney and Sue Lueder

Edited by Gary O’Connor

The cost of failure is education.

Devin Carraway

As SREs, we work with large-scale, complex, distributed systems. We constantly enhance our services with new features and add new systems. Incidents and outages are inevitable given our scale and velocity of change. When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we have some formalized process of learning from these incidents in place, they may recur ad infinitum. Left unchecked, incidents can multiply in complexity or even cascade, overwhelming a system and its operators and ultimately impacting our users. Therefore, postmortems are an essential tool for SRE.

The postmortem concept is well known in the technology industry [All12]. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. This chapter describes criteria for deciding when to conduct postmortems, some best practices around postmortems, and advice on how to cultivate a postmortem culture based on the experience we’ve gained over the years.
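The parts of a postmortem listed in the definition above can be pictured as fields of a simple record. The following is a minimal sketch, not a template from the book: the class name `Postmortem`, its field names, the `missing_sections` helper, and the sample values are all hypothetical, chosen only to mirror the definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record mirroring the parts of a postmortem named in the text."""

    incident: str  # what happened
    impact: str    # who or what was affected
    mitigation_actions: List[str] = field(default_factory=list)  # actions taken to mitigate or resolve
    root_causes: List[str] = field(default_factory=list)         # one or more root causes
    follow_up_actions: List[str] = field(default_factory=list)   # actions to prevent recurrence

    def missing_sections(self) -> List[str]:
        """Return the names of list-valued sections that are still empty."""
        sections = {
            "mitigation_actions": self.mitigation_actions,
            "root_causes": self.root_causes,
            "follow_up_actions": self.follow_up_actions,
        }
        return [name for name, value in sections.items() if not value]


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    draft = Postmortem(
        incident="Search frontend returned errors",
        impact="Some user queries failed",
        mitigation_actions=["Rolled back the most recent release"],
    )
    print(draft.missing_sections())
```

A helper like `missing_sections` shows one way a draft could be checked for completeness before review; which sections a real postmortem requires is a decision for the team writing it.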

Google’s Postmortem Philosophy

The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive ...