Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

April 2016

Intermediate to advanced

552 pages

15h 44m

English

O'Reilly Media, Inc.

Content preview from Site Reliability Engineering

Chapter 6. Monitoring Distributed Systems

Written by Rob Ewaschuk

Edited by Betsy Beyer

Google’s SRE teams have some basic principles and best practices for building successful monitoring and alerting systems. This chapter offers guidelines for what issues should interrupt a human via a page, and how to deal with issues that aren’t serious enough to trigger a page.

Definitions

There’s no uniformly shared vocabulary for discussing all topics related to monitoring. Even within Google, usage of the following terms varies, but the most common interpretations are listed here.

Monitoring: Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.
White-box monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.
Black-box monitoring: Testing externally visible behavior as a user would see it.
Dashboard: An application (usually web-based) that provides a summary view of a service’s core metrics. A dashboard may have filters, selectors, and so on, but is prebuilt to expose the metrics most important to its users. The dashboard might also display team information such as ticket queue length, a list of high-priority bugs, the current on-call engineer for a given area of responsibility, or recent pushes. ...
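As a rough illustration of the white-box and black-box definitions above, the sketch below runs a toy HTTP service that exposes its internal counters on a statistics handler (white-box) and checks it from outside with a simple probe (black-box). It is a minimal sketch under stated assumptions, not how Google's systems are built: the `/statsz` path, the counter names, and the `black_box_probe` and `demo` helpers are all hypothetical names chosen for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; the names are illustrative only.
STATS = {"queries_total": 0, "errors_total": 0}
STATS_LOCK = threading.Lock()

# Opener that ignores proxy environment variables so loopback requests stay local.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/statsz":  # hypothetical white-box endpoint
            with STATS_LOCK:
                body = json.dumps(STATS).encode()
            self._reply(200, body, "application/json")
            return
        # Any other path counts as a user-facing query.
        with STATS_LOCK:
            STATS["queries_total"] += 1
        self._reply(200, b"hello", "text/plain")

    def _reply(self, status, body, content_type):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def black_box_probe(url, timeout=2.0):
    """Black-box check: does the service answer as a user would expect?"""
    try:
        with OPENER.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False


def demo():
    """Start the toy service, probe it once, then read its internal stats."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{port}"
    try:
        ok = black_box_probe(base + "/")
        with OPENER.open(base + "/statsz", timeout=2.0) as resp:
            stats = json.loads(resp.read())
        return ok, stats
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the probe result together with the counters read from the statistics handler. The probe only learns whether a request succeeded, while the statistics handler reports what the process has counted internally, which is the distinction the two definitions draw.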
