
Seeking SRE

Name: Seeking SRE
Author: David N. Blank-Edelman
ISBN: 9781491978863


September 2018

Intermediate to advanced

587 pages

17h 34m

English

O'Reilly Media, Inc.


Introduction
And So It Begins... · Origin Story · Voices · Forward in All Directions! · Acknowledgments
I. SRE Implementation
1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
Interviewing 101 · Who Is Involved · Industry Versus University · Biases · The Funnel · SRE Funnels · Phone Screens · The Onsite Interview · Take-Home Questions · Advice for Hiring Managers · Final Thoughts on Interviewing SREs · Further Reading
3. So, You Want to Build an SRE Team?
Choose SRE for the Right Reasons · Orienting to a Data-Driven Approach · Commitment to SRE · Making a Decision About SRE
4. Using Incident Metrics to Improve SRE at Scale
The Virtuous Cycle to the Rescue: If You Don’t Measure It… · Metrics Review: If a Metric Falls in the Forest… · Surrogate Metrics · Repair Debt · Virtual Repair Debt: Exorcising the Ghost in the Machine · Real-Time Dashboards: The Bread and Butter of SRE · Learnings: TL;DR · Further Reading
5. Working with Third Parties Shouldn’t Suck
Build, Buy, or Adopt? · Establish Importance · Identify Stakeholders · Make a Decision · Acknowledge Reality · Third Parties as First-Class Citizens · When They’re Down, You’re Down · Running the Black Box Like a Service · Service-Level Indicators, Service-Level Objectives, and SLAs · Playbook: From Staging to Production · Closing Thoughts
6. How to Apply SRE Principles Without Dedicated SRE Teams
SREs to the Rescue! (and How They Failed) · A Matter of Scale in Terms of Headcount · The Embedded SRE · You Build It, You Run It · The Deployment Platform · Closing the Loop: Take Your Own Pager · Introducing Production Engineering · Some Implementation Details · Developers’ Productivity and Health Versus the Pager · Resolving Cross-Team Reliability Issues by Using Postmortems · Uniform Infrastructure and Tooling Versus Autonomy and Innovation · Getting Buy-In · Conclusion · Further Reading
7. SRE Without SRE: The Spotify Case Study
Tabula Rasa: 2006–2007 · Prelude · Key Learnings · Beta and Release: 2008–2009 · Prelude · Bringing Scalability and Reliability to the Forefront · Key Learnings · The Curse of Success: 2010 · Prelude · A New Ownership Model · Formalizing Core Services · Blessed Deployment Time Slots · On-Call and Alerting · Spawning Off Internal Office Support · Addressing the Remaining Top Concerns · Creating Detectives · Key Learnings · Pets and Cattle, and Agile: 2011 · Prelude · Forming Bad Habits · Breaking Those Bad Habits · Key Learnings · A System That Didn’t Scale: 2012 · Prelude · Manual Work Hits a Cliff · Key Learnings · Introducing Ops-in-Squads: 2013–2015 · Prelude · Building on Trust · Driving the Paradigm Shift · Key Learnings · Autonomy Versus Consistency: 2015–2017 · Prelude · Benefits · Trade-Offs · Key Learnings · The Future: Speed at Scale, Safely
8. Introducing SRE in Large Enterprises
Background · Introducing SRE · Defining Current State · Identifying and Educating Stakeholders · Presenting the Business Case · Implementing the SRE Team · Lessons Learned · Sample Implementation Roadmap · Closing Thoughts · Further Reading

9. From SysAdmin to SRE in 8,963 Words
Clarifying Terminology · Service-Level Indicator · SLA · Service-Level Objective · Establishing SLAs for Internal Components · Understanding External Dependencies · Nontechnical Solutions · Tracking Availability Level · Dealing with Corner Cases · Conclusion
10. Clearing the Way for SRE in the Enterprise
Toil, the Enemy of SRE · Toil in the Enterprise · Silos, Queues, and Tickets · Silos Get in the Way · Ticket-Driven Request Queues Are Expensive · Take Action Now · Start by Leaning on Lean · Get Rid of as Many Handoffs as Possible · Replace Remaining Handoffs with Self-Service · Self-Service Is More Than a Button · Self-Service Helps SREs in Multiple Ways · Operations as a Service · Error Budgets, Toil Limits, and Other Tools for Empowering Humans · Error Budgets · Toil Limits · Leverage Existing Enthusiasm for DevOps · Unify Backlogs and Protect Capacity · Psychological Safety and Human Factors · Join the Movement
11. SRE Patterns Loved by DevOps People Everywhere
Pattern 1: Birth of Automated Testing at Google · Pattern 2: Launch and Handoff Readiness Review at Google · Pattern 3: Create a Shared Source Code Repository · Conclusion · Further Reading and Source Material
12. DevOps and SRE: Voices from the Community
Background · Method · Results · Replies
13. Production Engineering at Facebook
II. Near Edge SRE
14. In the Beginning, There Was Chaos
The Problem with Systems · Economic Pillars of Complexity · Beginning Chaos · Navigating Complexity for Safety · Chaos Goes Big · Formalization · Advanced Principles · Frequently Asked Questions · Conclusion
15. The Intersection of Reliability and Privacy
The Intersection of Reliability and Privacy · The General Landscape of Privacy Engineering · Privacy and SRE: Common Approaches · Reducing Toil · Efficient and Deliberate Problem Solving · Relationship Management · Early Intervention and Education Through Evangelism · Nuances, Differences, and Trade-Offs · Conclusion · Further Reading
16. Database Reliability Engineering
Guiding Principles of the Database Reliability Engineer · Protect the Data · Self-Service for Scale · Databases Are Not Special · A Culture of Database Reliability Engineering · Recoverability · Considerations for Recovery · Anatomy of a Recovery Strategy · Building Block 1: Detection · Building Block 2: Diverse Storage · Building Block 3: A Varied Toolbox · Building Block 4: Testing · Championing Recovery Reliability · Continuous Delivery: From Development to Production · Education and Collaboration · Collaboration · Deployment · Migrations and Versioning · Impact Analysis · Migration Patterns · Championing CD · Making the Case for DBRE · Further Reading
17. Engineering for Data Durability
Replication Is Table Stakes · Backups · Replication · Real-World Durability · Isolation · Protection · Testing · Safeguards · Recovery · Verification · The Power of Zero · Verification Coverage · Watching the Watchers · Automation · Window of Vulnerability · Operator Fatigue · Reliability · Conclusion
18. Introduction to Machine Learning for SRE
Why Use Machine Learning for SRE? · Why and How Should My Company Be Engaging in This? · Some SRE Problems Machine Learning Can Help Solve · The Awakening of Applied AI · What Is Machine Learning? · What Do We Mean by Learning? · From Chess to Go: How Deep Can We Dive? · Why Now? What Changed for Us? · What Are Neural Networks? · Neurons and Neural Networks · How and When Should We Apply Neural Networks? · What Kinds of Data Can We Use? · Practical Machine Learning · Popular Libraries for Neural Networks · Practical Machine Learning Examples · Success Stories · Further Reading · My GitHub Repository · Recommended Books
III. SRE Best Practices and Technologies
19. Do Docs Better: Integrating Documentation into the Engineering Workflow
Defining Quality: What Do Good Docs Look Like? · Functional Requirements for SRE Documentation · Integrating Docs into the Engineering Workflow · The Google Experience: g3doc and EngPlay · What We Learned · Doing Docs Better: Best Practices · Create Templates for Each Documentation Type · Better > Best: Set Realistic Standards for Quality · Require Docs as Part of Code Review · Ruthlessly Prune Your Docs · Recognize and Reward Documentation · Communicating the Value of Documentation · Further Reading
20. Active Teaching and Learning
Active Learning · Active Learning Example: Wheel of Misfortune · Active Learning Example: Incident Manager (a Card Game) · Active Learning Example: SRE Classroom · The Costs of Failing to Learn · Learning Habits of Effective SRE Teams · Production Meetings · Postmortems · A Call to Action: Ditch the Boring Slides
21. The Art and Science of the Service-Level Objective
Why Set Goals? · Availability · Time Quanta · Transactions · Transactions over Time Quanta · On Evaluating SLOs · Histograms · Where Percentiles Fall Down (and Histograms Step Up) · Parting Thought: Looking at SLOs Upside Down · Further Reading
22. SRE as a Success Culture
Where Did SRE Come From? · Key Values for SRE · Keeping the Site Up · Empowering Teams to “Do the Right Thing” · Approaching Operations as an Engineering Problem · Achieving Business Success Through Promises (Service Levels) · Critical Enabling Functions of SRE · Monitoring, Metrics, and KPIs · Incident Management and Emergency Response · Capacity Planning and Demand Forecasting · Performance Analysis and Optimization · Provisioning, Change Management, and Velocity · Phases of SRE Execution · Phase 1: Firefighting/Reactive · Phase 2: Gatekeepers · Phase 3: Advocates/Partners · Phase 4: Catalytic · Complications of Differing Phases · Focus on the Details of Success · Further Reading
23. SRE Antipatterns
Antipattern 1: Site Reliability Operations · Antipattern 2: Humans Staring at Screens · Antipattern 3: Mob Incident Response · Antipattern 4: Root Cause = Human Error · Antipattern 5: Passing the Pager · Antipattern 6: Magic Smoke Jumping! · Antipattern 7: Alert Reliability Engineering · Antipattern 8: Hiring a Dog-Walker to Tend Your Pets · Antipattern 9: Speed-Bump Engineering · Antipattern 10: Design Chokepoints · Antipattern 11: Too Much Stick, Not Enough Carrot · Antipattern 12: Postponing Production · Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR) · Antipattern 14: Dependency Hell · Antipattern 15: Ungainly Governance · Antipattern 16: Ill-Considered SLOh-Ohs · Antipattern 17: Tossing Your API Over the Firewall · Antipattern 18: Fixing the Ops Team · So, That’s It, Then?
24. Immutable Infrastructure and SRE
Scalability, Reliability, and Performance · Failure Recovery · Simpler Operations · Faster Startup Times · Known State · Continuous Integration/Continuous Deployment with Confidence · Security · Multiregion Operations · Release Engineering · Building the Base Image · Deploying Applications · Disadvantages · Conclusion
25. Scriptable Load Balancers
Scriptable Load Balancers: The New Kid on the Block · Why Scriptable Load Balancers? · Making the Difficult Easy · Shard-Aware Routing · Harnessing Potential · Case Study: Intermission · Service-Level Middleware · Middleware to the Rescue · APIs of Service-Level Middleware · Case Study: WAF/Bot Mitigation · Avoiding Disaster · Getting Clever with State · Case Study: Checkout Queue · Looking to the Future and Further Reading
26. The Service Mesh: Wrangler of Your Microservices?
Ready to Get Rid of the Monolith? · Current State of Microservice Networking · Service Mesh to the Rescue · The Benefits of a Sidecar Proxy · Eventually Consistent Service Discovery · Observability and Alarming · Sidecar Performance Implications · Thin Libraries and Context Propagation · Configuration Management (Control Plane Versus Data Plane) · The Service Mesh in Practice · The Origin and Development of Envoy at Lyft · Operating Envoy at Lyft · The Future of the Service Mesh · Further Reading
IV. The Human Side of SRE
27. Psychological Safety in SRE
The Primary Indicator of a Successful Team · How to Build Psychological Safety into Your Own Team · Further Reading
28. SRE Cognitive Work
Introduction · What Do SRE People Do? · Why Should We Care About Practitioner Cognition? · Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted · Human Performance in Modern Complex Systems: The Main Themes · Observations on SRE Cognitive Work Around Incidents · Every Incident Could Have Been Worse · Sacrifice Decisions Take Place Under Uncertainty · Repairs to Functional Systems · Special Knowledge About Complex Systems · Managing the Costs of Coordination · SREs Are Cognitive Agents Working in a Joint Cognitive System · The Calibration Problem · Mental Models · Incidents Trigger Individual Recalibration · Incidents Are Opportunities for Collective Recalibration · What Are the Implications of All This? · Incidents Will Continue · Incidents Will Impose Costs · Incident Patterns Will Change · Incidents Point to Specific Calibration Problems and Locations · What Should Happen Next? · Build a Corpus of Cases · Focus on Making Automation a Team Player in SRE Work · Address the Calibration Problem · What Can You Do? · Conclusion · References
29. Beyond Burnout
Defining Mental Disorders · Mental Disorders Are Missing from the Diversity Conversation · Sanity Isn’t a Business Requirement · Thoughts and Prayers Aren’t Scalable · Full-Stack Inclusivity · Application · Interviewing · Compensation · Benefits · Onboarding · Working Conditions · Job Duties · Training · Promotion · Leaving · Inclusivity for Anyone Helps Everyone · Mental Disorder Resources
30. Against On-Call: A Polemic
The Rationale for On-Call · First, Do No Harm · Parallels with SRE · Differences with SRE · Underlying Assumptions Driving On-Call for Engineers · On-Call Is Emergency Medicine Instead of Ward Medicine · Counterarguments · The Cost to Humans of Doing On-Call · We don’t need another hero · Actual Solutions · Training · Prioritization · Improving On-the-Job Performance · We Need a Fundamental Change in Approach · Strong-Anti-On-Call · Weak-Anti-On-Call · A Union of the Two · Conclusion
31. Elegy for Complex Systems
The Computer and Human Systems Cannot Be Separated · Decoherence and Cascading Failure · Always in a State of Partial Failure · Novelty Priority Inversion · Nobody Anticipates the Overhead of Coordination · Your healthcare.gov Is Out There · To Get Involved · Further Reading
32. Intersections Between Operations and Social Activism
Before, During, After · Creating the Perfect Plan · Principles of Organizing · Managing Crisis: Responding When Things Break Down · Writing Our Own History: Making Sense of What Went Down · The Long Tail: Turning Action into Change · Activism and Change Within a Company · Conclusion
33. Conclusion
Index

Content preview from Seeking SRE

Chapter 14. In the Beginning, There Was Chaos

Casey Rosenthal, Backplane.io (formerly Netflix)

Services go down and people have a bad time. Customers who rely on the service become frustrated, other systems that rely on the service stop working, and the people responsible for the system are paged. History suggests¹ that even the most celebrated online services are vulnerable to outages, even with hundreds and sometimes thousands of people dedicated to their operation and uptime. As software inexorably increases in complexity,² old methods of preventing errors and outages prove insufficient.

In the not-so-distant past, best practices around testing, code style, and process gave us confidence that the code that we wrote and deployed would do what we expected it to do. We believe that practices like rigorous testing, Test-Driven Development (TDD), Agile feedback loops, pair programming, and many others can help reduce bugs in the long run. Practices like these are still very important, but they are not sufficient for engineering modern complex systems.

New best practices are needed to give us confidence again in the systems that we build. Best practices are emerging to meet this need, and chaos engineering is among them. Chaos engineering is a new discipline pioneered at Netflix specifically designed to optimize for availability in complex, distributed systems. We can have our confidence, and engineer it, too.

I ran the Chaos Team at Netflix for three years, during the period when ...


Publisher Resources

ISBN: 9781491978856
