book

Observability Engineering, 2nd Edition

by Charity Majors, Liz Fong-Jones, George Miranda

June 2026

Intermediate to advanced

628 pages

18h 11m

English

O'Reilly Media, Inc.

Foreword
Preface
Why We Wrote This Book; What’s Different in the Second Edition; How to Read This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments for the First Edition; Acknowledgments for the Second Edition
I. Introduction to Observability
1. What Is Observability?
The Origins of Observability; Applying Observability to Software Systems; Properties of Software Dependability; Observability is a Property of Dependable Software; How Observable is Your Software?; Two Competing Models for Telemetry: Three Pillars Versus Unified Data; The Three Pillars Model; The Unified Storage Model; Observability is the Validation of Developer Intent; The Agentic Incursion Has Just Begun; Guardrails Are Having a Moment; Conclusion
2. How Code Crosses Over: Validating Developer Intent in Production
What Makes Software Good?; Good Software Serves Its Purpose; Good Software Delivers Efficiently Over Time; The Maintenance Horizon: Disposable Code Versus Durable Code; Durable Code Is a Model for Mutating and Maintaining Code in Place; Disposable Code Is a Model for Avoiding Maintenance Costs; The Maintenance Horizon Is Not Always Knowable Up Front; Production Quality Code Is a Function of Dependability; Identifying the Critical Path; The Closer You Get to Persisting Data, the More Cautious You Should Be; Development to Production: Tools for Crossing Over; Practice 0: Give Yourself the Gift of Rich Data and Precision Tooling; Practice 1: Build a Feedback Loop Between Developers and Production; Practice 2: Test Your Code Before You Deploy It; Practice 3: Instrument Your Code and Validate in Production; Practice 4: Decouple Deploys from Releases Using Feature Flags; Practice 5: Invest in Progressive Delivery, Canaries, Automated Rollbacks; Bonus Practices: Traffic Splitters, Capture/Replay, Strangler Figs; Observability Is the Feedback Loop of Feedback Loops; The Work of Development Is Not Done Until It’s Working in Production; Conclusion
3. The Origins of Observability in Software
An Introduction from Charity; The Dominant Model of “Observability” Is Just Monitoring, Rebranded; We Lost the Fight To Define Observability; Production Has Become Too Complex for Us to Debug Via Intuition; The Facebook Experience That Showed Us What’s Possible; The Control Theory Definition Made It All Click; The Modern Observability Landscape Is Confusing and Continues to Expand; Costs Are Driving the Need to Change, and AI Is Enabling That Change; Distilling the Lessons That Matter; Conclusion
II. Instrumentation Fundamentals
4. Getting Started with Instrumentation
Instrumentation Basics; The Vocabulary of Telemetry: Logs, Metrics, and Traces; OpenTelemetry: The Universal Language for Observability; Automatic Instrumentation Versus Custom Instrumentation; Ownership of Instrumentation; Building a Custom Instrumentation Strategy; Cost Considerations, Volume Management, and Processing; Processing and Pipelines; Cost Considerations and Volume; Essential Concepts for Sampling; Conclusion
5. Structured Events Are the Building Blocks of Observability
What Is a Structured Event?; The Limitations of Metrics; What Is a Metric?; Metrics Are Typically Aggregated; The Inner Workings of Logs; Turning Traditional Logs Into Structured Logs; Is a Structured Log the Same as a Structured Event?; Tracking a Single Operation; The Inner Workings of Distributed Traces; A Brief Introduction to Distributed Tracing; The Components of Tracing; Turning Logs Into Distributed Traces; Traces Are Collections of Spans; Trace Context; Dynamically Generating the Right System Views; Conclusion
6. Making Structured Events Arbitrarily Wide
Service and Code Context; Service Metadata; Build Information; Feature Flags; Versions of Important Things; Request and Execution Flow; HTTP Information; Route Information; Timings; Async Request Summaries; Errors; User and Business Context; User and Customer Information; Rate Limits; Caching; Localization Information; Operational Information; Uptime; Metrics; A Convention To Filter Out Everything Else; Attributes Important to Your Specific Application; Conclusion

7. Instrumenting Your Code with OpenTelemetry
What It Means to Use OpenTelemetry; Effective Instrumentation With OpenTelemetry; Trace-First Telemetry; Traces For Common Architectural Patterns; Metrics, Spans, Logs, Events—Oh My?; Using AI Agents To Instrument Your Code; What Is An Agent, Anyway?; Instrumentation with Agents; Useful Strategies for Agents; Conclusion
III. Analysis Workflows
8. Getting Started with Observability Analysis
Debugging from Known Conditions; Debugging from First Principles; Using the Core Analysis Loop; Automating the Brute-Force Portion of the Core Analysis Loop; Automating Analysis with Generative AI; Agentic AI Personas; Using Agentic AI for Observability In Practice; Conclusion
9. Observability-Driven Development
Test-Driven Development; Observability in the Development Cycle; Determining Where to Debug; Debugging in the Time of Microservices; How Instrumentation Drives Modern Observability; Shifting Observability Left; Using Observability to Speed Up Software Delivery; Observability-Driven Development with AI; Conclusion
10. The Role of AI Agents for Observability
What Is an AI Agent for Observability?; The Pitfalls of Querying Without Context; Proven Use Cases for Observability Agents; Incident Response; Explaining Errors and Patterns in Telemetry Data; Improving Instrumentation Quality; Observability-Adjacent Use Cases for Agents; The Production Problem: The Mental Model That’s Disappearing; The Need for Context; Conclusion
11. Using Service Level Objectives for Reliability
Threshold-Based Alerting Creates Alert Fatigue; Threshold Alerting Is Only for Known-Unknowns; User Experience Is a North Star; Reliable Alerting with Service Level Objectives; Case Study: Changing Culture Toward SLO-Based Alerts; Accelerating SLO Adoption with Generative AI; From Targets to Implementation: Drafting Service Level Indicator; Encoding Best Practices in Your Prompts; Conclusion
IV. Observability Technical Deep Dives
12. Acting on and Debugging SLO-Based Alerts
Alerting Before Your Error Budget Is Empty; Framing Time as a Sliding Window; Forecasting to Create a Predictive Burn Alert; Threshold-Crossing Alerts; Relative Burn Alerts; Predictive Burn Alerts; The Lookahead Window; The Baseline Window; Acting on SLO Burn Alerts; Using Structured Event Data for SLOs Versus Time-Series Data; Conclusion
13. Efficient Data Storage with Retriever
The Functional Requirements for Observability; Time-Series Databases Are Inadequate for Unified Observability; Other Possible Datastores; Data Storage Strategies; Case Study: The Implementation of Retriever; Partitioning Data by Time; Storing Data by Column Within Segments; Performing Query Workloads; Queries on Parts of Fields and Aggregated Fields; Querying for Traces; Joins; Querying Data in Real Time; Making It Affordable with Tiering; Making It Fast with Parallelism; Dealing with High Cardinality; Scaling and Durability Strategies; Notes on Building Your Own Efficient Datastore; Conclusion
14. Efficient Data Storage with ClickHouse
ClickHouse Core Concepts; MergeTree Fundamentals; Query Execution; Collecting Data; The OpenTelemetry Collector and ClickHouse Exporter; Alternate Ingestion Approaches; Querying Data; Simple Queries; Complex Joins; Scaling Vertically; Profiling Your Queries; Practical Data Modelling and Optimization; Data Lifecycle Management; Scaling Horizontally; Replication; Sharding; Single-Region Observability Cluster; Multiregion Observability Cluster; SharedMergeTree; Visualizing Your Data; Using ClickHouse for Observability Workloads; Conclusion
15. Cheap and Accurate Enough Sampling
Sampling to Refine Your Data Collection; Using Different Approaches to Sampling; Constant-Probability Sampling; Sampling on Recent Traffic Volume; Sampling Based on Event Content (Keys); Combining per Key and Historical Methods; Choosing Dynamic Sampling Options; When to Make a Sampling Decision for Traces; Translating Sampling Strategies into Code; The Base Case; Fixed-Rate Sampling; Recording the Sample Rate; Consistent Sampling; Target-Rate Sampling; Having More Than One Static Sample Rate; Sampling by Key and Target Rate; Sampling with Dynamic Rates on Arbitrarily Many Keys; Putting It All Together: Head and Tail per Key Target-Rate Sampling; Conclusion
16. Telemetry Management with Pipelines
An Introduction to Telemetry Pipelines; The Telemetry Pipeline Solution; Why This Matters Now; Core Functions of a Modern Telemetry Pipeline; Collect; Normalize and Secure; Enrich; Reduce; Route; Data Resilience; Pipeline Control and Observability; What You Should Remember About Core Functions; Pipeline Architecture in Practice; Core Components; Deployment Patterns; Scaling and Performance in Practice; Build-Versus-Buy Considerations; What You Should Remember About Pipeline Architecture; Adoption and Migration; A Phased Path to Adoption; What You Should Remember About Adoption and Migration; Use Cases: Collect and Reduce, Combined; Business Case; Cost Control; Vendor Neutrality; Telemetry as a Strategic Asset; The Role of Telemetry Pipelines; Conclusion
17. Ontologies as a Shared Language for Humans and AI
Ontologies and Their Role in Observability; Design the Ontology; Define the Core Entities (the Nouns); Define the Invariants (the Rules); Visualize the Semantic Grammar of the Domain; Glue the Schema Through Metadata; Schematize Intent with the ActionPlan; Validate the Contract: Continuous Integration; Layer Three Gates for Defense; Establish the Team Workflow; Create Signal Parity in Production: Continuous Deployment; Understand the Hierarchy of Signals; Create Shared Instrumentation: The Universal Payload; Implement the AI Sandwich Architecture; Close the Loop: Production Driving Tests; Putting Ontologies into Practice; Conclusion
V. Observability Use Cases
18. Observability for CI/CD Pipelines
Why Reliable and Fast CI/CD Matters; Build Observability Has the Best ROI of All Observability Applications; CI/CD Through the Lens of Observability; The Ontology of Continuous Integration and Deployment; Instrumentation Basics; Defining Service Level Indicators and Achieving Predictability; From Jobs, to Directed Acyclic Graphs and Traces; Making Improvements and Measuring Them; Understanding Performance: Treating Continuous Integration Like Production; Predictability: Real-World Trade-Offs; (The Lack of) Incrementality: The Bane of Continuous Integration’s Existence; Keeping Your Build Performance Tight and Your Developers Happy; Case Study: The Importance of Quick Build Times at Honeycomb; History of Improving Build Times at Honeycomb; Applying These Lessons to Your Build System; Conclusion
19. Observability for Mobile and Frontend
Status Quo for Mobile and Frontend; Instrumentation Limitations; Overcoming Domain Challenges; Recognizing the Problem; Instrumentation Difficulties; Applying Local Storage and Real-Time Control to Mobile Observability; What to Observe and Why; Opportunities for Improvement; Making Observability User-Focused; Quantifying User Experience; Adapting Existing Approaches; Applying User-Focused Observability; Iterative Analysis Process; Example: Food Delivery App; Mobile and Frontend Applications Deserve Observability; Conclusion
20. Performance Engineering with Observability
The Case for Performance Engineering; Building a Performance Engineering Practice; Optimizing Cost Without Modifying Application Code; Infrastructure Purchasing Models; Fleet-Wide Optimization; Cost Optimizing Kubernetes; Cost Optimizing Serverless; Observing the Costs; How Application Observability Reduces Cost; Using CPU Profiling Tools; Using the Correct Observability Signals, Together; Conclusion
21. Observability for Large Language Models
Why Observability Matters for LLMs; Using Evaluations for LLM Reliability; Designing Telemetry for LLM Applications; Analyzing Telemetry for AI Applications; Feeding Observability Data Into AI Application Development; Using Evaluations and Observability Together; Conclusion
22. An Intercom Case Study in Modern Engineering
Increasing Resolution Rate; Speeding Up Without Losing Efficiency; And Thus, “Time to First Token” Was Born; Speeding Fin Up; Finance Enters the Chat; Empathy; Conclusion
VI. Observability Governance
23. Organizational Learning Speed is Now Your Biggest Constraint: An Open Letter to CTOs
An Open Letter to CTOs; The Sociotechnical Debts That Will Hold You Back; When Developers Live In a World of Tests, Not Reality; When Telemetry Gets Treated Like Infrastructure, Not Product; When Developer Tools Were Never Designed as Products; These Debts Will Sabotage Your Adoption of AI; Turning the Ship Around; Measuring Value; Build Good Feedback Loops; Change Actions That Drive this Strategy; A Letter From a CTO: How Intercom Engineering Optimizes for Learning; Conclusion
24. Systems Thinking for Software Delivery
Sociotechnical Systems; You Must Optimize Both the Social and Technical At the Same Time; Changing Information Flows; How Feedback Loops Drive Change; Amplifying Feedback Loops Accelerate Change; Balancing Feedback Loops Create Stability and Equilibrium; Small Shifts Can Trigger Massive Changes; The Difference Between a Virtuous Cycle and a Death Spiral is the Ability to Self-Correct; Feedback Loops in Software Delivery Systems; Observability in Amplifying Loops; Observability in Balancing Loops; Partial Feedback Creates Systemic Distortions; Leverage Points in Sociotechnical Systems; Leverage Points and the Limits of Observability; Push in the Right Direction; Conclusion
25. The Observability Landscape Through a Systems Lens
The Landscape Feels Noisy Because the Labels Are Noisy; The Loops Most Organizations Run Today (and What Is Missing); Development Feedback Loops; Operational Feedback Loops; The Missing Loop; The Feedback Loop for Value Creation; Shipping Is Your Heartbeat; How to Build the Loop and Close the Gap; Why Closing the Gap Is Rare: The Economics of Cognitive Load; Two Observability Models, Two Feedback Loops; Three-Pillars Model (Built for Operational Outcomes); Unified Storage Model (Built for Developer Learning); AI Changes the Game and Opens New Interaction Models; Both Feedback Loops Matter: Lead with the Right One; Align on the Outcomes You’re Building Toward; Conclusion
26. The Business Case for Observability
Identifying Your Priorities; Complementary Roles of the Two Feedback Loops; Read the Fire Codes; The Business Case for Operational Loops; Mapping Observability Models to Operational Outcomes; The Operational Mandate; The Business Case for Developer Learning Loops; Estimating the Value of Developer Learning; The Developer Mandate Has Two Halves; Strategic Investment or Cost Center Optimization?; Observability as Organizational Investment; Conclusion
27. Diagnosing Your Observability Investment
Don’t Pay Observability Prices for Monitoring Outcomes; The Firefighting Trap; An Investment Posture Mismatch; Activities Versus Learning; How to Know If Your Investment Is Working (or Not); Advice for Observability Investments; Conclusion
28. The Organizational Shift
Recognizing Legacy Masquerading as Modern Observability; The Ownership Test; The Two-or-Three People Test; The Mystery Test; The Arbitrary Question Test; The Deployment Confidence Test; After the Tests; Understanding the Resistance; The Organizational Immune Response; Why Legacy Vendors Fight; Why Existing Teams Resist; Why Leadership Is Skeptical; Why You Cannot Fight It Alone; Building the Coalition; Find the People Whose Pain Is Immediate; Honor the Heroes Who Held It Together; What Makes a Good Ally; Building the Case Together; Securing the Mandate; Sponsorship Versus Authority; Finding the Right Executive; The Pitch That Works; The AI Forcing Function; The Roadmap; Start Small: One Domain, Full Depth; Pave the Path: Make the New Way Easier; Platform Engineering Principles; Demonstrate Wins: Rerun the Tests; Expand Deliberately; Building the Team; What the Observability Team Owns; Capabilities of the Observability Team; Bringing the Existing Team Along; Breaking the Vendor Identity Trap; What Not to Promise; Conclusion
29. Build Versus Buy (Versus Open Source)
Which Do You Hate More: Your Money or Your Time?; Surfacing Our Biases; Observability Is Not Like Other Software; A Simple Framework for Evaluating “Build Versus Buy”; The Four Quadrants: Build, Buy, Should Not Exist, or Decide; Mapping Examples to Quadrants; The Generalized Pattern; What About Open Source?; A Framework That Includes Open Source; The Four Quadrants: Avoid, Create, Adopt, Acquire; There’s Always Vibe Coding; The Real Costs of Building Your Own; The Hidden Costs of “Free” Software; The Benefits of Building Your Own; The Risks of Building Your Own; The Real Costs of Buying Software; The Hidden Financial Costs of Commercial Software; The Hidden Nonfinancial Costs of Commercial Software; The Benefits of Buying Commercial Software; The Risks of Buying Commercial Software; Build Versus Buy Is Not a Binary Choice; This Is Where Your Observability Team Comes In; Get the Best of Both Worlds; Important Caveats; Conclusion
30. The Art and Science of Vendor Partnerships
Does Buying Software Even Count as Engineering?; The Difference Between Vendor Engineering and Buying Software; Training Grounds for Executive Skills; What to Know Before You Start; Find Local Guidance; Understand Your Role; Mapping Your Stakeholders; Engineering; Management; Finance; IT/Ops, and Security; Procurement; Communicating up the Ladder; Trust and Credibility with Internal Stakeholders; Credibility Versus Trust; Building Trust with Stakeholders; Breaking Trust with Stakeholders; Trust and Reciprocity with External Stakeholders; Building Vendor Relationships; The Vendor Trust Coefficient; The Reciprocity Principle; How to Influence Another Company’s Roadmap; SLAs Encode the Vendor–Customer Relationship; Early-Stage Considerations; Define the Scope of Work; Identify Your Executive Sponsor; Define Success Criteria Before You Start; What to Do When Your Coworkers Perceive You as a Threat; The OpenTelemetry Decision; Design and Run a Meaningful Proof of Concept; What Does a Good Proof of Concept Look Like?; Structure Your Technical Evaluation; Validate Pricing at Scale (If Cost Is a Concern); Build Champions and Advertise Your Wins; Maintain Momentum Through Procurement; Migration and Follow-Through; Handing Off to Another Team; Most Migrations Have Three Steps; You Aren’t Done Until You Decommission; Plan for the Next Transition; Conclusion
31. Instrumentation for Observability Teams
What Observability Teams Need To Know; Semantic Conventions; Telemetry Schemas in Practice; Tooling, Migrations, and Paved Paths; Tools and Frameworks; Paved Paths and Special Hats; Instrumentation in Highly Secure Environments; Use Compliance Tiers to Your Advantage; A Minimal Set of Capabilities for Observability Teams; Valuable Capabilities In a Regulated Environment; This Is Hard Work, But Work Worth Doing; Conclusion
32. Where Do We Go From Here?
Developers Must Return to Production; The Changing Cost Model of Software; The Most Important Parts of Our System Have Never Been Specified; The Great Unbundling; The Human Cost of Compressed Change; The Claim: “AI Is Unprecedented”; This Moment Has Many Precedents; The Sysadmins Went Through This; From Software Code to Software Systems; Values Backed By Durability; The Tools to Build Durable Systems Will Come; Conclusion
Index
About the Authors

Content preview from Observability Engineering, 2nd Edition

Chapter 12. Acting on and Debugging SLO-Based Alerts

In the preceding chapter, we introduced SLOs and an SLO-based approach to monitoring for more effective alerting. This chapter closely examines how observability data is used to make those alerts both actionable and debuggable. SLOs that use threshold-based monitoring data—or metrics—create alerts that are not actionable because they don’t provide guidance on fixing the underlying issue. However, using wide, structured event data for SLOs makes them both more precise and more debuggable.

Regardless of the degree of observability in your systems, using SLOs to drive alerting can be a productive way to make alerting less noisy and more actionable. SLIs can be defined to measure customer experience of a service in ways that directly align with business objectives. Error budgets set clear expectations between business stakeholders and engineering teams. Error budget burn alerts enable teams to ensure a high degree of customer satisfaction, align with business goals, and initiate an appropriate response to production issues—without the type of cacophony common in symptom-based alerting, where an excessive alert storm is the norm.

In this chapter, we dive deep into SLOs to examine error budgets and the mechanisms available to trigger SLO-based alerts. We’ll break down what an SLO error budget is and how it works; the forecasting calculations available for predicting SLO error budget exhaustion; and why it’s necessary to use wide, ...

Publisher Resources

ISBN: 9781098179915
