
Observability Engineering

by Charity Majors, Liz Fong-Jones, George Miranda

May 2022

Intermediate to advanced

318 pages

9h 15m

English

O'Reilly Media, Inc.

Foreword
Preface
    Who This Book Is For
    Why We Wrote This Book
    What You Will Learn
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
I. The Path to Observability
1. What Is Observability?
    The Mathematical Definition of Observability
    Applying Observability to Software Systems
    Mischaracterizations About Observability for Software
    Why Observability Matters Now
    Is This Really the Best Way?
    Why Are Metrics and Monitoring Not Enough?
    Debugging with Metrics Versus Observability
    The Role of Cardinality
    The Role of Dimensionality
    Debugging with Observability
    Observability Is for Modern Systems
    Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
    How Monitoring Data Is Used for Debugging
    Troubleshooting Behaviors When Using Dashboards
    The Limitations of Troubleshooting by Intuition
    Traditional Monitoring Is Fundamentally Reactive
    How Observability Enables Better Debugging
    Conclusion
3. Lessons from Scaling Without Observability
    An Introduction to Parse
    Scaling at Parse
    The Evolution Toward Modern Systems
    The Evolution Toward Modern Practices
    Shifting Practices at Parse
    Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
    Cloud Native, DevOps, and SRE in a Nutshell
    Observability: Debugging Then Versus Now
    Observability Empowers DevOps and SRE Practices
    Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
    Debugging with Structured Events
    The Limitations of Metrics as a Building Block
    The Limitations of Traditional Logs as a Building Block
    Unstructured Logs
    Structured Logs
    Properties of Events That Are Useful in Debugging
    Conclusion
6. Stitching Events into Traces
    Distributed Tracing and Why It Matters Now
    The Components of Tracing
    Instrumenting a Trace the Hard Way
    Adding Custom Fields into Trace Spans
    Stitching Events into Traces
    Conclusion

7. Instrumentation with OpenTelemetry
    A Brief Introduction to Instrumentation
    Open Instrumentation Standards
    Instrumentation Using Code-Based Examples
    Start with Automatic Instrumentation
    Add Custom Instrumentation
    Send Instrumentation Data to a Backend System
    Conclusion
8. Analyzing Events to Achieve Observability
    Debugging from Known Conditions
    Debugging from First Principles
    Using the Core Analysis Loop
    Automating the Brute-Force Portion of the Core Analysis Loop
    This Misleading Promise of AIOps
    Conclusion
9. How Observability and Monitoring Come Together
    Where Monitoring Fits
    Where Observability Fits
    System Versus Software Considerations
    Assessing Your Organizational Needs
    Exceptions: Infrastructure Monitoring That Can’t Be Ignored
    Real-World Examples
    Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
    Join a Community Group
    Start with the Biggest Pain Points
    Buy Instead of Build
    Flesh Out Your Instrumentation Iteratively
    Look for Opportunities to Leverage Existing Efforts
    Prepare for the Hardest Last Push
    Conclusion
11. Observability-Driven Development
    Test-Driven Development
    Observability in the Development Cycle
    Determining Where to Debug
    Debugging in the Time of Microservices
    How Instrumentation Drives Observability
    Shifting Observability Left
    Using Observability to Speed Up Software Delivery
    Conclusion
12. Using Service-Level Objectives for Reliability
    Traditional Monitoring Approaches Create Dangerous Alert Fatigue
    Threshold Alerting Is for Known-Unknowns Only
    User Experience Is a North Star
    What Is a Service-Level Objective?
    Reliable Alerting with SLOs
    Changing Culture Toward SLO-Based Alerts: A Case Study
    Conclusion
13. Acting on and Debugging SLO-Based Alerts
    Alerting Before Your Error Budget Is Empty
    Framing Time as a Sliding Window
    Forecasting to Create a Predictive Burn Alert
    The Lookahead Window
    The Baseline Window
    Acting on SLO Burn Alerts
    Using Observability Data for SLOs Versus Time-Series Data
    Conclusion
14. Observability and the Software Supply Chain
    Why Slack Needed Observability
    Instrumentation: Shared Client Libraries and Dimensions
    Case Studies: Operationalizing the Supply Chain
    Understanding Context Through Tooling
    Embedding Actionable Alerting
    Understanding What Changed
    Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
    How to Analyze the ROI of Observability
    The Real Costs of Building Your Own
    The Hidden Costs of Using “Free” Software
    The Benefits of Building Your Own
    The Risks of Building Your Own
    The Real Costs of Buying Software
    The Hidden Financial Costs of Commercial Software
    The Hidden Nonfinancial Costs of Commercial Software
    The Benefits of Buying Commercial Software
    The Risks of Buying Commercial Software
    Buy Versus Build Is Not a Binary Choice
    Conclusion
16. Efficient Data Storage
    The Functional Requirements for Observability
    Time-Series Databases Are Inadequate for Observability
    Other Possible Data Stores
    Data Storage Strategies
    Case Study: The Implementation of Honeycomb’s Retriever
    Partitioning Data by Time
    Storing Data by Column Within Segments
    Performing Query Workloads
    Querying for Traces
    Querying Data in Real Time
    Making It Affordable with Tiering
    Making It Fast with Parallelism
    Dealing with High Cardinality
    Scaling and Durability Strategies
    Notes on Building Your Own Efficient Data Store
    Conclusion
17. Cheap and Accurate Enough: Sampling
    Sampling to Refine Your Data Collection
    Using Different Approaches to Sampling
    Constant-Probability Sampling
    Sampling on Recent Traffic Volume
    Sampling Based on Event Content (Keys)
    Combining per Key and Historical Methods
    Choosing Dynamic Sampling Options
    When to Make a Sampling Decision for Traces
    Translating Sampling Strategies into Code
    The Base Case
    Fixed-Rate Sampling
    Recording the Sample Rate
    Consistent Sampling
    Target Rate Sampling
    Having More Than One Static Sample Rate
    Sampling by Key and Target Rate
    Sampling with Dynamic Rates on Arbitrarily Many Keys
    Putting It All Together: Head and Tail per Key Target Rate Sampling
    Conclusion
18. Telemetry Management with Pipelines
    Attributes of Telemetry Pipelines
    Routing
    Security and Compliance
    Workload Isolation
    Data Buffering
    Capacity Management
    Data Filtering and Augmentation
    Data Transformation
    Ensuring Data Quality and Consistency
    Managing a Telemetry Pipeline: Anatomy
    Challenges When Managing a Telemetry Pipeline
    Performance
    Correctness
    Availability
    Reliability
    Isolation
    Data Freshness
    Use Case: Telemetry Management at Slack
    Metrics Aggregation
    Logs and Trace Events
    Open Source Alternatives
    Managing a Telemetry Pipeline: Build Versus Buy
    Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
    The Reactive Approach to Introducing Change
    The Return on Investment of Observability
    The Proactive Approach to Introducing Change
    Introducing Observability as a Practice
    Using the Appropriate Tools
    Instrumentation
    Data Storage and Analytics
    Rolling Out Tools to Your Teams
    Knowing When You Have Enough Observability
    Conclusion
20. Observability’s Stakeholders and Allies
    Recognizing Nonengineering Observability Needs
    Creating Observability Allies in Practice
    Customer Support Teams
    Customer Success and Product Teams
    Sales and Executive Teams
    Using Observability Versus Business Intelligence Tools
    Query Execution Time
    Accuracy
    Recency
    Structure
    Time Windows
    Ephemerality
    Using Observability and BI Tools Together in Practice
    Conclusion
21. An Observability Maturity Model
    A Note About Maturity Models
    Why Observability Needs a Maturity Model
    About the Observability Maturity Model
    Capabilities Referenced in the OMM
    Respond to System Failure with Resilience
    Deliver High-Quality Code
    Manage Complexity and Technical Debt
    Release on a Predictable Cadence
    Understand User Behavior
    Using the OMM for Your Organization
    Conclusion
22. Where to Go from Here
    Observability, Then Versus Now
    Additional Resources
    Predictions for Where Observability Is Going
Index
About the Authors

Content preview from Observability Engineering

Chapter 13. Acting on and Debugging SLO-Based Alerts

In the preceding chapter, we introduced SLOs and an SLO-based approach to monitoring that makes for more effective alerting. This chapter closely examines how observability data is used to make those alerts both actionable and debuggable. SLOs that use traditional monitoring data—or metrics—create alerts that are not actionable since they don’t provide guidance on fixing the underlying issue. Further, using observability data for SLOs makes them both more precise and more debuggable.

While independent from practicing observability, using SLOs to drive alerting can be a productive way to make alerting less noisy and more actionable. SLIs can be defined to measure customer experience of a service in ways that directly align with business objectives. Error budgets set clear expectations between business stakeholders and engineering teams. Error budget burn alerts enable teams to ensure a high degree of customer satisfaction, align with business goals, and initiate an appropriate response to production issues without the kind of cacophony that exists in the world of symptom-based alerting, where an excessive alert storm is the norm.

In this chapter, we will examine the role that error budgets play and the mechanisms available to trigger alerts when using SLOs. We’ll look at what an SLO error budget is and how it works, which forecasting calculations are available to predict that your SLO error budget will be exhausted, and why it ...

ISBN: 9781492076438
