Observability Engineering

by Charity Majors, Liz Fong-Jones, George Miranda

May 2022

Intermediate to advanced

318 pages

9h 15m

English

O'Reilly Media, Inc.

Foreword
Preface
    Who This Book Is For
    Why We Wrote This Book
    What You Will Learn
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
I. The Path to Observability
1. What Is Observability?
    The Mathematical Definition of Observability
    Applying Observability to Software Systems
    Mischaracterizations About Observability for Software
    Why Observability Matters Now
    Is This Really the Best Way?
    Why Are Metrics and Monitoring Not Enough?
    Debugging with Metrics Versus Observability
    The Role of Cardinality
    The Role of Dimensionality
    Debugging with Observability
    Observability Is for Modern Systems
    Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
    How Monitoring Data Is Used for Debugging
    Troubleshooting Behaviors When Using Dashboards
    The Limitations of Troubleshooting by Intuition
    Traditional Monitoring Is Fundamentally Reactive
    How Observability Enables Better Debugging
    Conclusion
3. Lessons from Scaling Without Observability
    An Introduction to Parse
    Scaling at Parse
    The Evolution Toward Modern Systems
    The Evolution Toward Modern Practices
    Shifting Practices at Parse
    Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
    Cloud Native, DevOps, and SRE in a Nutshell
    Observability: Debugging Then Versus Now
    Observability Empowers DevOps and SRE Practices
    Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
    Debugging with Structured Events
    The Limitations of Metrics as a Building Block
    The Limitations of Traditional Logs as a Building Block
    Unstructured Logs
    Structured Logs
    Properties of Events That Are Useful in Debugging
    Conclusion
6. Stitching Events into Traces
    Distributed Tracing and Why It Matters Now
    The Components of Tracing
    Instrumenting a Trace the Hard Way
    Adding Custom Fields into Trace Spans
    Stitching Events into Traces
    Conclusion
7. Instrumentation with OpenTelemetry
    A Brief Introduction to Instrumentation
    Open Instrumentation Standards
    Instrumentation Using Code-Based Examples
    Start with Automatic Instrumentation
    Add Custom Instrumentation
    Send Instrumentation Data to a Backend System
    Conclusion
8. Analyzing Events to Achieve Observability
    Debugging from Known Conditions
    Debugging from First Principles
    Using the Core Analysis Loop
    Automating the Brute-Force Portion of the Core Analysis Loop
    This Misleading Promise of AIOps
    Conclusion
9. How Observability and Monitoring Come Together
    Where Monitoring Fits
    Where Observability Fits
    System Versus Software Considerations
    Assessing Your Organizational Needs
    Exceptions: Infrastructure Monitoring That Can’t Be Ignored
    Real-World Examples
    Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
    Join a Community Group
    Start with the Biggest Pain Points
    Buy Instead of Build
    Flesh Out Your Instrumentation Iteratively
    Look for Opportunities to Leverage Existing Efforts
    Prepare for the Hardest Last Push
    Conclusion
11. Observability-Driven Development
    Test-Driven Development
    Observability in the Development Cycle
    Determining Where to Debug
    Debugging in the Time of Microservices
    How Instrumentation Drives Observability
    Shifting Observability Left
    Using Observability to Speed Up Software Delivery
    Conclusion
12. Using Service-Level Objectives for Reliability
    Traditional Monitoring Approaches Create Dangerous Alert Fatigue
    Threshold Alerting Is for Known-Unknowns Only
    User Experience Is a North Star
    What Is a Service-Level Objective?
    Reliable Alerting with SLOs
    Changing Culture Toward SLO-Based Alerts: A Case Study
    Conclusion
13. Acting on and Debugging SLO-Based Alerts
    Alerting Before Your Error Budget Is Empty
    Framing Time as a Sliding Window
    Forecasting to Create a Predictive Burn Alert
    The Lookahead Window
    The Baseline Window
    Acting on SLO Burn Alerts
    Using Observability Data for SLOs Versus Time-Series Data
    Conclusion
14. Observability and the Software Supply Chain
    Why Slack Needed Observability
    Instrumentation: Shared Client Libraries and Dimensions
    Case Studies: Operationalizing the Supply Chain
    Understanding Context Through Tooling
    Embedding Actionable Alerting
    Understanding What Changed
    Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
    How to Analyze the ROI of Observability
    The Real Costs of Building Your Own
    The Hidden Costs of Using “Free” Software
    The Benefits of Building Your Own
    The Risks of Building Your Own
    The Real Costs of Buying Software
    The Hidden Financial Costs of Commercial Software
    The Hidden Nonfinancial Costs of Commercial Software
    The Benefits of Buying Commercial Software
    The Risks of Buying Commercial Software
    Buy Versus Build Is Not a Binary Choice
    Conclusion
16. Efficient Data Storage
    The Functional Requirements for Observability
    Time-Series Databases Are Inadequate for Observability
    Other Possible Data Stores
    Data Storage Strategies
    Case Study: The Implementation of Honeycomb’s Retriever
    Partitioning Data by Time
    Storing Data by Column Within Segments
    Performing Query Workloads
    Querying for Traces
    Querying Data in Real Time
    Making It Affordable with Tiering
    Making It Fast with Parallelism
    Dealing with High Cardinality
    Scaling and Durability Strategies
    Notes on Building Your Own Efficient Data Store
    Conclusion
17. Cheap and Accurate Enough: Sampling
    Sampling to Refine Your Data Collection
    Using Different Approaches to Sampling
    Constant-Probability Sampling
    Sampling on Recent Traffic Volume
    Sampling Based on Event Content (Keys)
    Combining per Key and Historical Methods
    Choosing Dynamic Sampling Options
    When to Make a Sampling Decision for Traces
    Translating Sampling Strategies into Code
    The Base Case
    Fixed-Rate Sampling
    Recording the Sample Rate
    Consistent Sampling
    Target Rate Sampling
    Having More Than One Static Sample Rate
    Sampling by Key and Target Rate
    Sampling with Dynamic Rates on Arbitrarily Many Keys
    Putting It All Together: Head and Tail per Key Target Rate Sampling
    Conclusion
18. Telemetry Management with Pipelines
    Attributes of Telemetry Pipelines
    Routing
    Security and Compliance
    Workload Isolation
    Data Buffering
    Capacity Management
    Data Filtering and Augmentation
    Data Transformation
    Ensuring Data Quality and Consistency
    Managing a Telemetry Pipeline: Anatomy
    Challenges When Managing a Telemetry Pipeline
    Performance
    Correctness
    Availability
    Reliability
    Isolation
    Data Freshness
    Use Case: Telemetry Management at Slack
    Metrics Aggregation
    Logs and Trace Events
    Open Source Alternatives
    Managing a Telemetry Pipeline: Build Versus Buy
    Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
    The Reactive Approach to Introducing Change
    The Return on Investment of Observability
    The Proactive Approach to Introducing Change
    Introducing Observability as a Practice
    Using the Appropriate Tools
    Instrumentation
    Data Storage and Analytics
    Rolling Out Tools to Your Teams
    Knowing When You Have Enough Observability
    Conclusion
20. Observability’s Stakeholders and Allies
    Recognizing Nonengineering Observability Needs
    Creating Observability Allies in Practice
    Customer Support Teams
    Customer Success and Product Teams
    Sales and Executive Teams
    Using Observability Versus Business Intelligence Tools
    Query Execution Time
    Accuracy
    Recency
    Structure
    Time Windows
    Ephemerality
    Using Observability and BI Tools Together in Practice
    Conclusion
21. An Observability Maturity Model
    A Note About Maturity Models
    Why Observability Needs a Maturity Model
    About the Observability Maturity Model
    Capabilities Referenced in the OMM
    Respond to System Failure with Resilience
    Deliver High-Quality Code
    Manage Complexity and Technical Debt
    Release on a Predictable Cadence
    Understand User Behavior
    Using the OMM for Your Organization
    Conclusion
22. Where to Go from Here
    Observability, Then Versus Now
    Additional Resources
    Predictions for Where Observability Is Going
Index
About the Authors

Content preview from Observability Engineering

Chapter 12. Using Service-Level Objectives for Reliability

While observability and traditional monitoring can coexist, observability unlocks the potential to use more sophisticated and complementary approaches to monitoring. The next two chapters will show you how practicing observability and service-level objectives (SLOs) together can improve the reliability of your systems.

In this chapter, you will learn about the common problems that traditional threshold-based monitoring approaches create for your team, how distributed systems exacerbate those problems, and how using an SLO-based approach to monitoring instead solves those problems. We’ll conclude with a real-world example of replacing traditional threshold-based alerting with SLOs. And in Chapter 13, we’ll examine how observability makes your SLO-based alerts actionable and debuggable.

Let’s begin with understanding the role of monitoring and alerting and the previous approaches to them.

Traditional Monitoring Approaches Create Dangerous Alert Fatigue

In monitoring-based approaches, alerts often measure the things that are easiest to measure. Metrics are used to track simplistic system states that might indicate a service’s underlying process(es) may be running poorly or may be a leading indicator of troubles ahead. These states might, for example, trigger an alert if CPU is above 80%, or if available memory is below 10%, or if disk space is nearly full, or if more than x many threads are running, or any set of other simplistic ...
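To make the kind of alert described above concrete, here is a minimal sketch of static threshold checks in Python. It is illustrative only and is not taken from the book: the names `ThresholdRule`, `RULES`, `evaluate`, and the metric keys are assumptions. Only the two thresholds the text states numerically (CPU above 80%, available memory below 10%) are encoded; the disk and thread-count limits are left out because the text gives no values for them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    """A static alert rule: fire when one metric crosses a fixed limit."""
    name: str                          # hypothetical alert name
    metric: str                        # hypothetical key in the metrics sample
    breached: Callable[[float], bool]  # returns True when the alert should fire


# The two numeric thresholds mentioned in the text above.
RULES: List[ThresholdRule] = [
    ThresholdRule("cpu_high", "cpu_percent", lambda v: v > 80.0),
    ThresholdRule("memory_low", "memory_available_percent", lambda v: v < 10.0),
]


def evaluate(rules: List[ThresholdRule], sample: Dict[str, float]) -> List[str]:
    """Return the names of all rules whose threshold is crossed in this sample.

    Metrics missing from the sample are skipped rather than treated as alerts.
    """
    return [
        rule.name
        for rule in rules
        if rule.metric in sample and rule.breached(sample[rule.metric])
    ]


if __name__ == "__main__":
    sample = {"cpu_percent": 85.0, "memory_available_percent": 40.0}
    for alert in evaluate(RULES, sample):
        print(f"ALERT: {alert}")
```

Each rule looks at a single system metric in isolation and knows nothing about whether users are affected, which is the property of threshold alerting that this chapter goes on to question.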

Publisher Resources

ISBN: 9781492076438