Observability Engineering

by Charity Majors, Liz Fong-Jones, George Miranda

May 2022

Intermediate to advanced

318 pages

9h 15m

English

O'Reilly Media, Inc.

Foreword
Preface
Who This Book Is For; Why We Wrote This Book; What You Will Learn; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. The Path to Observability
1. What Is Observability?
The Mathematical Definition of Observability; Applying Observability to Software Systems; Mischaracterizations About Observability for Software; Why Observability Matters Now; Is This Really the Best Way?; Why Are Metrics and Monitoring Not Enough?; Debugging with Metrics Versus Observability; The Role of Cardinality; The Role of Dimensionality; Debugging with Observability; Observability Is for Modern Systems; Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
How Monitoring Data Is Used for Debugging; Troubleshooting Behaviors When Using Dashboards; The Limitations of Troubleshooting by Intuition; Traditional Monitoring Is Fundamentally Reactive; How Observability Enables Better Debugging; Conclusion
3. Lessons from Scaling Without Observability
An Introduction to Parse; Scaling at Parse; The Evolution Toward Modern Systems; The Evolution Toward Modern Practices; Shifting Practices at Parse; Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
Cloud Native, DevOps, and SRE in a Nutshell; Observability: Debugging Then Versus Now; Observability Empowers DevOps and SRE Practices; Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
Debugging with Structured Events; The Limitations of Metrics as a Building Block; The Limitations of Traditional Logs as a Building Block; Unstructured Logs; Structured Logs; Properties of Events That Are Useful in Debugging; Conclusion
6. Stitching Events into Traces
Distributed Tracing and Why It Matters Now; The Components of Tracing; Instrumenting a Trace the Hard Way; Adding Custom Fields into Trace Spans; Stitching Events into Traces; Conclusion
7. Instrumentation with OpenTelemetry
A Brief Introduction to Instrumentation; Open Instrumentation Standards; Instrumentation Using Code-Based Examples; Start with Automatic Instrumentation; Add Custom Instrumentation; Send Instrumentation Data to a Backend System; Conclusion
8. Analyzing Events to Achieve Observability
Debugging from Known Conditions; Debugging from First Principles; Using the Core Analysis Loop; Automating the Brute-Force Portion of the Core Analysis Loop; This Misleading Promise of AIOps; Conclusion
9. How Observability and Monitoring Come Together
Where Monitoring Fits; Where Observability Fits; System Versus Software Considerations; Assessing Your Organizational Needs; Exceptions: Infrastructure Monitoring That Can’t Be Ignored; Real-World Examples; Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
Join a Community Group; Start with the Biggest Pain Points; Buy Instead of Build; Flesh Out Your Instrumentation Iteratively; Look for Opportunities to Leverage Existing Efforts; Prepare for the Hardest Last Push; Conclusion
11. Observability-Driven Development
Test-Driven Development; Observability in the Development Cycle; Determining Where to Debug; Debugging in the Time of Microservices; How Instrumentation Drives Observability; Shifting Observability Left; Using Observability to Speed Up Software Delivery; Conclusion
12. Using Service-Level Objectives for Reliability
Traditional Monitoring Approaches Create Dangerous Alert Fatigue; Threshold Alerting Is for Known-Unknowns Only; User Experience Is a North Star; What Is a Service-Level Objective?; Reliable Alerting with SLOs; Changing Culture Toward SLO-Based Alerts: A Case Study; Conclusion
13. Acting on and Debugging SLO-Based Alerts
Alerting Before Your Error Budget Is Empty; Framing Time as a Sliding Window; Forecasting to Create a Predictive Burn Alert; The Lookahead Window; The Baseline Window; Acting on SLO Burn Alerts; Using Observability Data for SLOs Versus Time-Series Data; Conclusion
14. Observability and the Software Supply Chain
Why Slack Needed Observability; Instrumentation: Shared Client Libraries and Dimensions; Case Studies: Operationalizing the Supply Chain; Understanding Context Through Tooling; Embedding Actionable Alerting; Understanding What Changed; Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
How to Analyze the ROI of Observability; The Real Costs of Building Your Own; The Hidden Costs of Using “Free” Software; The Benefits of Building Your Own; The Risks of Building Your Own; The Real Costs of Buying Software; The Hidden Financial Costs of Commercial Software; The Hidden Nonfinancial Costs of Commercial Software; The Benefits of Buying Commercial Software; The Risks of Buying Commercial Software; Buy Versus Build Is Not a Binary Choice; Conclusion
16. Efficient Data Storage
The Functional Requirements for Observability; Time-Series Databases Are Inadequate for Observability; Other Possible Data Stores; Data Storage Strategies; Case Study: The Implementation of Honeycomb’s Retriever; Partitioning Data by Time; Storing Data by Column Within Segments; Performing Query Workloads; Querying for Traces; Querying Data in Real Time; Making It Affordable with Tiering; Making It Fast with Parallelism; Dealing with High Cardinality; Scaling and Durability Strategies; Notes on Building Your Own Efficient Data Store; Conclusion
17. Cheap and Accurate Enough: Sampling
Sampling to Refine Your Data Collection; Using Different Approaches to Sampling; Constant-Probability Sampling; Sampling on Recent Traffic Volume; Sampling Based on Event Content (Keys); Combining per Key and Historical Methods; Choosing Dynamic Sampling Options; When to Make a Sampling Decision for Traces; Translating Sampling Strategies into Code; The Base Case; Fixed-Rate Sampling; Recording the Sample Rate; Consistent Sampling; Target Rate Sampling; Having More Than One Static Sample Rate; Sampling by Key and Target Rate; Sampling with Dynamic Rates on Arbitrarily Many Keys; Putting It All Together: Head and Tail per Key Target Rate Sampling; Conclusion
18. Telemetry Management with Pipelines
Attributes of Telemetry Pipelines; Routing; Security and Compliance; Workload Isolation; Data Buffering; Capacity Management; Data Filtering and Augmentation; Data Transformation; Ensuring Data Quality and Consistency; Managing a Telemetry Pipeline: Anatomy; Challenges When Managing a Telemetry Pipeline; Performance; Correctness; Availability; Reliability; Isolation; Data Freshness; Use Case: Telemetry Management at Slack; Metrics Aggregation; Logs and Trace Events; Open Source Alternatives; Managing a Telemetry Pipeline: Build Versus Buy; Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
The Reactive Approach to Introducing Change; The Return on Investment of Observability; The Proactive Approach to Introducing Change; Introducing Observability as a Practice; Using the Appropriate Tools; Instrumentation; Data Storage and Analytics; Rolling Out Tools to Your Teams; Knowing When You Have Enough Observability; Conclusion
20. Observability’s Stakeholders and Allies
Recognizing Nonengineering Observability Needs; Creating Observability Allies in Practice; Customer Support Teams; Customer Success and Product Teams; Sales and Executive Teams; Using Observability Versus Business Intelligence Tools; Query Execution Time; Accuracy; Recency; Structure; Time Windows; Ephemerality; Using Observability and BI Tools Together in Practice; Conclusion
21. An Observability Maturity Model
A Note About Maturity Models; Why Observability Needs a Maturity Model; About the Observability Maturity Model; Capabilities Referenced in the OMM; Respond to System Failure with Resilience; Deliver High-Quality Code; Manage Complexity and Technical Debt; Release on a Predictable Cadence; Understand User Behavior; Using the OMM for Your Organization; Conclusion
22. Where to Go from Here
Observability, Then Versus Now; Additional Resources; Predictions for Where Observability Is Going
Index
About the Authors

Content preview from Observability Engineering

Part IV. Observability at Scale

In Part III, we focused on overcoming barriers to getting started and new workflows that help change social and cultural practices in order to put some momentum behind your observability adoption initiatives. In this part, we examine considerations on the other end of the adoption spectrum: what happens when observability adoption is successful and practiced at scale?

When it comes to observability, “at scale” is probably larger than most people think. As a rough ballpark, if you generate telemetry events in the high hundreds of millions or low billions per day, you might have a scale issue. The concepts explored in this part are most acutely felt when operating observability solutions at scale. However, these lessons are generally useful to anyone going down the path of observability.
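
To make that ballpark concrete, here is a minimal sketch that estimates daily event volume from an average request rate. The function names, the example traffic figures, and the 700 million cutoff are all illustrative assumptions; the text gives only a rough range, not a precise threshold.

```python
# Back-of-the-envelope check for the "at scale" ballpark.
# All names and the threshold value are illustrative assumptions,
# not figures taken from the book.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Assumed cutoff standing in for "high hundreds of millions" of events per day.
SCALE_THRESHOLD_EVENTS_PER_DAY = 700_000_000


def events_per_day(requests_per_second: float, events_per_request: float) -> float:
    """Estimate daily telemetry volume from a steady average request rate."""
    return requests_per_second * events_per_request * SECONDS_PER_DAY


def might_have_scale_issue(
    daily_events: float,
    threshold: float = SCALE_THRESHOLD_EVENTS_PER_DAY,
) -> bool:
    """Return True when the estimated volume reaches the assumed threshold."""
    return daily_events >= threshold


if __name__ == "__main__":
    # Hypothetical service: 2,000 requests/s, 5 events emitted per request.
    daily = events_per_day(requests_per_second=2_000, events_per_request=5)
    print(f"{daily:,.0f} events/day, scale issue likely: {might_have_scale_issue(daily)}")
```

Under these assumed numbers, a service averaging 2,000 requests per second with five events per request lands at roughly 864 million events per day, inside the range the text describes. Real traffic is rarely steady, so treat the result as an order-of-magnitude estimate.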

Chapter 15 explores the decision of whether to buy or build an observability solution. At a large enough scale, as the bill for commercial solutions grows, teams will start to consider whether they can save more by simply building an observability solution themselves. This chapter provides guidance on how best to approach that decision.

Chapter 16 explores how a data store must be configured in order to serve the needs of an observability workload. To achieve the functional requirements of iterative and open-ended investigations, several technical criteria must be met. This chapter presents a case study of Honeycomb’s Retriever engine as a model ...

Publisher Resources

ISBN: 9781492076438