Observability Engineering

Book description

Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development.

Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

You'll explore:

  • How the concept of observability applies to managing software at scale
  • The value of practicing observability when delivering complex cloud native applications and systems
  • The impact observability has across the entire software development lifecycle
  • How and why different functional teams use observability with service-level objectives
  • How to instrument your code to help future engineers understand the code you wrote today
  • How to produce quality code for context-aware system debugging and maintenance
  • How data-rich analytics can help you debug elusive issues

Table of contents

  1. Foreword
  2. Preface
    1. Who This Book Is For
    2. Why We Wrote This Book
    3. What You Will Learn
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  3. I. The Path to Observability
  4. 1. What Is Observability?
    1. The Mathematical Definition of Observability
    2. Applying Observability to Software Systems
    3. Mischaracterizations About Observability for Software
    4. Why Observability Matters Now
      1. Is This Really the Best Way?
      2. Why Are Metrics and Monitoring Not Enough?
    5. Debugging with Metrics Versus Observability
      1. The Role of Cardinality
      2. The Role of Dimensionality
    6. Debugging with Observability
    7. Observability Is for Modern Systems
    8. Conclusion
  5. 2. How Debugging Practices Differ Between Observability and Monitoring
    1. How Monitoring Data Is Used for Debugging
      1. Troubleshooting Behaviors When Using Dashboards
      2. The Limitations of Troubleshooting by Intuition
      3. Traditional Monitoring Is Fundamentally Reactive
    2. How Observability Enables Better Debugging
    3. Conclusion
  6. 3. Lessons from Scaling Without Observability
    1. An Introduction to Parse
    2. Scaling at Parse
    3. The Evolution Toward Modern Systems
    4. The Evolution Toward Modern Practices
    5. Shifting Practices at Parse
    6. Conclusion
  7. 4. How Observability Relates to DevOps, SRE, and Cloud Native
    1. Cloud Native, DevOps, and SRE in a Nutshell
    2. Observability: Debugging Then Versus Now
    3. Observability Empowers DevOps and SRE Practices
    4. Conclusion
  8. II. Fundamentals of Observability
  9. 5. Structured Events Are the Building Blocks of Observability
    1. Debugging with Structured Events
    2. The Limitations of Metrics as a Building Block
    3. The Limitations of Traditional Logs as a Building Block
      1. Unstructured Logs
      2. Structured Logs
    4. Properties of Events That Are Useful in Debugging
    5. Conclusion
  10. 6. Stitching Events into Traces
    1. Distributed Tracing and Why It Matters Now
    2. The Components of Tracing
    3. Instrumenting a Trace the Hard Way
    4. Adding Custom Fields into Trace Spans
    5. Stitching Events into Traces
    6. Conclusion
  11. 7. Instrumentation with OpenTelemetry
    1. A Brief Introduction to Instrumentation
    2. Open Instrumentation Standards
    3. Instrumentation Using Code-Based Examples
      1. Start with Automatic Instrumentation
      2. Add Custom Instrumentation
      3. Send Instrumentation Data to a Backend System
    4. Conclusion
  12. 8. Analyzing Events to Achieve Observability
    1. Debugging from Known Conditions
    2. Debugging from First Principles
      1. Using the Core Analysis Loop
      2. Automating the Brute-Force Portion of the Core Analysis Loop
    3. The Misleading Promise of AIOps
    4. Conclusion
  13. 9. How Observability and Monitoring Come Together
    1. Where Monitoring Fits
    2. Where Observability Fits
    3. System Versus Software Considerations
    4. Assessing Your Organizational Needs
      1. Exceptions: Infrastructure Monitoring That Can’t Be Ignored
      2. Real-World Examples
    5. Conclusion
  14. III. Observability for Teams
  15. 10. Applying Observability Practices in Your Team
    1. Join a Community Group
    2. Start with the Biggest Pain Points
    3. Buy Instead of Build
    4. Flesh Out Your Instrumentation Iteratively
    5. Look for Opportunities to Leverage Existing Efforts
    6. Prepare for the Hardest Last Push
    7. Conclusion
  16. 11. Observability-Driven Development
    1. Test-Driven Development
    2. Observability in the Development Cycle
    3. Determining Where to Debug
    4. Debugging in the Time of Microservices
    5. How Instrumentation Drives Observability
    6. Shifting Observability Left
    7. Using Observability to Speed Up Software Delivery
    8. Conclusion
  17. 12. Using Service-Level Objectives for Reliability
    1. Traditional Monitoring Approaches Create Dangerous Alert Fatigue
    2. Threshold Alerting Is for Known-Unknowns Only
    3. User Experience Is a North Star
    4. What Is a Service-Level Objective?
      1. Reliable Alerting with SLOs
      2. Changing Culture Toward SLO-Based Alerts: A Case Study
    5. Conclusion
  18. 13. Acting on and Debugging SLO-Based Alerts
    1. Alerting Before Your Error Budget Is Empty
    2. Framing Time as a Sliding Window
    3. Forecasting to Create a Predictive Burn Alert
      1. The Lookahead Window
      2. The Baseline Window
      3. Acting on SLO Burn Alerts
    4. Using Observability Data for SLOs Versus Time-Series Data
    5. Conclusion
  19. 14. Observability and the Software Supply Chain
    1. Why Slack Needed Observability
    2. Instrumentation: Shared Client Libraries and Dimensions
    3. Case Studies: Operationalizing the Supply Chain
      1. Understanding Context Through Tooling
      2. Embedding Actionable Alerting
      3. Understanding What Changed
    4. Conclusion
  20. IV. Observability at Scale
  21. 15. Build Versus Buy and Return on Investment
    1. How to Analyze the ROI of Observability
    2. The Real Costs of Building Your Own
      1. The Hidden Costs of Using “Free” Software
      2. The Benefits of Building Your Own
      3. The Risks of Building Your Own
    3. The Real Costs of Buying Software
      1. The Hidden Financial Costs of Commercial Software
      2. The Hidden Nonfinancial Costs of Commercial Software
      3. The Benefits of Buying Commercial Software
      4. The Risks of Buying Commercial Software
    4. Buy Versus Build Is Not a Binary Choice
    5. Conclusion
  22. 16. Efficient Data Storage
    1. The Functional Requirements for Observability
      1. Time-Series Databases Are Inadequate for Observability
      2. Other Possible Data Stores
      3. Data Storage Strategies
    2. Case Study: The Implementation of Honeycomb’s Retriever
      1. Partitioning Data by Time
      2. Storing Data by Column Within Segments
      3. Performing Query Workloads
      4. Querying for Traces
      5. Querying Data in Real Time
      6. Making It Affordable with Tiering
      7. Making It Fast with Parallelism
      8. Dealing with High Cardinality
      9. Scaling and Durability Strategies
      10. Notes on Building Your Own Efficient Data Store
    3. Conclusion
  23. 17. Cheap and Accurate Enough: Sampling
    1. Sampling to Refine Your Data Collection
    2. Using Different Approaches to Sampling
      1. Constant-Probability Sampling
      2. Sampling on Recent Traffic Volume
      3. Sampling Based on Event Content (Keys)
      4. Combining per Key and Historical Methods
      5. Choosing Dynamic Sampling Options
      6. When to Make a Sampling Decision for Traces
    3. Translating Sampling Strategies into Code
      1. The Base Case
      2. Fixed-Rate Sampling
      3. Recording the Sample Rate
      4. Consistent Sampling
      5. Target Rate Sampling
      6. Having More Than One Static Sample Rate
      7. Sampling by Key and Target Rate
      8. Sampling with Dynamic Rates on Arbitrarily Many Keys
      9. Putting It All Together: Head and Tail per Key Target Rate Sampling
    4. Conclusion
  24. 18. Telemetry Management with Pipelines
    1. Attributes of Telemetry Pipelines
      1. Routing
      2. Security and Compliance
      3. Workload Isolation
      4. Data Buffering
      5. Capacity Management
      6. Data Filtering and Augmentation
      7. Data Transformation
      8. Ensuring Data Quality and Consistency
    2. Managing a Telemetry Pipeline: Anatomy
    3. Challenges When Managing a Telemetry Pipeline
      1. Performance
      2. Correctness
      3. Availability
      4. Reliability
      5. Isolation
      6. Data Freshness
    4. Use Case: Telemetry Management at Slack
      1. Metrics Aggregation
      2. Logs and Trace Events
    5. Open Source Alternatives
    6. Managing a Telemetry Pipeline: Build Versus Buy
    7. Conclusion
  25. V. Spreading Observability Culture
  26. 19. The Business Case for Observability
    1. The Reactive Approach to Introducing Change
    2. The Return on Investment of Observability
    3. The Proactive Approach to Introducing Change
    4. Introducing Observability as a Practice
    5. Using the Appropriate Tools
      1. Instrumentation
      2. Data Storage and Analytics
      3. Rolling Out Tools to Your Teams
    6. Knowing When You Have Enough Observability
    7. Conclusion
  27. 20. Observability’s Stakeholders and Allies
    1. Recognizing Nonengineering Observability Needs
    2. Creating Observability Allies in Practice
      1. Customer Support Teams
      2. Customer Success and Product Teams
      3. Sales and Executive Teams
    3. Using Observability Versus Business Intelligence Tools
      1. Query Execution Time
      2. Accuracy
      3. Recency
      4. Structure
      5. Time Windows
      6. Ephemerality
    4. Using Observability and BI Tools Together in Practice
    5. Conclusion
  28. 21. An Observability Maturity Model
    1. A Note About Maturity Models
    2. Why Observability Needs a Maturity Model
    3. About the Observability Maturity Model
    4. Capabilities Referenced in the OMM
      1. Respond to System Failure with Resilience
      2. Deliver High-Quality Code
      3. Manage Complexity and Technical Debt
      4. Release on a Predictable Cadence
      5. Understand User Behavior
    5. Using the OMM for Your Organization
    6. Conclusion
  29. 22. Where to Go from Here
    1. Observability, Then Versus Now
    2. Additional Resources
    3. Predictions for Where Observability Is Going
  30. Index
  31. About the Authors

Product information

  • Title: Observability Engineering
  • Author(s): Charity Majors, Liz Fong-Jones, George Miranda
  • Release date: May 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492076445