Observability Engineering

Book description

Observability is critical for engineering, managing, and improving complex business-critical systems. By practicing observability, any software engineering team can gain a deeper understanding of system performance, perform ongoing maintenance with confidence, and ship the features its customers need. This practical book explains the value of observable systems and shows you how to build an observability-driven development practice.

Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve on what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics monitoring and log management. You'll also learn the impact observability has on organizational culture.

You'll explore:

  • The value of practicing observability when delivering and managing complex cloud native applications and systems
  • The impact observability has across the entire software engineering cycle
  • Software ownership: how different functional teams help achieve system SLOs
  • How software developers contribute to customer experience and business impact
  • How to produce quality code for context-aware system debugging and maintenance
  • How data-rich analytics can help you find answers quickly when maintaining site reliability

Table of contents

  1. Preface
    1. Who this is for
    2. Why we wrote this book
    3. What you will learn
  2. 1. What Is Observability?
    1. The mathematical definition of observability
    2. Applying observability to software systems
    3. Mischaracterizations of observability for software
    4. Why observability matters now
    5. Is this really the best way?
    6. Why are metrics and monitoring not enough?
    7. Debugging with metrics vs. observability
    8. The role of cardinality
    9. Debugging with observability
    10. Observability is for modern systems
    11. Conclusion
  3. 2. How Observability Differs from Monitoring
    1. How monitoring data is used
      1. Troubleshooting behaviors when using dashboards
      2. The limitations of troubleshooting by intuition
      3. Traditional monitoring is fundamentally reactive
    2. How observability is different
    3. Conclusion
  4. 3. Lessons from Scaling Without Observability
    1. An introduction to Parse
    2. Scaling at Parse
    3. The evolution toward modern systems
    4. The evolution toward modern practices
    5. Shifting practices at Parse
    6. Conclusion
  5. 4. How Observability Relates to DevOps, SRE, and Cloud Native
    1. Cloud Native, DevOps, and SRE in a nutshell
    2. Observability: Debugging Then vs. Now
      1. Observability empowers DevOps and SRE practices
  6. 5. Structured Events Are the Building Blocks of Observability
    1. Debugging with structured events
    2. The limitations of metrics as a building block
    3. The limitations of unstructured data as a building block
    4. Properties of events that are useful in debugging
    5. Conclusion
  7. 6. Stitching Events into Traces
    1. Distributed tracing and why it matters now
      1. The components of tracing
    2. Instrumenting a trace the hard way
      1. Adding custom fields into trace spans
    3. Stitching events into traces
    4. Conclusion
  8. 7. Analyzing Events to Achieve Observability
    1. Debugging from known conditions
    2. Debugging from first principles
      1. The core analysis loop
      2. Automating the brute force portion of the core analysis loop
    3. The misleading promise of AIOps
    4. Conclusion
  9. 8. How Observability and Monitoring Come Together
    1. Where Monitoring Fits
    2. Infrastructure Considerations vs. Software Considerations
    3. Assessing Your Organizational Needs
      1. Exceptions: Infrastructure Monitoring That Can’t Be Ignored
    4. Real World Examples
    5. Conclusion
  10. 9. Applying Observability Practices in Your Team
    1. Join a community group
    2. Start with the biggest pain points
    3. Buy instead of build
    4. Flesh out your instrumentation iteratively
    5. Look for opportunities to leverage existing efforts
    6. The last push is the hardest to complete
    7. Conclusion
  11. 10. Observability-Driven Development
    1. Test-driven development
    2. Observability in the development cycle
    3. Determining where to debug
      1. Debugging in the time of microservices
      2. How instrumentation drives observability
    4. Shifting observability left
  12. 11. Using Service Level Objectives for Reliability
    1. Introduction to Service Level Objectives
      1. Traditional Monitoring Approaches Create Dangerous Alert Fatigue
      2. Distributed Systems Exacerbate the Alerting Problem
      3. Static Thresholds Can’t Reliably Indicate Degraded User Experience
      4. Reliable Alerting with SLOs
      5. Changing Culture Toward SLO-Based Alerts: A Case Study
    2. Conclusion
  13. 12. Using Observability Data to Model Actionable SLOs
    1. Alerting before your error budget is empty
    2. Framing time as a sliding window
    3. Forecast models to create a predictive burn alert
      1. The lookahead window
      2. The baseline window
      3. Acting on SLO burn alerts
    4. Observability data for SLOs vs. time series data
    5. Conclusion
  14. 13. Cheap and Accurate Enough: Sampling
    1. Sampling to refine your data collection
    2. Different approaches to sampling
      1. Constant-probability sampling
      2. Sampling on recent traffic volume
      3. Sampling based on event content (keys)
      4. Combining per-key and historical methods
      5. Choosing dynamic sampling options
      6. When to make a sampling decision for traces
    3. Translating sampling strategies into code
      1. The base case
      2. Fixed-rate sampling
      3. Recording the sample rate
      4. Consistent sampling
      5. Target rate sampling
      6. Having more than one static sample rate
      7. Sampling by key and target rate
      8. Sampling with dynamic rates on arbitrarily many keys
      9. Putting it all together: head and tail per-key target rate sampling
    4. Conclusion
  15. 14. The Business Case for Observability
    1. The reactive approach to introducing change
    2. The proactive approach to introducing change
    3. Introducing observability as a practice
    4. Using the appropriate tools
      1. Instrumentation
      2. Data storage and analytics
      3. Rolling out tools to your teams
    5. Knowing when you have enough observability
    6. Conclusion
  16. 15. An Observability Maturity Model
    1. A foreword about maturity models
    2. Why observability needs a maturity model
    3. About the Observability Maturity Model
    4. Capabilities referenced in the OMM
      1. Respond to system failure with resilience
      2. Deliver high quality code
      3. Manage complexity and technical debt
      4. Release on a predictable cadence
      5. Understand user behavior
    5. Using the OMM for your organization
    6. Conclusion

Product information

  • Title: Observability Engineering
  • Author(s): Charity Majors, Liz Fong-Jones, George Miranda
  • Release date: May 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492076445