Distributed Tracing in Practice

Book Description

Since most applications today are distributed in some fashion, monitoring their health and performance requires a new approach. Enter distributed tracing, a method of profiling and monitoring distributed applications—particularly those that use microservice architectures. There’s just one problem: distributed tracing can be hard. But it doesn’t have to be.

With this guide, you’ll learn what distributed tracing is and how to use it to understand the performance and operation of your software. Key players at LightStep and other organizations walk you through instrumenting your code for tracing, collecting the data that your instrumentation produces, and turning it into useful operational insights. If you want to implement distributed tracing, this book tells you what you need to know.

You’ll learn:

  • The pieces of a distributed tracing deployment: instrumentation, data collection, and analysis
  • Best practices for instrumentation: methods for generating trace data from your services
  • How to deal with (or avoid) overhead using sampling and other techniques
  • How to use distributed tracing to improve baseline performance and to mitigate regressions quickly
  • Where distributed tracing is headed in the future

Table of Contents

  1. Foreword
  2. Introduction: What Is Distributed Tracing?
    1. Distributed Architectures and You
    2. Deep Systems
    3. The Difficulties of Understanding Distributed Architectures
    4. How Does Distributed Tracing Help?
    5. Distributed Tracing and You
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  3. Chapter 1: The Problem with Distributed Tracing
    1. The Pieces of a Distributed Tracing Deployment
    2. Distributed Tracing, Microservices, Serverless, Oh My!
    3. The Benefits of Tracing
    4. Setting the Table
  4. Chapter 2: An Ontology of Instrumentation
    1. White Box Versus Black Box
    2. Application Versus System
    3. Agents Versus Libraries
    4. Propagating Context
      1. Interprocess Propagation
      2. Intraprocess Propagation
    5. The Shape of Distributed Tracing
      1. Tracing-Friendly Microservices and Serverless
      2. Tracing in a Monolith
      3. Tracing in Web and Mobile Clients
  5. Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks
    1. The Importance of Abstract Instrumentation
    2. OpenTelemetry
    3. OpenTracing and OpenCensus
      1. OpenTracing
      2. OpenCensus
    4. Other Notable Formats and Projects
      1. X-Ray
      2. Zipkin
    5. Interoperability and Migration Strategies
    6. Why Use Open Source Instrumentation?
      1. Interoperability
      2. Portability
      3. Ecosystem and Implicit Visibility
  6. Chapter 4: Best Practices for Instrumentation
    1. Tracing by Example
      1. Installing the Sample Application
      2. Adding Basic Distributed Tracing
      3. Custom Instrumentation
    2. Where to Start—Nodes and Edges
      1. Framework Instrumentation
      2. Service Mesh Instrumentation
      3. Creating Your Service Graph
    3. What’s in a Span?
      1. Effective Naming
      2. Effective Tagging
      3. Effective Logging
      4. Understanding Performance Considerations
    4. Trace-Driven Development
      1. Developing with Traces
      2. Testing with Traces
    5. Creating an Instrumentation Plan
      1. Making the Case for Instrumentation
      2. Instrumentation Quality Checklist
      3. Knowing When to Stop Instrumenting
      4. Smart and Sustainable Instrumentation Growth
  7. Chapter 5: Deploying Tracing
    1. Organizational Adoption
      1. Start Close to Your Users
      2. Start Centrally: Load Balancers and Gateways
      3. Leverage Infrastructure: RPC Frameworks and Service Meshes
      4. Make Adoption Repeatable
    2. Tracer Architecture
      1. In-Process Libraries
      2. Sidecars and Agents
      3. Collectors
      4. Centralized Storage and Analysis
      5. Incremental Deployment
    3. Data Provenance, Security, and Federation
      1. Frontend Service Telemetry
      2. Server-Side Telemetry for Managed Services
  8. Chapter 6: Overhead, Costs, and Sampling
    1. Application Overhead
      1. Latency
      2. Throughput
    2. Infrastructure Costs
      1. Network
      2. Storage
    3. Sampling
      1. Minimum Requirements
      2. Strategies
      3. Selecting Traces
    4. Off-the-Shelf ETL Solutions
  9. Chapter 7: A New Observability Scorecard
    1. The Three Pillars Defined
      1. Metrics
      2. Logging
      3. Distributed Tracing
    2. Fatal Flaws of the Three Pillars
      1. Design Goals
      2. Assessing the Three Pillars
      3. Three Pipes (Not Pillars)
    3. Observability Goals and Activities
      1. Two Goals in Observability
      2. Two Fundamental Activities in Observability
      3. A New Scorecard
      4. The Path Ahead
  10. Chapter 8: Improving Baseline Performance
    1. Measuring Performance
      1. Percentiles
      2. Histograms
    2. Defining the Critical Path
    3. Approaches to Improving Performance
      1. Individual Traces
      2. Biased Sampling and Trace Comparison
      3. Trace Search
      4. Multimodal Analysis
      5. Aggregate Analysis
      6. Correlation Analysis
  11. Chapter 9: Restoring Baseline Performance
    1. Defining the Problem
    2. Human Factors
      1. (Avoiding) Finger-Pointing
      2. “Suppressing” the Messenger
      3. Incident Hand-off
      4. Good Postmortems
    3. Approaches to Restoring Performance
      1. Integration with Alerting Workflows
      2. Individual Traces
      3. Biased Sampling
      4. Real-Time Response
      5. Knowing What’s Normal
      6. Aggregate and Correlation Root Cause Analysis
  12. Chapter 10: Are We There Yet? The Past and Present
    1. Distributed Tracing: A History of Pragmatism
      1. Request-Based Systems
      2. Response Time Matters
      3. Request-Oriented Information
    2. Notable Work
      1. Pinpoint
      2. Magpie
      3. X-Trace
      4. Dapper
    3. Where to Next?
  13. Chapter 11: Beyond Individual Requests
    1. The Value of Traces in Aggregate
      1. Example 1: Is Network Congestion Affecting My Application?
      2. Example 2: What Services Are Required to Serve an API Endpoint?
    2. Organizing the Data
      1. A Strawperson Solution
    3. What About the Trade-offs?
    4. Sampling for Aggregate Analysis
    5. The Processing Pipeline
    6. Incorporating Heterogeneous Data
    7. Custom Functions
      1. Joining with Other Data Sources
    8. Recap and Case Study
      1. The Value of Traces in Aggregate
      2. Organizing the Data
      3. Sampling for Aggregate Analysis
      4. The Processing Pipeline
      5. Incorporating Heterogeneous Data
  14. Chapter 12: Beyond Spans
    1. Why Spans Have Prevailed
      1. Visibility
      2. Pragmatism
      3. Portability
      4. Compatibility
      5. Flexibility
    2. Why Spans Aren’t Enough
      1. Graphs, Not Trees
      2. Inter-Request Dependencies
      3. Decoupled Dependencies
      4. Distributed Dataflow
      5. Machine Learning
      6. Low-Level Performance Metrics
    3. New Abstractions
    4. Seeing Causality
  15. Chapter 13: Beyond Distributed Tracing
    1. Limitations of Distributed Tracing
      1. Challenge 1: Anticipating Problems
      2. Challenge 2: Completeness Versus Costs
      3. Challenge 3: Open-Ended Use Cases
    2. Other Tools Like Distributed Tracing
    3. Census
      1. A Motivating Example
      2. A Distributed Tracing Solution?
      3. Tag Propagation and Local Metric Aggregation
      4. Comparison to Distributed Tracing
    4. Pivot Tracing
      1. Dynamic Instrumentation
      2. Recurring Problems
      3. How Does It Work?
      4. Dynamic Context
      5. Comparison to Distributed Tracing
    5. Pythia
      1. Performance Regressions
      2. Design
      3. Overheads
      4. Comparison to Distributed Tracing
  16. Chapter 14: The Future of Context Propagation
    1. Cross-Cutting Tools
    2. Use Cases
      1. Distributed Tracing
      2. Cross-Component Metrics
      3. Cross-Component Resource Management
      4. Managing Data Quality Trade-offs
      5. Failure Testing of Microservices
      6. Enforcing Cross-System Consistency
      7. Request Duplication
      8. Record Lineage in Stream Processing Systems
      9. Auditing Security Policies
      10. Testing in Production
    3. Common Themes
    4. Should You Care?
    5. The Tracing Plane
      1. Is Baggage Enough?
      2. Beyond Key-Value Pairs
      3. Compiling BDL
      4. BaggageContext
      5. Merging
      6. Overheads
  17. Appendix A: The State of Distributed Tracing Circa 2020
    1. Open Source Tracers and Trace Analysis
    2. Commercial Tracers and Trace Analyzers
    3. Language-Specific Tracing Features
      1. Java and C#
      2. Go, Rust, and C++
      3. Python, JavaScript, and Other Dynamic Languages
  18. Appendix B: Context Propagation in OpenTelemetry
    1. Why a Separate Context Model?
    2. The OpenTelemetry Context Model
      1. W3C CorrelationContext and the Correlations API
      2. Distributed and Local Context
    3. Examples and Potential Applications
  19. Bibliography
  20. Index

Product Information

  • Title: Distributed Tracing in Practice
  • Author(s): Austin Parker, Daniel Spoonhower, Jonathan Mace, Ben Sigelman, Rebecca Isaacs
  • Release date: April 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492056638