Fundamentals of Data Observability

Book description

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that gives data teams greater visibility into data and its usage. If you're a data engineer, data architect, or machine learning engineer who depends on the quality of your data, this book shows you how to focus on the practical aspects of introducing data observability in your everyday work.

Author Andy Petrella helps you build the right habits to identify and solve data issues, such as data drift and poor data quality, so you can stop them from propagating through data applications, pipelines, and analytics. You'll learn ways to introduce data observability, including setting up a framework for generating and collecting all the information you need.

  • Learn the core principles and benefits of data observability
  • Use data observability to detect, troubleshoot, and prevent data issues
  • Follow the book's recipes to implement observability in your data projects
  • Use data observability to create a trustworthy communication framework with data consumers
  • Learn how to educate your peers about the benefits of data observability

Table of contents

  1. Preface
    1. Overview of the Book
    2. Who Should Read This Book
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
  2. I. Introducing Data Observability
  3. 1. Introducing Data Observability
    1. Scaling Data Teams
      1. Challenges of Scaling Data Teams
      2. Segregated Roles and Responsibilities and Organizational Complexity
      3. Anatomy of Data Issues and Consequences
      4. Impact of Data Issues on Data Team Dynamics
      5. Scaling AI Roadblocks
    2. Challenges with Current Data Management Practices
      1. Effects of Data Governance at Scale
      2. Data Observability to the Rescue
      3. The Areas of Observability
    3. How Data Teams Can Leverage Data Observability Now
      1. Low Latency Data Issues Detection
      2. Efficient Data Issues Troubleshooting
      3. Preventing Data Issues
      4. Decentralized Data Quality Management
      5. Complementing Existing Data Governance Capabilities
      6. The Future and Beyond
    4. Conclusion
  4. 2. Components of Data Observability
    1. Channels of Data Observability Information
      1. Logs
      2. Traces
      3. Metrics
    2. Observations Model
      1. Physical Space
      2. Server
      3. User
      4. Static Space
      5. Dynamic Space
    3. Expectations
      1. Rules
      2. Automatic Anomaly Detection
      3. Prevent Garbage In, Garbage Out
    4. Conclusion
  5. 3. Roles of Data Observability in a Data Organization
    1. Data Architecture
      1. Where Does Data Observability Fit in a Data Architecture?
      2. Data Architecture with Data Observability
    2. How Data Observability Helps with Data Engineering Undercurrents
      1. Security
      2. Data Management
    3. Support for Data Mesh’s Data as Products
    4. Conclusion
  6. II. Implementing Data Observability
  7. 4. Generate Data Observations
    1. At the Source
    2. Generating Data Observations at the Source
    3. Low-Level API in Python
      1. Description of the Data Pipeline
      2. Definition of the Status of the Data Pipeline
      3. Data Observations for the Data Pipeline
      4. Generate Contextual Data Observations
      5. Generate Data-Related Observations
      6. Generate Lineage-Related Data Observations
      7. Wrap-Up: The Data-Observable Data Pipeline
      8. Using Data Observations to Address Failures of the Data Pipeline
    4. Conclusion
  8. 5. Automate the Generation of Data Observations
    1. Abstraction Strategies
      1. Event Listeners
      2. Aspect-Oriented Programming
    2. High-Level Applications
      1. No-Code Applications
      2. Low-Code Applications
    3. Differences Among Monitoring Alternatives
    4. Conclusion
  9. 6. Implementing Expectations
    1. Introducing Expectations
      1. Shift-Left Data Quality
      2. Corner Cases Discovery
      3. Lifting Service Level Indicators
      4. Using Data Profilers
    2. Maintaining Expectations
    3. Overarching Practices
      1. Fail Fast and Fail Safe
      2. Simplify Tests and Extend CI/CD
    4. Conclusion
  10. III. Data Observability in Action
  11. 7. Integrating Data Observability in Your Data Stack
    1. Ingestion Stage
      1. Ingestion Stage Data Observability Recipes
      2. Airbyte Agent
    2. Transformation
      1. Transformation Stage Data Observability Recipes
      2. Apache Spark
      3. dbt Agent
    3. Serving
      1. Recipes
      2. BigQuery in Python
      3. Orchestrated SQL with Airflow
    4. Analytics
      1. Machine Learning Recipes
      2. Business Intelligence Recipes
    5. Conclusion
  12. 8. Making Opaque Systems Translucent
    1. Data Translucence
    2. Opaque Systems
      1. SaaS
      2. Don’t Touch It; It (Kinda) Works
      3. Inherited Systems
    3. Strategies for Data Translucence
      1. Strategies
      2. The Data Observability Connector
      3. Example: Building a dbt Data Observability Connector (SaaS)
    4. Conclusion
  13. Afterword: Future Observations
    1. Unification of Processing
    2. Generative Milestones
    3. Trustable Expanded Creativity
    4. Conclusion
  14. Index
  15. About the Author

Product information

  • Title: Fundamentals of Data Observability
  • Author(s): Andy Petrella
  • Release date: August 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098133290