Automating Data Quality Monitoring

Book description

The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount of data (used to build products, power AI systems, and drive business decisions) is poor quality or just plain bad? This practical book shows you how to ensure that the data your organization relies on contains only high-quality records.

Most data engineers, data analysts, and data scientists genuinely care about data quality, but they often don't have the time, resources, or understanding to create a data quality monitoring solution that succeeds at scale. In this book, Jeremy Stanley and Paige Schwartz from Anomalo explain how you can use automated data quality monitoring to cover all your tables efficiently, proactively alert on every category of issue, and resolve problems immediately.

This book will help you:

  • Learn why data quality is a business imperative
  • Understand and assess unsupervised learning models for detecting data issues
  • Implement notifications that reduce alert fatigue and let you triage and resolve issues quickly
  • Integrate automated data quality monitoring with data catalogs, orchestration layers, and BI and ML systems
  • Understand the limits of automated data quality monitoring and how to overcome them
  • Learn how to deploy and manage your monitoring solution at scale
  • Maintain automated data quality monitoring for the long term

Table of contents

  1. Foreword
  2. Preface
    1. Who Should Use This Book
    2. Conventions Used in This Book
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  3. 1. The Data Quality Imperative
    1. High-Quality Data Is the New Gold
      1. Data-Driven Companies Are Today’s Disrupters
      2. Data Analytics Is Democratized
      3. AI and Machine Learning Are Differentiators
      4. Companies Are Investing in a Modern Data Stack
    2. More Data, More Problems
      1. Issues Inside the Data Factory
      2. Data Migrations
      3. Third-Party Data Sources
      4. Company Growth and Change
      5. Exogenous Factors
    3. Why We Need Data Quality Monitoring
      1. Data Scars
      2. Data Shocks
    4. Automating Data Quality Monitoring: The New Frontier
  4. 2. Data Quality Monitoring Strategies and the Role of Automation
    1. Monitoring Requirements
    2. Data Observability: Necessary, but Not Sufficient
    3. Traditional Approaches to Data Quality
      1. Manual Data Quality Detection
      2. Rule-Based Testing
      3. Metrics Monitoring
    4. Automating Data Quality Monitoring with Unsupervised Machine Learning
      1. What Is Unsupervised Machine Learning?
      2. An Analogy: Lane Departure Warnings
      3. The Limits of Automation
    5. A Four-Pillar Approach to Data Quality Monitoring
  5. 3. Assessing the Business Impact of Automated Data Quality Monitoring
    1. Assessing Your Data
      1. Volume
      2. Variety
      3. Velocity
      4. Veracity
      5. Special Cases
    2. Assessing Your Industry
      1. Regulatory Pressure
      2. AI/ML Risks
      3. Data as a Product
    3. Assessing Your Data Maturity
    4. Assessing Benefits to Stakeholders
      1. Engineers
      2. Data Leadership
      3. Scientists
      4. Consumers
    5. Conducting an ROI Analysis
      1. Quantitative Measures
      2. Qualitative Measures
    6. Conclusion
  6. 4. Automating Data Quality Monitoring with Machine Learning
    1. Requirements
      1. Sensitivity
      2. Specificity
      3. Transparency
      4. Scalability
      5. Nonrequirements
      6. Data Quality Monitoring Is Not Outlier Detection
    2. ML Approach and Algorithm
      1. Data Sampling
      2. Feature Encoding
      3. Model Development
      4. Model Explainability
    3. Putting It Together with Pseudocode
    4. Other Applications
    5. Conclusion
  7. 5. Building a Model That Works on Real-World Data
    1. Data Challenges and Mitigations
      1. Seasonality
      2. Time-Based Features
      3. Chaotic Tables
      4. Updated-in-Place Tables
      5. Column Correlations
    2. Model Testing
      1. Injecting Synthetic Issues
      2. Benchmarking
      3. Improving the Model
    3. Conclusion
  8. 6. Implementing Notifications While Avoiding Alert Fatigue
    1. How Notifications Facilitate Data Issue Response
      1. Triage
      2. Routing
      3. Resolution
      4. Documentation
    2. Taking Action Without Notifications
    3. Anatomy of a Notification
      1. Visualization
      2. Actions
      3. Text Description
      4. Who Created/Last Edited the Check
    4. Delivering Notifications
      1. Notification Audience
      2. Notification Channels
      3. Notification Timing
    5. Avoiding Alert Fatigue
      1. Scheduling Checks in the Right Order
      2. Clustering Alerts Using Machine Learning
      3. Suppressing Notifications
    6. Automating the Root Cause Analysis
    7. Conclusion
  9. 7. Integrating Monitoring with Data Tools and Systems
    1. Monitoring Your Data Stack
    2. Data Warehouses
      1. Integrating with Data Warehouses
      2. Security
      3. Reconciling Data Across Multiple Warehouses
    3. Data Orchestrators
      1. Integrating with Orchestrators
    4. Data Catalogs
      1. Integrating with Catalogs
    5. Data Consumers
      1. Analytics and BI Tools
      2. MLOps
    6. Conclusion
  10. 8. Operating Your Solution at Scale
    1. Build Versus Buy
      1. Vendor Deployment Models
    2. Configuration
      1. Determining Which Tables Are Most Important
      2. Deciding What Data in a Table to Monitor
      3. Configuration at Scale
    3. Enablement
      1. User Roles and Permissions
      2. Onboarding, Training, and Support
    4. Improving Data Quality Over Time
      1. Initiatives
      2. Metrics
    5. From Chaos to Clarity
  11. Appendix. Types of Data Quality Issues
    1. Table Issues
      1. Late Arrival
      2. Schema Changes
      3. Untraceable Changes
    2. Row Issues
      1. Incomplete Rows
      2. Duplicate Rows
      3. Temporal Inconsistency
    3. Value Issues
      1. Missing Values
      2. Incorrect Values
      3. Invalid Values
    4. Multi Issues
      1. Relational Failures
      2. Inconsistent Sources
  12. Index
  13. About the Authors

Product information

  • Title: Automating Data Quality Monitoring
  • Author(s): Jeremy Stanley, Paige Schwartz
  • Release date: January 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098145934