Data Quality Fundamentals

by Barr Moses, Lior Gavish, Molly Vorwerck
Released September 2022
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098112042

Book description

Do your product dashboards look funky? Are your quarterly reports stale? Is the dataset you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to any of the questions above, this book is for you.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck from the data reliability company Monte Carlo explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

  • Build more trustworthy and reliable data pipelines
  • Write scripts to run data checks and identify broken pipelines with data observability
  • Program your own data quality monitors from scratch
  • Develop and lead data quality initiatives at your company
  • Generate a dashboard to highlight your company's key data assets
  • Automate data lineage graphs across your data ecosystem
  • Build anomaly detectors for your critical data assets

Table of contents

  1. 1. Why Data Quality Deserves Attention—Now
    1. What Is Data Quality?
    2. Framing the Current Moment
      1. Understanding the “Rise of Data Downtime”
      2. Other Industry Trends Contributing to the Current Moment
    3. Summary
  2. 2. Architecting for Data Reliability
    1. Measuring and Maintaining High Data Reliability at Ingestion
    2. Measuring and Maintaining Data Quality in the Pipeline
    3. Understanding Data Quality Downstream
    4. Building Your Data Platform
      1. Data Ingestion
      2. Data Storage and Processing
      3. Data Transformation and Modeling
      4. Business Intelligence and Analytics
      5. Data Discovery and Governance
    5. Developing Trust in Your Data
      1. Data Observability
      2. Measure the Cost of Broken Data
      3. How to Set SLAs, SLOs, and SLIs for Your Data
      4. Case Study: Blinkist
    6. Summary
  3. 3. Fixing Data Quality Issues at Scale
    1. Fixing Quality Issues in Software Development
    2. Data Incident Management
      1. Incident Detection
      2. Response
      3. Root Cause Analysis
      4. Resolution
      5. Blameless Post-mortem
    3. Proactive Incident Prevention
      1. Testing
      2. Installing Circuit Breakers
      3. Establish a Routine of Incident Management
    4. Case Study: Data Incident Management at PagerDuty
      1. The DataOps Landscape at PagerDuty
      2. Data Challenges at PagerDuty
      3. Using DevOps Best Practices to Scale Data Incident Management
    5. Summary
  4. 4. Preventing Broken Data Systems
    1. Understanding the Difference Between Operational and Analytical Data
    2. What Makes Them Different?
    3. Data warehouses vs. data lakes
      1. Data warehouses: table types at the schema level
      2. Data lakes: manipulations at the file level
      3. What about the data lakehouse?
      4. Syncing data between warehouses and lakes
    4. Collecting data quality metrics
      1. What are data quality metrics?
      2. How to pull data quality metrics
      3. Example: Pulling data quality metrics from Snowflake
      4. Using query logs to understand data quality in the warehouse
      5. Using query logs to understand data quality in the lake
    5. Designing a data catalog
      1. Building a data catalog
    6. Summary
  5. 5. Collecting, Cleaning, Transforming, and Testing Data
    1. Data Collecting
      1. Application Log Data
      2. API Responses
      3. Sensor Data
    2. Data Cleaning
    3. Batch vs. Stream Processing
      1. Data quality for stream processing
      2. AWS Kinesis
      3. Apache Kafka
    4. Data Normalization (or “Operational Data Transformations”)
      1. Handling heterogeneous data sources
      2. Warehouse data vs. lake data: heterogeneity edition
      3. Schema checking and type coercion
      4. Syntactic vs. semantic ambiguity in data
      5. Operational data transformations
    5. Running Analytical Data Transformations
      1. Ensuring Data Quality During ETL
      2. Ensuring data quality during transformation
    6. Summary
  6. 6. Monitoring and anomaly detection for your data pipelines
    1. Monitoring and Anomaly Detection: The Basics
      1. Why Anomaly Detection is Easy: The Central Limit Theorem
      2. Why Anomaly Detection is Hard: Anomalous vs. Interesting
      3. Error Signals: False Negatives and False Positives
      4. Precision and Recall
      5. F-Scores
      6. Does Accuracy Matter?
      7. Defining the basics of monitoring & anomaly detection
    2. Anomaly detection techniques for data pipelines
      1. Common frameworks
      2. Experiment tracking
      3. Hyperparameter tuning and search
      4. Designing data quality monitors for warehouses vs. lakes
    3. Building monitors for freshness and distribution
      1. Freshness
      2. Distribution
      3. Building monitors for schema and lineage
      4. Anomaly detection for schema changes and lineage
    4. Visualizing lineage
    5. Investigating a data anomaly
    6. Scaling anomaly detection with Python and machine learning
    7. Improving data monitoring alerting with machine learning
      1. Precision and recall
      2. Balancing precision and recall
      3. Detecting freshness incidents with data monitoring
    8. Summary
  7. 7. Democratizing Data Quality
    1. Treating your “Data” like a Product
      1. Perspectives on treating data like a product
      2. Applying the Data-as-a-Product approach
    2. Building Trust in your Data Platform
      1. Align your product’s goals with the goals of the business
      2. Gain feedback and buy-in from the right stakeholders
      3. Prioritize long-term growth and sustainability vs. short-term gains
      4. Sign-off on baseline metrics for your data and how you measure it
      5. Know when to build vs. buy
    3. Assigning Ownership for Data Quality
      1. Chief Data Officer
      2. Business Intelligence Analyst
      3. Analytics Engineer
      4. Data Scientist
      5. Data Governance Lead
      6. Data Engineer
      7. Data Product Manager
    4. Who is responsible for data reliability?
    5. Creating Accountability for Data Quality
    6. Balancing data accessibility with trust
    7. Certifying your data
    8. 6 steps to implementing a data certification program
      1. Step 1: Build out your data observability capabilities
      2. Step 2: Determine your data owners
      3. Step 3: Understand what “good” data looks like
      4. Step 4: Set clear SLAs for your most important data sets
      5. Step 5: Develop your communication and incident management processes
      6. Step 6: Determine a mechanism to tag the data as certified
      7. Step 7: Train your data team and downstream consumers
    9. Case Study: Toast’s Journey to Finding the Right Structure for their Data Team
      1. In the beginning: when a small team struggles to meet data demands
      2. Supporting hypergrowth as a decentralized data operation
      3. Regrouping, re-centralizing, and refocusing on data trust
      4. Considerations when scaling your data team
    10. Increasing data literacy
    11. Prioritizing data governance and compliance
      1. Prioritizing a data catalog
      2. Beyond catalogs: enforcing data governance
      3. Building a data quality strategy
    12. Summary
  8. 8. Building a data reliability workflow
    1. Implementing the DataOps lifecycle
    2. The DataOps framework
      1. Five best practices of DataOps
      2. Three under the radar ways organizations can benefit from DataOps
    3. Assembling a data reliability workflow
      1. Testing
      2. Continuous Integration (CI) / Continuous Delivery (CD)
      3. Data Observability
      4. Data discovery
    4. Building End-to-End Field Level Lineage for Modern Data Systems
      1. Basic lineage requirements
      2. Data lineage design
      3. Parsing the data
      4. Building the user interface
    5. Case Study: Architecting for Data Reliability at Fox
      1. Exercise “Controlled Freedom” when dealing with stakeholders
      2. Invest in a decentralized data team
      3. Avoid shiny new toys in favor of problem-solving tech
      4. To make analytics self-serve, invest in data trust
    6. Summary
  9. 9. Data Quality in the Real World: Conversations and Case Studies
    1. Building a data mesh for greater data quality
      1. Domain-oriented data owners and pipelines
      2. Self-serve functionality
      3. Interoperability and standardization of communications
    2. Why implement a data mesh?
      1. To mesh or not to mesh: that is the question
      2. Calculating your data mesh score
    3. A conversation with Zhamak Dehghani: The role of data quality across the data mesh
      1. Can you build a data mesh from a single solution?
      2. Is data mesh another word for data virtualization?
      3. Does each data product team manage their own separate data stores?
      4. Is a self-serve data platform the same thing as a decentralized data mesh?
      5. Is the data mesh right for all data teams?
      6. Does one person on your team “own” the data mesh?
      7. Does the data mesh cause friction between data engineers and data analysts?
    4. Case Study: Kolibri Games’ Data Stack Journey
      1. 2016: First Data Needs
      2. 2017: Pursuing Performance Marketing
      3. 2017: Data Tech Stack
      4. 2018: Professionalize and Centralize
      5. 2018: Data Tech Stack
      6. 2019: Getting Data-Oriented
      7. 2020: Getting Data-Driven
      8. 2021: Building a Data Mesh
      9. 5 Key Takeaways from a 5-Year Data Evolution
      10. Knowledge graphs: the key to more accessible data
      11. Unlocking the value of metadata with data discovery
      12. Data warehouse / lake considerations
      13. Deciding when to get started with data quality at your company
      14. Data quality starts with trust
    5. Summary
  10. About the Authors

