Date: 2022-09-25
Book description
Do your product dashboards look funky? Are your quarterly reports stale? Is the dataset you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to any of the questions above, this book is for you.
Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck from the data reliability company Monte Carlo explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.
- Build more trustworthy and reliable data pipelines
- Write scripts that run data checks and identify broken pipelines with data observability
- Program your own data quality monitors from scratch
- Develop and lead data quality initiatives at your company
- Generate a dashboard to highlight your company's key data assets
- Automate data lineage graphs across your data ecosystem
- Build anomaly detectors for your critical data assets
Table of contents
- 1. Why Data Quality Deserves Attention—Now
- 2. Architecting for Data Reliability
- Measuring and Maintaining High Data Reliability at Ingestion
- Measuring and Maintaining Data Quality in the Pipeline
- Understanding Data Quality Downstream
- Building Your Data Platform
- Developing Trust in Your Data
- Summary
- 3. Fixing Data Quality Issues at Scale
- Fixing Quality Issues in Software Development
- Data Incident Management
- Proactive Incident Prevention
- Case Study: Data Incident Management at PagerDuty
- Summary
- 4. Preventing Broken Data Systems
- Understanding the Difference Between Operational and Analytical Data
- What Makes Them Different?
- Data warehouses vs. data lakes
- Collecting data quality metrics
- Designing a data catalog
- Summary
- 5. Collecting, Cleaning, Transforming, and Testing Data
- Data Collecting
- Data Cleaning
- Batch vs. Stream Processing
- Data Normalization (or “Operational Data Transformations”)
- Running Analytical Data Transformations
- Summary
- 6. Monitoring and anomaly detection for your data pipelines
- Monitoring and Anomaly Detection: The Basics
- Anomaly detection techniques for data pipelines
- Building monitors for freshness and distribution
- Visualizing lineage
- Investigating a data anomaly
- Scaling anomaly detection with Python and machine learning
- Improving data monitoring alerting with machine learning
- Summary
- 7. Democratizing Data Quality
- Treating your “Data” like a Product
- Building Trust in your Data Platform
- Assigning Ownership for Data Quality
- Who is responsible for data reliability?
- Creating Accountability for Data Quality
- Balancing data accessibility with trust
- Certifying your data
- 6 steps to implementing a data certification program
- Step 1: Build out your data observability capabilities
- Step 2: Determine your data owners
- Step 3. Understand what “good” data looks like
- Step 4: Set clear SLAs for your most important data sets
- Step 5: Develop your communication and incident management processes
- Step 6: Determine a mechanism to tag the data as certified
- Step 7: Train your data team and downstream consumers
- Case Study: Toast’s Journey to Finding the Right Structure for their Data Team
- Increasing data literacy
- Prioritizing data governance and compliance
- Summary
- 8. Building a data reliability workflow
- Implementing the DataOps lifecycle
- The DataOps framework
- Assembling a data reliability workflow
- Building End-to-End Field Level Lineage for Modern Data Systems
- Case Study: Architecting for Data Reliability at Fox
- Summary
- 9. Data Quality in the Real World: Conversations and Case Studies
- Building a data mesh for greater data quality
- Why implement a data mesh?
- A conversation with Zhamak Dehghani: The role of data quality across the data mesh
- Can you build a data mesh from a single solution?
- Is data mesh another word for data virtualization?
- Does each data product team manage their own separate data stores?
- Is a self-serve data platform the same thing as a decentralized data mesh?
- Is the data mesh right for all data teams?
- Does one person on your team “own” the data mesh?
- Does the data mesh cause friction between data engineers and data analysts?
- Case Study: Kolibri Games’ Data Stack Journey
- 2016: First Data Needs
- 2017: Pursuing Performance Marketing
- 2017: Data Tech Stack
- 2018: Professionalize and Centralize
- 2018: Data Tech Stack
- 2019: Getting Data-Oriented
- 2020: Getting Data-Driven
- 2021: Building a Data Mesh
- 5 Key Takeaways from a 5-Year Data Evolution
- Knowledge graphs: the key to more accessible data
- Unlocking the value of metadata with data discovery
- Data warehouse / lake considerations
- Deciding when to get started with data quality at your company
- Data quality starts with trust
- Summary
- About the Authors
