Automating Data Quality Monitoring

by Jeremy Stanley, Paige Schwartz
January 2024
Intermediate to advanced
220 pages
6h 3m
English
O'Reilly Media, Inc.

Chapter 4. Automating Data Quality Monitoring with Machine Learning

Machine learning is a statistical approach that, compared to rule-based testing and metrics monitoring, has many advantages: it’s scalable, can detect unknown-unknown changes, and, at the risk of anthropomorphizing, it’s smart. It can learn from prior inputs, use contextual information to minimize false positives, and actually understand your data better and better over time.

In the previous chapters, we’ve explored when and how automation with ML makes sense for your data quality monitoring strategy. Now it’s time to explore the core mechanism: how you can train, develop, and use a model to detect data quality issues—and even explain aspects like their severity and where they occur in your data.

In this chapter, we’ll explain which machine learning approach works best for data quality monitoring and show you the algorithm (series of steps) you can follow to implement this approach. We’ll answer questions like how much data you should sample and how to make the model’s outputs explainable. One caveat: following the steps here won’t, on its own, produce a model that’s ready to monitor real-world data. In Chapter 5, we’ll turn to the practical aspects of tuning and testing your system so that it functions reliably in an enterprise setting.

Requirements

There are many ML techniques you could potentially apply to a given problem. To figure out the right approach for your use case, it’s essential to define ...

Publisher Resources

ISBN: 9781098145927