Chapter 4. Automating Data Quality Monitoring with Machine Learning
Machine learning is a statistical approach that has many advantages over rule-based testing and metrics monitoring: it’s scalable, it can detect unknown-unknown changes, and, at the risk of anthropomorphizing, it’s smart. It can learn from prior inputs, use contextual information to minimize false positives, and understand your data better and better over time.
In the previous chapters, we’ve explored when and how automation with ML makes sense for your data quality monitoring strategy. Now it’s time to explore the core mechanism: how you can train, develop, and use a model to detect data quality issues—and even explain aspects like their severity and where they occur in your data.
In this chapter, we’ll explain which machine learning approach works best for data quality monitoring and show you the algorithm (series of steps) you can follow to implement this approach. We’ll answer questions like how much data you should sample and how to make the model’s outputs explainable. One important caveat: following the steps here won’t, on its own, result in a model that’s ready to monitor real-world data. In Chapter 5, we’ll turn to the practical aspects of tuning and testing your system so that it functions reliably in an enterprise setting.
Requirements
There are many ML techniques you could potentially apply to a given problem. To figure out the right approach for your use case, it’s essential to define ...