Chapter 4. Monitoring and Anomaly Detection for Your Data Pipelines
Imagine that you’ve just purchased a new car. Based on the routine prepurchase check, all systems are working according to the manual, the oil and brake fluid tanks are filled nearly to the brim, and the parts are good as new—because, well, they are.
After grabbing the keys from your dealer, you hit the road. “There’s nothing like that new car smell!” you think as you pull onto the highway. Everything is fine and dandy until you hear a loud pop. Yikes. And your car starts to wobble. You pull onto the shoulder, turn on your hazard lights, and jump out of the car. After a brief investigation, you identify the apparent culprit of the loud sound: a flat tire. No matter how many tests or checks your dealership could have done to validate the health of your car, there’s no accounting for unknown unknowns (e.g., nails or debris on the highway) that might affect your vehicle.
Similarly, in data, all of the testing and data quality checks under the sun can’t fully protect you from data downtime, which can manifest at any stage of the pipeline and surface for a variety of reasons that are often unrelated to the data itself.
When it comes to understanding when data breaks, your best course of action is to lean on monitoring, specifically anomaly detection techniques that identify when volume, freshness, distribution, and other metrics fall outside the thresholds you expect.
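To make the idea concrete, here is a minimal sketch in Python of two such checks: one for volume and one for freshness. It is an illustration under stated assumptions, not the approach this chapter necessarily takes. The function names, the z-score cutoff of 3.0, the 24-hour freshness window, and the sample row counts are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def volume_is_anomalous(history, today_count, z_threshold=3.0):
    """Flag today's row count if it sits more than z_threshold sample
    standard deviations away from the mean of past counts.

    The z-score rule and the default cutoff are illustrative assumptions.
    """
    if len(history) < 2:
        # Too little history to estimate spread, so raise no alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change counts as an anomaly.
        return today_count != mu
    return abs(today_count - mu) / sigma > z_threshold


def is_stale(last_updated, now, max_age=timedelta(hours=24)):
    """Flag a table whose most recent update is older than max_age.

    The 24-hour default is an assumption, not a recommendation.
    """
    return now - last_updated > max_age


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 990]

typical_day = volume_is_anomalous(daily_row_counts, 1003)
broken_day = volume_is_anomalous(daily_row_counts, 120)
stale = is_stale(datetime(2024, 1, 1, 6), now=datetime(2024, 1, 2, 12))

print(typical_day, broken_day, stale)
```

With this made-up history, a count of 1003 stays well inside the expected range, while a drop to 120 rows is flagged, as is a table last updated 30 hours ago. Fixed cutoffs like these are the simplest starting point; real pipelines often need thresholds that account for seasonality and trend.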