Automating Data Quality Monitoring

by Jeremy Stanley, Paige Schwartz

January 2024

Intermediate to advanced

220 pages

6h 3m

English

O'Reilly Media, Inc.


Who Should Use This Book; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
High-Quality Data Is the New Gold; Data-Driven Companies Are Today’s Disrupters; Data Analytics Is Democratized; AI and Machine Learning Are Differentiators; Companies Are Investing in a Modern Data Stack; More Data, More Problems; Issues Inside the Data Factory; Data Migrations; Third-Party Data Sources; Company Growth and Change; Exogenous Factors; Why We Need Data Quality Monitoring; Data Scars; Data Shocks; Automating Data Quality Monitoring: The New Frontier
Monitoring Requirements; Data Observability: Necessary, but Not Sufficient; Traditional Approaches to Data Quality; Manual Data Quality Detection; Rule-Based Testing; Metrics Monitoring; Automating Data Quality Monitoring with Unsupervised Machine Learning; What Is Unsupervised Machine Learning?; An Analogy: Lane Departure Warnings; The Limits of Automation; A Four-Pillar Approach to Data Quality Monitoring
Assessing Your Data; Volume; Variety; Velocity; Veracity; Special Cases; Assessing Your Industry; Regulatory Pressure; AI/ML Risks; Data as a Product; Assessing Your Data Maturity; Assessing Benefits to Stakeholders; Engineers; Data Leadership; Scientists; Consumers; Conducting an ROI Analysis; Quantitative Measures; Qualitative Measures; Conclusion
Requirements; Sensitivity; Specificity; Transparency; Scalability; Nonrequirements; Data Quality Monitoring Is Not Outlier Detection; ML Approach and Algorithm; Data Sampling; Feature Encoding; Model Development; Model Explainability; Putting It Together with Pseudocode; Other Applications; Conclusion
Data Challenges and Mitigations; Seasonality; Time-Based Features; Chaotic Tables; Updated-in-Place Tables; Column Correlations; Model Testing; Injecting Synthetic Issues; Benchmarking; Improving the Model; Conclusion
How Notifications Facilitate Data Issue Response; Triage; Routing; Resolution; Documentation; Taking Action Without Notifications; Anatomy of a Notification; Visualization; Actions; Text Description; Who Created/Last Edited the Check; Delivering Notifications; Notification Audience; Notification Channels; Notification Timing; Avoiding Alert Fatigue; Scheduling Checks in the Right Order; Clustering Alerts Using Machine Learning; Suppressing Notifications; Automating the Root Cause Analysis; Conclusion
Monitoring Your Data Stack; Data Warehouses; Integrating with Data Warehouses; Security; Reconciling Data Across Multiple Warehouses; Data Orchestrators; Integrating with Orchestrators; Data Catalogs; Integrating with Catalogs; Data Consumers; Analytics and BI Tools; MLOps; Conclusion
Build Versus Buy; Vendor Deployment Models; Configuration; Determining Which Tables Are Most Important; Deciding What Data in a Table to Monitor; Configuration at Scale; Enablement; User Roles and Permissions; Onboarding, Training, and Support; Improving Data Quality Over Time; Initiatives; Metrics; From Chaos to Clarity

Table Issues; Late Arrival; Schema Changes; Untraceable Changes; Row Issues; Incomplete Rows; Duplicate Rows; Temporal Inconsistency; Value Issues; Missing Values; Incorrect Values; Invalid Values; Multi Issues; Relational Failures; Inconsistent Sources

Content preview from Automating Data Quality Monitoring

Chapter 4. Automating Data Quality Monitoring with Machine Learning

Machine learning is a statistical approach that has many advantages over rule-based testing and metrics monitoring: it’s scalable, it can detect unknown-unknown changes, and, at the risk of anthropomorphizing, it’s smart. It can learn from prior inputs, use contextual information to minimize false positives, and understand your data better and better over time.

In the previous chapters, we’ve explored when and how automation with ML makes sense for your data quality monitoring strategy. Now it’s time to explore the core mechanism: how you can train, develop, and use a model to detect data quality issues—and even explain aspects like their severity and where they occur in your data.

In this chapter, we’ll explain which machine learning approach works best for data quality monitoring and show you the algorithm (series of steps) you can follow to implement this approach. We’ll answer questions like how much data you should sample and how to make the model’s outputs explainable. One important caveat: following the steps here won’t result in a model that’s ready to monitor real-world data. In Chapter 5, we’ll turn to the practical aspects of tuning and testing your system so that it functions reliably in an enterprise setting.

Requirements

There are many ML techniques you could potentially apply to a given problem. To figure out the right approach for your use case, it’s essential to define ...