Chapter 4. Monitoring and Anomaly Detection for Your Data Pipelines

Imagine that you’ve just purchased a new car. Based on the routine prepurchase check, all systems are working according to the manual, the oil and brake fluid tanks are filled nearly to the brim, and the parts are good as new—because, well, they are.

After grabbing the keys from your dealer, you hit the road. “There’s nothing like that new car smell!” you think as you pull onto the highway. Everything is fine and dandy until you hear a loud pop. Yikes. And your car starts to wobble. You pull onto the shoulder, turn on your hazard lights, and jump out of the car. After a brief investigation, you’ve identified the alleged culprit of the loud sound—a flat tire. No matter how many tests or checks your dealership could have done to validate the health of your car, there’s no accounting for unknown unknowns (e.g., nails or debris on the highway) that might affect your vehicle.

Similarly, in data, all of the testing and data quality checks under the sun can’t fully protect you from data downtime, which can manifest at all stages of the pipeline and surface for a variety of reasons that are often unaffiliated with the data itself.

When it comes to understanding when data breaks, your best course of action is to lean on monitoring, specifically anomaly detection techniques that flag when volume, freshness, distribution, and other values fall outside expected thresholds.

Anomaly detection refers to the identification of events or observations that deviate from the norm—for instance, fraudulent credit card behavior or a technical glitch, like a website crash. Assuming your website is normally up and running, of course.

A number of techniques, algorithms, and frameworks exist and are used (and developed) by industry giants like Meta, Google, Uber, and others. For a technical deep dive, we recommend Preetam Jinka and Baron Schwartz’s report Anomaly Detection for Monitoring (O’Reilly).

Up until recently, anomaly detection was considered a nice-to-have—not a need-to-have—for many data teams. Now, as data systems become increasingly complex and companies empower employees across functions to use data, it’s imperative that teams take both proactive and reactive approaches to solving for data quality.

While automobiles are vastly different from data pipelines, cars and other mechanical systems have their own monitoring and anomaly detection capabilities, too. Most contemporary vehicles alert you when oil, brake fluid, gas, tire pressure, and other vital entities are lower than they should be and encourage you to take action. Data monitoring and anomaly detection function in much the same way.

In this chapter, we’ll walk through how to build your own data quality monitors for a data warehouse environment to monitor and alert to the pillars of data observability: freshness, volume, distribution, and schema. In the process, we’ll introduce important concepts and terms necessary to bulk up your understanding of important anomaly detection techniques.

Knowing Your Known Unknowns and Unknown Unknowns

There are two types of data quality issues in this world: those you can predict (known unknowns) and those you can’t (unknown unknowns). Known unknowns are issues that you can easily predict, such as null values, specific freshness issues, or schema changes triggered by a system that updates regularly. These issues may never materialize, but with a healthy dose of testing, you can often account for them before they cause problems downstream. In Figure 4-1, we highlight popular examples of both.

Unknown unknowns refer to data downtime that even the most comprehensive testing can’t account for, issues that arise across your entire data pipeline, not just the sections covered by specific tests. Unknown unknowns might include:

  • A distribution anomaly in a critical field that causes your Tableau dashboard to malfunction

  • A JSON schema change made by another team that turns 6 columns into 600

  • An unintended change to ETL (or reverse ETL, if you fancy) leading to tests not running and bad data being missed

  • Incomplete or stale data that goes unnoticed until several weeks later, affecting key marketing metrics

  • A code change that causes an API to stop collecting data feeding an important new product

  • Data drift over time, which can be challenging to catch, particularly if your tests look only at the data being written at the time of your ETL jobs rather than at the data that is already in a given table

Figure 4-1. Examples of known unknowns and unknown unknowns

While testing and circuit breakers can handle many of your known unknowns, monitoring and anomaly detection can cover your bases when it comes to unknown unknowns.

Frequently, data teams leverage monitoring and anomaly detection to identify and alert to data behavior that deviates from what’s historically expected of a given data pipeline. By understanding what “good” data looks like, it’s easier to proactively identify “bad” data.

Now that we’ve outlined the differences between these two types of data issues, let’s dive into what anomaly detection for unknown unknowns looks like in practice.

Building an Anomaly Detection Algorithm

To crystallize how anomaly detection works, let’s walk through a real-world tutorial in building an anomaly detector for a very anomalous data set.

Keep in mind that there are any number of technologies and approaches you can use to build data quality monitors, and the choices you make will depend on your tech stack. In this example, we leverage the following languages and tools:

  • SQLite and SQL

  • Jupyter Notebooks

  • Python

Our sample data ecosystem uses mock astronomical data about habitable exoplanets. For the purpose of this exercise, we generated the data set with Python, modeling anomalies from real incidents we’ve come across in production environments. This data set is entirely free to use, and the utils folder in the repository contains the code that generated the data, if you’re interested in learning more about how it was assembled.

We’ll use SQLite 3.32.3, which should make the database accessible from either the command prompt or SQL files with minimal setup. The concepts extend to virtually any query language, and these implementations can be ported to MySQL, Snowflake, and other database environments with minimal changes.

In the following, we share table information about our EXOPLANETS data set, including five specific database entries:

$ sqlite3 EXOPLANETS.db
sqlite> PRAGMA TABLE_INFO(EXOPLANETS);
_id            | TEXT | 0 | | 0  1
distance       | REAL | 0 | | 0  2
g              | REAL | 0 | | 0  3
orbital_period | REAL | 0 | | 0  4
avg_temp       | REAL | 0 | | 0  5
date_added     | TEXT | 0 | | 0  6

A database entry in EXOPLANETS contains the following info:

  1. _id: a UUID corresponding to the planet

  2. distance: distance from Earth, in light-years

  3. g: surface gravity as a multiple of g, the gravitational acceleration at Earth’s surface

  4. orbital_period: length of a single orbital cycle in days

  5. avg_temp: average surface temperature in kelvins

  6. date_added: the date our system discovered the planet and added it automatically to our databases

Note that one or more of distance, g, orbital_period, and avg_temp may be NULL for a given planet as a result of missing or erroneous data.

If we query sqlite> SELECT * FROM EXOPLANETS LIMIT 5; we can pull five rows from our database. In Example 4-1, we share five database entries in our EXOPLANETS data set, to highlight the format and distribution of the data.

Example 4-1. Five rows from the EXOPLANETS data set
_id,distance,g,orbital_period,avg_temp,date_added
c168b188-ef0c-4d6a-8cb2-f473d4154bdb,34.6273036348341,,476.480044083599, ...
e7b56e84-41f4-4e62-b078-01b076cea369,110.196919810563,2.52507362359066, ...
a27030a0-e4b4-4bd7-8d24-5435ed86b395,26.6957950454452,10.2764970016067, ...
54f9cf85-eae9-4f29-b665-855357a14375,54.8883521129783,,173.788967912197, ...
4d06ec88-f5c8-4d03-91ef-7493a12cd89e,153.264217159834,0.922874568459221, ...
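If you prefer to poke at the table from a Jupyter Notebook rather than the SQLite prompt, a minimal sketch like the following pulls the same sample rows into a pandas DataFrame; the database path is an assumption about where you cloned the repository:

import sqlite3
import pandas as pd

# Connect to the SQLite database used throughout this chapter.
conn = sqlite3.connect("EXOPLANETS.db")  # path is an assumption

# Pull the same five sample rows shown in Example 4-1.
sample = pd.read_sql_query("SELECT * FROM EXOPLANETS LIMIT 5;", conn)
print(sample)

conn.close()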

Note that this exercise is retroactive—we’re looking at historical data. In a production data environment, anomaly detection happens in real time and is applied at each stage of the data life cycle, and thus will involve a slightly different implementation than what is done here.

For the purpose of this exercise, we’ll build data observability algorithms for freshness and distribution; later in the chapter, we’ll address the rest of our five pillars, and more.

Monitoring for Freshness

The first pillar of data observability we monitor for is freshness, which can give us a strong indicator of when critical data assets were last updated. If a report that is regularly updated on the hour suddenly looks very stale, this type of anomaly should give us a strong indication that something is inaccurate or otherwise wrong.

First, note the DATE_ADDED column. SQL doesn’t store metadata on when individual records are added. So, to visualize freshness in this retroactive setting, we need to track that information ourselves. Grouping by the DATE_ADDED column can give us insight into how EXOPLANETS updates daily. As depicted in Example 4-2, we can query for the number of new IDs added per day.

Example 4-2. A query about the number of new exoplanets added to our data set per day
SELECT
  DATE_ADDED,
  COUNT(*) AS ROWS_ADDED
FROM
  EXOPLANETS
GROUP BY
  DATE_ADDED;

You can run this yourself with $ sqlite3 EXOPLANETS.db < queries/freshness/rows-added.sql in the repository. We get the data in Example 4-3 back.

Example 4-3. Data pulled from Example 4-2
date_added     ROWS_ADDED
2020-01-01     84
2020-01-02     92
2020-01-03     101
2020-01-04     102
2020-01-05     100
... ...
2020-07-14     104
2020-07-15     110
2020-07-16     103
2020-07-17     89
2020-07-18     104

Based on this data, it looks like EXOPLANETS consistently updates with around 100 new entries each day, though there are gaps where no data comes in for multiple days.
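If you’d like to see this pattern as a chart in a Jupyter Notebook, similar to the figures in this chapter, a minimal sketch with pandas and matplotlib might look like the following; treat the database path as an assumption:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("EXOPLANETS.db")  # path is an assumption
rows_added = pd.read_sql_query(
    "SELECT DATE_ADDED, COUNT(*) AS ROWS_ADDED FROM EXOPLANETS GROUP BY DATE_ADDED;",
    conn,
    parse_dates=["DATE_ADDED"],
)

# Missing bars correspond to days when no new data arrived.
plt.bar(rows_added["DATE_ADDED"], rows_added["ROWS_ADDED"])
plt.xlabel("date_added")
plt.ylabel("rows added")
plt.show()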

Recall that with freshness, we want to ask the question “Is my data up to date?”—thus, knowing about those gaps in table updates is essential to understanding the reliability of our data. The following query, Example 4-4, operationalizes freshness (as depicted in Figure 4-2) by introducing a metric for DAYS_SINCE_LAST_UPDATE. (Note: since this tutorial uses SQLite3, the SQL syntax for calculating time differences will be different in MySQL, Snowflake, and other environments.)

Example 4-4. Query that pulls the number of days since the data set was updated
WITH UPDATES AS(
  SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
)
  
SELECT
  DATE_ADDED,
  JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED) OVER(
    ORDER BY DATE_ADDED
  )) AS DAYS_SINCE_LAST_UPDATE
FROM
  UPDATES;
Figure 4-2. Rendering freshness patterns within our data set using a Jupyter Notebook

The resulting table, Example 4-5, says, “On date X, the most recent data in EXOPLANETS was Y days old.” This is information not explicitly available from the DATE_ADDED column in the table—but applying data observability gives us the tools to uncover it. This is visualized in Figure 4-3, where freshness anomalies are depicted by the high Y values. This denotes table update lags, which we can query for with a simple detector.

Example 4-5. Exoplanet data freshness table from query in Example 4-4
DATE_ADDED     DAYS_SINCE_LAST_UPDATE
2020-01-01     
2020-01-02     1
2020-01-03     1
2020-01-04     1
2020-01-05     1
...            ...
2020-07-14     1
2020-07-15     1
2020-07-16     1
2020-07-17     1
2020-07-18     1
Figure 4-3. Visualization of freshness anomalies depicted by high Y values

Now, we have the data we need to detect freshness anomalies. All that’s left to do is to set a threshold parameter for Y—how many days old is too many? A parameter turns a query, Example 4-6, into a detector, since it decides what counts as anomalous (read: worth alerting) and what doesn’t.

Example 4-6. Modified query to alert to data that sits beyond expected freshness for exoplanet data
WITH UPDATES AS(
  SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
),
  
NUM_DAYS_UPDATES AS (
  SELECT
    DATE_ADDED,
    JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED)
      OVER(
        ORDER BY DATE_ADDED
      )
    ) AS DAYS_SINCE_LAST_UPDATE
  FROM
    UPDATES
)
  
SELECT
  *
FROM
  NUM_DAYS_UPDATES
WHERE
  DAYS_SINCE_LAST_UPDATE > 1;

The data returned to us, Example 4-7, represents dates where freshness incidents occurred.

Example 4-7. Data returned from Example 4-6 query
DATE_ADDED     DAYS_SINCE_LAST_UPDATE
2020-02-08     8
2020-03-30     4
2020-05-14     8
2020-06-07     3
2020-06-17     5
2020-06-30     3

On 2020-05-14, the most recent data in the table was 8 days old! Such an outage may represent a breakage in our data pipeline and would be good to know about if we’re using this data for anything high impact (and if we’re using this in a production environment, chances are, we are). As illustrated in Figure 4-4, we can render freshness anomalies by setting thresholds for what is an acceptable amount of time since the last update.

Figure 4-4. Visualization of freshness anomalies using thresholds

Note in particular the last line of the query: DAYS_SINCE_LAST_UPDATE > 1;.

Here, 1 is a model parameter—there’s nothing “correct” about this number, though changing it will impact what dates we consider to be incidents. The smaller the number, the more genuine anomalies we’ll catch (high recall), but chances are, several of these “anomalies” will not reflect real outages. The larger the number, the greater the likelihood all anomalies we catch will reflect true anomalies (high precision), but it’s possible we may miss some.

For the purpose of this example, we could change 1 to 7 and thus catch only the two worst outages (on 2020-02-08 and 2020-05-14). Any choice here will reflect the particular use case and objectives; it is an important balance to strike that comes up again and again when applying data observability at scale to production environments.
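One way to build intuition for this balance is to sweep the threshold and see how many incidents each value surfaces. Here is a minimal Python sketch that reuses the DAYS_SINCE_LAST_UPDATE logic from Example 4-4; the database path is again an assumption:

import sqlite3
import pandas as pd

FRESHNESS_QUERY = """
WITH UPDATES AS (
  SELECT DATE_ADDED, COUNT(*) AS ROWS_ADDED
  FROM EXOPLANETS
  GROUP BY DATE_ADDED
)
SELECT
  DATE_ADDED,
  JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
    AS DAYS_SINCE_LAST_UPDATE
FROM UPDATES;
"""

conn = sqlite3.connect("EXOPLANETS.db")  # path is an assumption
updates = pd.read_sql_query(FRESHNESS_QUERY, conn)

# Count how many dates each candidate threshold would flag as incidents.
for threshold in (1, 3, 5, 7):
    n_incidents = int((updates["DAYS_SINCE_LAST_UPDATE"] > threshold).sum())
    print(f"DAYS_SINCE_LAST_UPDATE > {threshold}: {n_incidents} incidents")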

In Figure 4-5, we leverage the same freshness detector but with the SQLite query DAYS_SINCE_LAST_UPDATE > 3; serving as the threshold. Two of the smaller outages now go undetected.

Figure 4-5. Narrowing the search for anomalies (DAYS_SINCE_LAST_UPDATE > 3)

Next, we visualize the same freshness detector, but with DAYS_SINCE_LAST_UPDATE > 7; serving as the threshold. All but the two largest outages now go undetected (Figure 4-6).

Just like planets, optimal model parameters sit in a “Goldilocks Zone” or “sweet spot” between values considered too low and too high.

Figure 4-6. Further narrowing the search for anomalies (DAYS_SINCE_LAST_UPDATE > 7)

Understanding Distribution

Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “How often is my data NULL?” In many cases, some level of incomplete data is acceptable—but if a 10% null rate turns into 90%, we’ll want to know.

In statistics, we like to assume that sets of observations are drawn from baseline distributions that obey mathematical rules. Call the former “sample distributions” and the latter “true distributions.” Statistics also gives us the central limit theorem, which states that the average (or sum) of many independently generated random samples approaches a normal (Gaussian) distribution as the number of samples gets large.

Applying the Gaussian distribution may prompt an initial approach to anomaly detection that’s quite naive but surprisingly effective: calculating the standard score for each observation. That is, subtract the mean μ from each observation and divide by the standard deviation σ. This score (also called the z-score) gives a quantifiable metric for how “far out” (on the bell curve) each observation is. Anomaly detection: solved! Just draw a line at some point out from the center of the bell and call everything outside that line “anomalous.” From a statistical standpoint, you’ll be correct. Unfortunately, statistical theory isn’t a compelling approach to anomaly detection in the very concrete field of data quality, for two reasons.
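Before we get to those reasons, here is what that naive z-score detector looks like as a minimal Python sketch; the daily row counts below are made-up values purely for illustration:

import numpy as np

def z_score_anomalies(values, threshold=3.0):
    """Return indices of observations more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z_scores = (values - values.mean()) / values.std()
    return np.where(np.abs(z_scores) > threshold)[0]

# Mostly ~100 rows per day, plus one obvious outlier on day 5.
daily_rows = [104, 98, 101, 97, 103, 0, 99, 102]
print(z_score_anomalies(daily_rows, threshold=2.0))  # flags index 5, the day with 0 rows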

First, the central limit theorem states a key characteristic of the data generating process that many people overlook: independent, random observations yield normal distributions in the limit. This is a great assumption to make when measuring the volume of wind through grass, or the stride length of the average New Yorker. It’s not so great for business intelligence data, where observations tend to be highly correlated and confounded with other variables. For example, “daily customers” will not be normally distributed at Chick-Fil-A, which closes on Sundays, since 1/7th of all observations will be 0. These observations are not generated randomly but are instead impacted by the day of the week.

Second, there’s a distinction between “anomalous” and “interesting” observations that can’t quite be captured with purely statistical thinking. To illustrate this, consider the z-score, as discussed a few paragraphs earlier. We said (in jest) that anomaly detection can be solved with a simple z-score; unfortunately, that’s rarely the case.

If we choose to define “anomaly” as anything, say, three standard deviations from the distribution’s mean, we can be guaranteed to get that “correct” for any data. But we’re not just in the business of identifying merely anomalous metrics. For one, our time series contain important contextual information (What day of the week was it? Does the pattern repeat?). More important, not all anomalous observations are interesting—they don’t help us identify and correct for data downtime. Example 4-8 queries data with an anomalous distribution.

Example 4-8. Query to pull data about anomalous distributions
SELECT
  DATE_ADDED,
  CAST(
    SUM(
      CASE
        WHEN DISTANCE IS NULL THEN 1
        ELSE 0
      END
    ) AS FLOAT) / COUNT(*) AS DISTANCE_NULL_RATE,
  CAST(
    SUM(
      CASE
        WHEN G IS NULL THEN 1
        ELSE 0
      END
    ) AS FLOAT) / COUNT(*) AS G_NULL_RATE,
  CAST(
    SUM(
      CASE
        WHEN ORBITAL_PERIOD IS NULL THEN 1
        ELSE 0
      END
    ) AS FLOAT) / COUNT(*) AS ORBITAL_PERIOD_NULL_RATE,
  CAST(
    SUM(
      CASE
        WHEN AVG_TEMP IS NULL THEN 1
        ELSE 0
      END
    ) AS FLOAT) / COUNT(*) AS AVG_TEMP_NULL_RATE
FROM
  EXOPLANETS
GROUP BY
  DATE_ADDED;

This query returns a lot of data, as depicted in Example 4-9.

Example 4-9. Data from Example 4-8 query
date_added     DISTANCE_NULL_RATE    G_NULL_RATE          ORBITAL_PERIOD_NULL_RATE
2020-01-01     0.0833333333333333    0.178571428571429    0.214285714285714
2020-01-02     0.0                   0.152173913043478    0.326086956521739
2020-01-03     0.0594059405940594    0.188118811881188    0.237623762376238
2020-01-04     0.0490196078431373    0.117647058823529    0.264705882352941
...            ...                   ...                  ...
2020-07-13     0.0892857142857143    0.160714285714286    0.285714285714286
2020-07-14     0.0673076923076923    0.125                0.269230769230769
2020-07-15     0.0636363636363636    0.118181818181818    0.245454545454545
2020-07-16     0.058252427184466     0.145631067961165    0.262135922330097
2020-07-17     0.101123595505618     0.0898876404494382   0.247191011235955
2020-07-18     0.0673076923076923    0.201923076923077    0.317307692307692

The general formula CAST(SUM(CASE WHEN SOME_METRIC IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*), when grouped by the DATE_ADDED column, is telling us the rate of NULL values for SOME_METRIC in the daily batches of new data in EXOPLANETS. It’s hard to get a sense by looking at the raw output, but a visual (Figure 4-8) can help illuminate this anomaly.

Figure 4-8. By rendering various events triggered by null rates, we can clearly see which dates were anomalous

The visuals make it clear that there are null rate “spike” events we should be detecting. Let’s focus on just the last metric, AVG_TEMP, for now. We can detect null spikes most basically with a simple threshold via the query in Example 4-10.

Example 4-10. Detecting null values in the AVG_TEMP column of the EXOPLANETS data set
WITH NULL_RATES AS(
  SELECT
    DATE_ADDED,
    CAST(
      SUM(
        CASE
          WHEN AVG_TEMP IS NULL THEN 1
          ELSE 0
        END
      ) AS FLOAT) / COUNT(*) AS AVG_TEMP_NULL_RATE 
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
)
  
SELECT
  *
FROM
  NULL_RATES
WHERE
  AVG_TEMP_NULL_RATE  > 0.9;

In Example 4-11, we share the corresponding data pulled in its raw form, illustrating the rows with null values in the AVG_TEMP column of the data set.

Example 4-11. AVG_TEMP rows with null values
DATE_ADDED     AVG_TEMP_NULL_RATE
2020-03-09     0.967391304347826
2020-06-02     0.929411764705882
2020-06-03     0.977011494252874
2020-06-04     0.989690721649485
2020-06-07     0.987804878048781
2020-06-08     0.961904761904762

In Figure 4-9, we highlight where the anomalous spikes were, correlating to the rate of null values in the temperature column of our EXOPLANETS data set.

As detection algorithms go, this approach to identifying null values is something of a blunt instrument. Sometimes, patterns in our data will be simple enough for a threshold like this to do the trick. In other cases, though, data will be noisy or have other complications, like seasonality, requiring us to change our approach.

Figure 4-9. Detecting null spikes in the average temperature
Note

Seasonality refers to the tendency of a time series to observe predictable fluctuations over certain intervals. For example, data for “church attendees” might observe a weekly seasonality with a high bias toward Sunday, and data for a department store’s coat sales would likely observe yearly seasonality with a high in fall and a low in spring.

For example, detecting 2020-06-02, 2020-06-03, and 2020-06-04 seems redundant. We can filter out dates that occur immediately after other alerts to reduce duplication via the query in Example 4-12.

Example 4-12. Query to filter out dates that occur immediately after other alerts
WITH NULL_RATES AS(
  SELECT
    DATE_ADDED,
    CAST(
      SUM(
        CASE
          WHEN AVG_TEMP IS NULL THEN 1
          ELSE 0
        END
      ) AS FLOAT
    ) / COUNT(*) AS AVG_TEMP_NULL_RATE
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
),
  
ALL_DATES AS (
  SELECT
    *,
    JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED)
      OVER(
        ORDER BY DATE_ADDED
      )
    ) AS DAYS_SINCE_LAST_ALERT
  FROM
    NULL_RATES
  WHERE
    AVG_TEMP_NULL_RATE > 0.9
)
  
SELECT
  DATE_ADDED,
  AVG_TEMP_NULL_RATE
FROM
  ALL_DATES
WHERE
  DAYS_SINCE_LAST_ALERT IS NULL OR DAYS_SINCE_LAST_ALERT > 1;

The corresponding data set is listed in Example 4-13. These are the alert dates that remain after filtering out days that immediately follow another alert, per the query in Example 4-12.

Example 4-13. Results of Example 4-12 query
DATE_ADDED     AVG_TEMP_NULL_RATE
2020-03-09     0.967391304347826
2020-06-02     0.929411764705882
2020-06-07     0.987804878048781

Note that in both of these queries, the key parameter is 0.9. We’re effectively saying, “Any null rate higher than 90% is a problem, and I need to know about it.” We visualize these results in Figure 4-10. This helps us reduce noise and generate more accurate results.

In this instance, we can (and should) be a bit more intelligent by applying a rolling average with a smarter parameter, using the query in Example 4-14 to improve precision further.

Figure 4-10. Visualizing any null rates higher than 90%
Example 4-14. Query to apply a rolling average to the null rate
WITH NULL_RATES AS(
  SELECT
    DATE_ADDED,
    CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT) / 
      COUNT(*) AS AVG_TEMP_NULL_RATE
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
),
  
NULL_WITH_AVG AS(
  SELECT
    *,
    AVG(AVG_TEMP_NULL_RATE) OVER (
      ORDER BY DATE_ADDED ASC
      ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS TWO_WEEK_ROLLING_AVG
  FROM
    NULL_RATES
  GROUP BY
    DATE_ADDED
)
  
SELECT
  *
FROM
  NULL_WITH_AVG
WHERE
  AVG_TEMP_NULL_RATE - TWO_WEEK_ROLLING_AVG > 0.3;

The query’s results are shown in Example 4-15 and depicted in Figure 4-11. We see the dates where the null rate spikes well above its trailing two-week average, which are the alerts most worth raising.

Example 4-15. Results from Example 4-14 query
DATE_ADDED     AVG_TEMP_NULL_RATE    TWO_WEEK_ROLLING_AVG
2020-03-09     0.967391304347826     0.436077995611105
2020-06-02     0.929411764705882     0.441299602441599
2020-06-03     0.977011494252874     0.47913211475687
2020-06-04     0.989690721649485     0.515566041654715
2020-06-07     0.987804878048781     0.554753033524633
2020-06-08     0.961904761904762     0.594966974173356
Figure 4-11. Using the quantity AVG_TEMP_NULL_RATE - TWO_WEEK_ROLLING_AVG to get even more specific when identifying the null value rate

One clarification: notice that we filter using the quantity AVG_TEMP_NULL_RATE - TWO_WEEK_ROLLING_AVG. In other instances, we might want to take the ABS() of this error quantity, but not here—the reason being that a null rate “spike” is much more alarming if it represents an increase from the previous average. It may not be worthwhile to monitor whenever nulls abruptly decrease in frequency, while the value in detecting a null rate increase is clear.
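If you’d rather keep this detector in a notebook, the same rolling-average logic ports naturally to pandas. The sketch below mirrors the SQL above (a trailing window of 15 daily rows, flagging days where the null rate exceeds that average by more than 0.3); the database path is an assumption:

import sqlite3
import pandas as pd

conn = sqlite3.connect("EXOPLANETS.db")  # path is an assumption
null_rates = pd.read_sql_query(
    """
    SELECT
      DATE_ADDED,
      CAST(SUM(CASE WHEN AVG_TEMP IS NULL THEN 1 ELSE 0 END) AS FLOAT)
        / COUNT(*) AS AVG_TEMP_NULL_RATE
    FROM EXOPLANETS
    GROUP BY DATE_ADDED;
    """,
    conn,
)

# Trailing two-week rolling mean: 15 rows, i.e., 14 preceding plus the current row.
null_rates["TWO_WEEK_ROLLING_AVG"] = (
    null_rates["AVG_TEMP_NULL_RATE"].rolling(window=15, min_periods=1).mean()
)

# Flag days where the null rate jumps more than 0.3 above its recent average.
spikes = null_rates[
    null_rates["AVG_TEMP_NULL_RATE"] - null_rates["TWO_WEEK_ROLLING_AVG"] > 0.3
]
print(spikes)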

Building Monitors for Schema and Lineage

In the previous section, we looked at the first two pillars of data observability, freshness and distribution, and showed how a little SQL code can operationalize these concepts. These are what we’d call more “classic” data anomaly detection problems—given a steady stream of data, does anything look out of whack?

Good anomaly detection is certainly part of the data observability puzzle, but it’s not everything. Equally important is context. If a data anomaly occurred, great. But where? What upstream pipelines may be the cause? What downstream dashboards will be affected by a data anomaly? And has the formal structure of my data changed? Good data observability hinges on our ability to properly leverage metadata to answer these data anomaly questions.

In our next section, we’ll look at the two data observability pillars designed to answer these questions—schema and lineage. Once again, we’ll use lightweight tools like Jupyter and SQLite, so you can easily spin up our environment and try these data anomaly exercises yourself. Let’s get started.

Anomaly Detection for Schema Changes and Lineage

As before, we’ll work with mock astronomical data about habitable exoplanets. It looks like our oldest data is dated 2020-01-01 (note: most databases will not store timestamps for individual records, so our DATE_ADDED column is keeping track for us). Our newest data looks to be from 2020-07-18:

sqlite> SELECT DATE_ADDED FROM EXOPLANETS ORDER BY DATE_ADDED DESC LIMIT 1; 
    2020-07-18

Of course, this is the same table we used in the previous section. If we want to explore the more context-laden pillars of schema and lineage, we’ll need to expand our environment.

Now, in addition to EXOPLANETS, we have a table called EXOPLANETS_EXTENDED, which is a superset of our past table. It’s useful to think of these as the same table at different moments in time. In fact, EXOPLANETS_EXTENDED has data dating back to 2020-01-01:

sqlite> SELECT DATE_ADDED FROM EXOPLANETS_EXTENDED ORDER BY DATE_ADDED ASC LIMIT 1; 
    2020-01-01

But it also contains data up to 2020-09-06, further than EXOPLANETS:

sqlite> SELECT DATE_ADDED FROM EXOPLANETS_EXTENDED ORDER BY DATE_ADDED DESC LIMIT 1; 
    2020-09-06

Something else is different between these tables, as depicted in Example 4-16: there are two additional fields, widening the surface area for anomalies.

Example 4-16. Two additional fields in EXOPLANETS_EXTENDED data set
sqlite> PRAGMA TABLE_INFO(EXOPLANETS_EXTENDED);
_ID             | VARCHAR(16777216)  | 1 | | 0
DISTANCE        | FLOAT              | 0 | | 0
G               | FLOAT              | 0 | | 0
ORBITAL_PERIOD  | FLOAT              | 0 | | 0
AVG_TEMP        | FLOAT              | 0 | | 0
DATE_ADDED      | TIMESTAMP_NTZ(6)   | 1 | | 0
ECCENTRICITY    | FLOAT              | 0 | | 0  1
ATMOSPHERE      | VARCHAR(16777216)  | 0 | | 0  2

In addition to the six fields in EXOPLANETS, the EXOPLANETS_EXTENDED table contains two additional fields:

  1. ECCENTRICITY: the orbital eccentricity of the planet around its host star

  2. ATMOSPHERE: the dominant chemical makeup of the planet’s atmosphere

Note that like DISTANCE, G, ORBITAL_PERIOD, and AVG_TEMP, both ECCENTRICITY and ATMOSPHERE may be NULL for a given planet as a result of missing or erroneous data. For example, rogue planets have undefined orbital eccentricity, and many planets don’t have atmospheres at all.

Note also that data is not backfilled, meaning data entries from the beginning of the table (data contained also in the EXOPLANETS table) will not have eccentricity and atmosphere information. In Example 4-17, we share a query that highlights how older data is not backfilled, which helps illustrate the schema change that took place.

Example 4-17. Query highlighting that older data is not backfilled
SELECT
 DATE_ADDED,
 ECCENTRICITY,
 ATMOSPHERE
FROM
 EXOPLANETS_EXTENDED
ORDER BY
 DATE_ADDED ASC
LIMIT 10;

The results, depicted in Example 4-18, show that these new fields are empty for the oldest records.

Example 4-18. Addition of two new columns, signaling a schema change in our EXOPLANETS data set
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |
2020-01-01 | |

The addition of two fields is an example of a schema change—our data’s formal blueprint has been modified. Schema changes occur when an alteration is made to the structure of your data, and it’s a data anomaly that can be frustrating to manually debug. Schema changes can indicate any number of things about your data, including:

  • The addition of new API endpoints

  • Supposedly deprecated fields that are not yet deprecated

  • The addition or subtraction of columns, rows, or entire tables

In an ideal world, we’d like a record of this change, as it represents a vector for possible issues with our pipeline. Unfortunately, our database is not naturally configured to keep track of such changes: it has no versioning history, so a schema change can easily sneak up on us. Example 4-19 shows the schema of a simple table we’ll add to track it ourselves.

Example 4-19. Schema of the EXOPLANETS_COLUMNS versioning table
sqlite> PRAGMA TABLE_INFO(EXOPLANETS_COLUMNS);
DATE    | TEXT | 0 | | 0
COLUMNS | TEXT | 0 | | 0

We ran into this issue when querying for the age of individual records and added the DATE_ADDED column to cope. In this case, we’ll do something similar, except with the addition of an entire table.

The EXOPLANETS_COLUMNS table “versions” our schema by recording the columns in EXOPLANETS_EXTENDED at any given date. Looking at the very first and last entries, shown in Example 4-20, we see that the columns definitely changed at some point: two new columns were added to our EXOPLANETS data set—in other words, a schema change.

Example 4-20. Two entries highlighting a schema change
sqlite> SELECT * FROM EXOPLANETS_COLUMNS ORDER BY DATE ASC LIMIT 1;
2020-01-01 | [
   (0, '_id', 'TEXT', 0, None, 0),
   (1, 'distance', 'REAL', 0, None, 0),
   (2, 'g', 'REAL', 0, None, 0),
   (3, 'orbital_period', 'REAL', 0, None, 0),
   (4, 'avg_temp', 'REAL', 0, None, 0),
   (5, 'date_added', 'TEXT', 0, None, 0)
 ]
 
sqlite> SELECT * FROM EXOPLANETS_COLUMNS ORDER BY DATE DESC LIMIT 1;
2020-09-06 | [
   (0, '_id', 'TEXT', 0, None, 0),
   (1, 'distance', 'REAL', 0, None, 0),
   (2, 'g', 'REAL', 0, None, 0),
   (3, 'orbital_period', 'REAL', 0, None, 0),
   (4, 'avg_temp', 'REAL', 0, None, 0),
   (5, 'date_added', 'TEXT', 0, None, 0),
   (6, 'eccentricity', 'REAL', 0, None, 0),
   (7, 'atmosphere', 'TEXT', 0, None, 0)
 ]

Now, returning to our original question: when, exactly, did the schema change? Since our column lists are indexed by dates, we can find the date of the change and a good clue for where anomalies lie with a quick SQL script, as depicted in Example 4-21.

Example 4-21. A query of the extended EXOPLANETS table to showcase when schema for the data set changed
WITH CHANGES AS(
 SELECT
   DATE,
   COLUMNS AS NEW_COLUMNS,
   LAG(COLUMNS) OVER(ORDER BY DATE) AS PAST_COLUMNS
 FROM
   EXOPLANETS_COLUMNS
)
 
SELECT
   *
FROM
   CHANGES
WHERE
   NEW_COLUMNS != PAST_COLUMNS
ORDER BY
   DATE ASC;

Example 4-22 includes the data returned, which we’ve reformatted for legibility. Looking at the data, we see that the schema changed on 2020-07-19.

Example 4-22. Results pulled from the query in Example 4-21
DATE:          2020-07-19
NEW_COLUMNS:  [
               (0, '_id', 'TEXT', 0, None, 0),
               (1, 'distance', 'REAL', 0, None, 0),
               (2, 'g', 'REAL', 0, None, 0),
               (3, 'orbital_period', 'REAL', 0, None, 0),
               (4, 'avg_temp', 'REAL', 0, None, 0),
               (5, 'date_added', 'TEXT', 0, None, 0),
               (6, 'eccentricity', 'REAL', 0, None, 0),
               (7, 'atmosphere', 'TEXT', 0, None, 0)
          ]
PAST_COLUMNS: [
               (0, '_id', 'TEXT', 0, None, 0),
               (1, 'distance', 'REAL', 0, None, 0),
               (2, 'g', 'REAL', 0, None, 0),
               (3, 'orbital_period', 'REAL', 0, None, 0),
               (4, 'avg_temp', 'REAL', 0, None, 0),
               (5, 'date_added', 'TEXT', 0, None, 0)
          ]

With this query, we return the offending date: 2020-07-19. Like freshness and distribution observability, achieving schema observability follows a pattern: we identify the useful metadata that signals pipeline health, track it, and build detectors to alert us of potential issues. Supplying an additional table like EXOPLANETS_COLUMNS is one way to track schema, but there are many others. We encourage you to think about how you could implement a schema change detector for your own data pipeline!
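As a starting point, here is a minimal Python sketch of one such detector: it snapshots PRAGMA TABLE_INFO on each run and compares it against the previous snapshot. The snapshot file, its format, and the table name are illustrative assumptions, not a prescription:

import json
import sqlite3
from datetime import date

def current_columns(conn, table):
    """Return the table's columns as reported by PRAGMA TABLE_INFO."""
    rows = conn.execute(f"PRAGMA TABLE_INFO({table});").fetchall()
    return [[cid, name, col_type] for cid, name, col_type, *_ in rows]

def record_and_compare(conn, table, snapshot_path="schema_snapshot.json"):
    """Persist today's schema and report any difference from the previous snapshot."""
    columns = current_columns(conn, table)
    try:
        with open(snapshot_path) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = None

    with open(snapshot_path, "w") as f:
        json.dump({"date": str(date.today()), "columns": columns}, f)

    if previous is not None and previous["columns"] != columns:
        print(f"Schema change detected in {table} since {previous['date']}!")

conn = sqlite3.connect("EXOPLANETS.db")  # path is an assumption
record_and_compare(conn, "EXOPLANETS_EXTENDED")

Run on a schedule (say, once per day), this would have caught the addition of eccentricity and atmosphere the day it happened.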

Visualizing Lineage

We’ve described lineage as the most holistic of the five pillars of data observability, and for good reason. Lineage contextualizes incidents by telling us (1) which downstream sources may be impacted, and (2) which upstream sources may be the root cause. While it’s not intuitive to “visualize” lineage with SQL code, a quick example may illustrate how it can be useful. (In Chapter 6, we’ll teach you how to build your own field-level lineage system from scratch using common open source frameworks.)

To demonstrate how this works, let’s add another table to our database. So far, we’ve been recording data on exoplanets. Here’s one fun question to ask: how many of these planets may harbor life?

The HABITABLES table combines data from EXOPLANETS with other characteristics to help us answer that question, as showcased in Example 4-23.

Example 4-23. HABITABLES provides information on whether the planets listed in EXOPLANETS are habitable
sqlite> PRAGMA TABLE_INFO(HABITABLES);
_id           | TEXT | 0 | | 0  1
perihelion    | REAL | 0 | | 0  2
aphelion      | REAL | 0 | | 0  3
atmosphere    | TEXT | 0 | | 0  4
habitability  | REAL | 0 | | 0  5
min_temp      | REAL | 0 | | 0  6
max_temp      | REAL | 0 | | 0  7
date_added    | TEXT | 0 | | 0  8

An entry in HABITABLES contains the following:

  1. _id: a UUID corresponding to the planet

  2. perihelion: the closest distance to the celestial body during an orbital period

  3. aphelion: the furthest distance to the celestial body during an orbital period

  4. atmosphere: the dominant chemical makeup of the planet’s atmosphere

  5. habitability: a real number between 0 and 1, indicating how likely the planet is to harbor life

  6. min_temp: the minimum temperature on the planet’s surface

  7. max_temp: the maximum temperature on the planet’s surface

  8. date_added: the date our system discovered the planet and added it automatically to our databases

Like the columns in EXOPLANETS, values for perihelion, aphelion, atmosphere, min_temp, and max_temp are allowed to be NULL. In fact, perihelion and aphelion will be NULL for any _id in EXOPLANETS where eccentricity is NULL, since you use orbital eccentricity to calculate these metrics. This explains why these two fields are always NULL in our older data entries.

To get a sense of the data in HABITABLES, we can use the following query to render the output in Example 4-24:

sqlite> SELECT * FROM HABITABLES LIMIT 5;
Example 4-24. Output of query to get a sense for the most habitable exoplanets
_id,perihelion,aphelion,atmosphere,habitability,min_temp,max_temp,date_added
c168b188-ef0c-4d6a-8cb2-f473d4154bdb,,,,0.291439672855434,,,2020-01-01
e7b56e84-41f4-4e62-b078-01b076cea369,,,,0.835647137991933,,,2020-01-01
a27030a0-e4b4-4bd7-8d24-5435ed86b395,,,,0.894000806332343,,,2020-01-01
54f9cf85-eae9-4f29-b665-855357a14375,,,,0.41590200852556,103.71374885412 ...
4d06ec88-f5c8-4d03-91ef-7493a12cd89e,,,,0.593524201489497,,,2020-01-01

So, we know that HABITABLES depends on the values in EXOPLANETS (or, equally, EXOPLANETS_EXTENDED), and EXOPLANETS_COLUMNS does as well. A dependency graph of our database is depicted in Figure 4-12.

Figure 4-12. Dependency graph depicting the lineage between the source data and downstream “products”
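One lightweight way to make this graph actionable is to encode it in code and traverse it when an incident fires. In the sketch below the dependencies are hardcoded as an assumption; in practice you would derive them from query logs or your transformation tool:

# Upstream dependencies: table -> the tables it reads from.
UPSTREAM = {
    "EXOPLANETS_EXTENDED": [],
    "EXOPLANETS_COLUMNS": ["EXOPLANETS_EXTENDED"],
    "HABITABLES": ["EXOPLANETS_EXTENDED"],
}

def downstream_of(table):
    """Return every table that directly or indirectly depends on `table`."""
    impacted = set()
    frontier = [table]
    while frontier:
        current = frontier.pop()
        for candidate, sources in UPSTREAM.items():
            if current in sources and candidate not in impacted:
                impacted.add(candidate)
                frontier.append(candidate)
    return impacted

# If EXOPLANETS_EXTENDED has an incident, which tables should we check next?
print(downstream_of("EXOPLANETS_EXTENDED"))  # e.g., {'HABITABLES', 'EXOPLANETS_COLUMNS'}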

Very simple lineage information, but already useful. Let’s look at a data anomaly in HABITABLES in the context of this graph, and see what we can learn.

Investigating a Data Anomaly

When we have a key metric, like habitability in HABITABLES, we can assess the health of that metric in several ways. For a start, what is the average value of habitability for new data on a given day? In Example 4-25, we query the average value of habitability for new exoplanet data.

Example 4-25. Query to pull average habitability value for new exoplanet data
SELECT
  DATE_ADDED,
  AVG(HABITABILITY) AS AVG_HABITABILITY
FROM
  HABITABLES
GROUP BY
  DATE_ADDED;

Example 4-26 is the CSV file generated by the query.

Example 4-26. Results from Example 4-25 query
DATE_ADDED,AVG_HABITABILITY
2020-01-01,0.435641365919993
2020-01-02,0.501288741945045
2020-01-03,0.512285861062438
2020-01-04,0.525461586113648
2020-01-05,0.528935065722722
...,...
2020-09-02,0.234269938329633
2020-09-03,0.26522042788867
2020-09-04,0.267919611991401
2020-09-05,0.298614978406792
2020-09-06,0.276007150628875

Looking at this data, we see that something is wrong. It looks like we have a data anomaly. The average value for habitability is normally around 0.5, but it halves to around 0.25 later in the recorded data (Figure 4-13).

Figure 4-13. Visualizing the CSV file to get a better understanding of where the data anomaly occurred—and why

In Figure 4-13, we can clearly see that this issue is a distributional data anomaly, but what exactly is going on? In other words, what is the root cause of this data anomaly?

Why don’t we look at the null rate for habitability, like we did when we were detecting distribution anomalies earlier in the chapter? We can do this by leveraging the query in Example 4-27, which pulls the null rate for the habitability field, clueing us in to possible data anomalies.

Example 4-27. Null rate query for new data set
SELECT
 DATE_ADDED,
 CAST(
   SUM(
    CASE
    WHEN HABITABILITY IS NULL THEN 1
    ELSE 0
    END
   ) AS FLOAT) / COUNT(*) AS HABITABILITY_NULL_RATE
FROM
 HABITABLES
GROUP BY
 DATE_ADDED;

Fortunately, nothing looks out of character here, as highlighted in the results in Example 4-28.

Example 4-28. Results of Example 4-27 query
DATE_ADDED,HABITABILITY_NULL_RATE
2020-01-01,0.0
2020-01-02,0.0
2020-01-03,0.0
2020-01-04,0.0
2020-01-05,0.0
...,...
2020-09-02,0.0
2020-09-03,0.0
2020-09-04,0.0
2020-09-05,0.0
2020-09-06,0.0

As you can see in Example 4-28, this doesn’t look promising as the cause of our issue. What if we looked at another distributional health metric, the rate of zero values? This is another potential root cause of a distribution anomaly. Let’s run another query, as shown in Example 4-29, to help us do exactly that.

Example 4-29. Query to understand the rate of zero values
SELECT
 DATE_ADDED,
 CAST(
   SUM(
    CASE
    WHEN HABITABILITY IS 0 THEN 1
    ELSE 0
    END
   ) AS FLOAT) / COUNT(*) AS HABITABILITY_ZERO_RATE
FROM
 HABITABLES
GROUP BY
 DATE_ADDED;

Something is evidently more amiss here, as evidenced by the CSV file depicted in Example 4-30. On recent dates, a large share of habitability values are zero, which could be the root cause of our data anomaly.

Example 4-30. Results from our query in Example 4-29
DATE_ADDED,HABITABILITY_ZERO_RATE
2020-01-01,0.0
2020-01-02,0.0
2020-01-03,0.0
2020-01-04,0.0
2020-01-05,0.0
...,...
2020-09-02,0.442307692307692
2020-09-03,0.441666666666667
2020-09-04,0.466666666666667
2020-09-05,0.46218487394958
2020-09-06,0.391304347826087

In Figure 4-14, we visualize the results of our zero-rate query; this illustrates the anomalous results in August and September 2020.

Figure 4-14. Visualizing zero value rates and the probable root cause of the anomaly

We can adapt one of the distribution detectors we built earlier in the chapter to get the first date of appreciable zero rates in the habitability field, as depicted in Example 4-31.

Example 4-31. Query for first date of zero rates in habitability field
WITH HABITABILITY_ZERO_RATES AS(
  SELECT
    DATE_ADDED,
    CAST(
      SUM(
        CASE
          WHEN HABITABILITY IS 0 THEN 1
          ELSE 0
        END
      ) AS FLOAT) / COUNT(*) AS HABITABILITY_ZERO_RATE
  FROM
    HABITABLES
  GROUP BY
    DATE_ADDED
),
  
CONSECUTIVE_DAYS AS(
SELECT
  DATE_ADDED,
  HABITABILITY_ZERO_RATE,
  LAG(HABITABILITY_ZERO_RATE) OVER(ORDER BY DATE_ADDED) 
    AS PREV_HABITABILITY_ZERO_RATE
FROM
  HABITABILITY_ZERO_RATES
)
  
SELECT
  *
FROM
  CONSECUTIVE_DAYS
WHERE
  PREV_HABITABILITY_ZERO_RATE = 0 AND
  HABITABILITY_ZERO_RATE != 0;

We can then run this query through the command line in Example 4-32, which will fetch the first date of appreciable zeros in the habitability field.

Example 4-32. Command-line interface running the query in Example 4-31
$ sqlite3 EXOPLANETS.db < queries/lineage/habitability-zero-rate-detector.sql
DATE_ADDED | HABITABILITY_ZERO_RATE | PREV_HABITABILITY_ZERO_RATE
2020-07-19 | 0.369047619047619 | 0.0

2020-07-19 was the first date the zero rate began showing anomalous results. Recall that this is the same day as the schema change detection in EXOPLANETS_EXTENDED. EXOPLANETS_EXTENDED is upstream from HABITABLES, so it’s very possible that these two incidents are related.

In this way, lineage information can help us identify the root cause of incidents and move more quickly toward resolving them. Compare the two following explanations for this incident in HABITABLES:

  1. On 2020-07-19, the zero rate of the habitability column in the HABITABLES table jumped from 0% to 37%.

  2. On 2020-07-19, we began tracking two additional fields, eccentricity and atmosphere, in the EXOPLANETS table. This had an adverse effect on the downstream table HABITABLES, often setting the fields min_temp and max_temp to extreme values whenever eccentricity was not NULL. In turn, this caused the zero rate of the habitability field to spike, which we detected as an anomalous decrease in the average value.

Let’s break these explanations down. Explanation 1 uses just the fact that a data anomaly took place. Explanation 2 uses lineage, in terms of dependencies between both tables and fields, to put the incident in context and determine the root cause. Everything in the second explanation is actually correct, and we encourage you to mess around with the environment to understand for yourself what’s going on. While these are just simple examples, an engineer equipped with Explanation 2 would be faster to understand and resolve the underlying issue, and this is all owed to proper observability.

Tracking schema changes and lineage can give you unprecedented visibility into the health and usage patterns of your data, providing vital contextual information about who, what, where, why, and how your data was used. In fact, schema and lineage are the two most important data observability pillars when it comes to understanding the downstream (and often real-world) implications of data downtime.

Scaling Anomaly Detection with Python and Machine Learning

At a high level, machine learning is instrumental for data observability and data monitoring at scale. Detectors outfitted with machine learning can apply more flexibly to larger numbers of tables, eliminating the need for manual checks and rules as your data warehouse or lake grows. Also, machine learning detectors can learn and adapt to data in real time and can capture complicated seasonal patterns that otherwise would be invisible to human eyes. Let’s dive in—no prior machine learning experience required.

As you may recall from the previous two sections of this exercise, we’re working again with mock astronomical data about habitable exoplanets. Now, we’re going to restrict our attention to the EXOPLANETS table again, as we did earlier in the chapter, to better understand how to scale anomaly detection with machine learning, depicted in Example 4-33.

Example 4-33. Our trusty EXOPLANETS data set
$ sqlite3 EXOPLANETS.db
sqlite> PRAGMA TABLE_INFO(EXOPLANETS);
_id             | TEXT | 0 | | 0
distance        | REAL | 0 | | 0
g               | REAL | 0 | | 0
orbital_period  | REAL | 0 | | 0
avg_temp        | REAL | 0 | | 0
date_added      | TEXT | 0 | | 0

Note that EXOPLANETS is configured to manually track an important piece of metadata—the date_added column—which records the date our system discovered the planet and added it automatically to our databases. To detect for freshness and distribution anomalies, we used a simple SQL query to visualize the number of new entries added per day, as highlighted in Example 4-34.

Example 4-34. Query to pull the number of new EXOPLANETS entries added per day
SELECT
 DATE_ADDED,
 COUNT(*) AS ROWS_ADDED
FROM
 EXOPLANETS
GROUP BY
 DATE_ADDED;

This query yields a seemingly healthy set of data, as depicted in Example 4-35. But is there more we should know?

Example 4-35. Results of Example 4-34 (which look entirely standard)
date_added,ROWS_ADDED
2020-01-01,84
2020-01-02,92
2020-01-03,101
2020-01-04,102
2020-01-05,100
...,...
2020-07-14,104
2020-07-15,110
2020-07-16,103
2020-07-17,89
2020-07-18,104

These results are visualized in Figure 4-15.

Figure 4-15. Visualizing the number of rows added per day for a given month

In words, the EXOPLANETS table routinely updates with around 100 entries per day, but goes “offline” on some days when no data is entered, as depicted in Figure 4-15. We introduced a metric called DAYS_SINCE_LAST_UPDATE to track this aspect of the table via our anomaly detection query template, as depicted in Example 4-36. This will tell us how many days it has been since the EXOPLANETS data set was updated, between distinct entries.

Example 4-36. Query on how many days since EXOPLANETS data set was updated
WITH UPDATES AS(
  SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
)
  
SELECT
  DATE_ADDED,
  JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED) OVER(
    ORDER BY DATE_ADDED
  )) AS DAYS_SINCE_LAST_UPDATE
FROM
  UPDATES;

The results are listed in a CSV file, depicted in Example 4-37, and visualized in Figure 4-16. We see a list of dates with new data entries.

Example 4-37. Results from Example 4-36
DATE_ADDED,DAYS_SINCE_LAST_UPDATE
2020-01-01,
2020-01-02,1
2020-01-03,1
2020-01-04,1
2020-01-05,1
...,...
2020-07-14,1
2020-07-15,1
2020-07-16,1
2020-07-17,1
2020-07-18,1

In Figure 4-16, we can clearly see that there were some dates in February, April, May, June, and July 2020 where data was not added to our EXOPLANETS data set, signaling an anomaly.

Figure 4-16. Using a freshness anomaly detection query, we can identify when the data goes “offline”

With a small modification, we introduced a threshold parameter to our query to create a freshness detector, which allows us to further refine our anomaly detection. Our detector returns all dates where the newest data in EXOPLANETS was older than one day, as highlighted in Example 4-38.

Example 4-38. Query to identify when a column in our EXOPLANETS data set has not been updated in over one day
WITH UPDATES AS(
  SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
  FROM
    EXOPLANETS
  GROUP BY
    DATE_ADDED
),
  
NUM_DAYS_UPDATES AS (
  SELECT
    DATE_ADDED,
    JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED)
      OVER(
        ORDER BY DATE_ADDED
      )
    ) AS DAYS_SINCE_LAST_UPDATE
  FROM
    UPDATES
)
  
SELECT
  *
FROM
  NUM_DAYS_UPDATES
WHERE
  DAYS_SINCE_LAST_UPDATE > 1;

The CSV file generated by this query is depicted in Example 4-39, highlighting freshness anomalies.

Example 4-39. Results of Example 4-38 query
DATE_ADDED,DAYS_SINCE_LAST_UPDATE
2020-02-08,8
2020-03-30,4
2020-05-14,8
2020-06-07,3
2020-06-17,5
2020-06-30,3

In Figure 4-17, we can clearly visualize the specific dates when our data set was collecting stale data, likely from an exoplanet orbiter or other space probe.

Figure 4-17. Visualizing dates when the table was collecting “stale” data, indicating data downtime

The spikes in Figure 4-17 represent instances where the EXOPLANETS table was working with old or “stale” data. In some cases, such outages may be standard operating procedure—maybe our telescope was due for maintenance, so no data was recorded over a weekend. In other cases, though, an outage may represent a genuine problem with data collection or transformation—maybe we changed our dates to ISO format, and the job that traditionally pushed new data is now failing. We might have the heuristic that longer outages are worse, but beyond that, how do we guarantee that we only detect the genuine issues in our data?

The short answer: you can’t. Building a perfect predictor is impossible (for any interesting prediction problem, anyway). But, we can use some concepts from machine learning to frame the problem in a more structured way and, as a result, deliver data observability and trust at scale.

Improving Data Monitoring Alerting with Machine Learning

Whenever we alert about a broken data pipeline, we have to question whether the alert was accurate. Does the alert indicate a genuine problem? We might be worried about two scenarios:

  • A data monitoring alert was issued, but there was no genuine issue. We’ve wasted the user’s time responding to the alert.

  • There was a genuine issue, but no data monitoring alert was issued. We’ve let a real problem go undetected.

These two scenarios are described as false positives (predicted anomalous, actually OK) and false negatives (predicted OK, actually anomalous), and we want to avoid them. Issuing a false positive is like crying wolf—we sounded the alarm, but all was OK. Likewise, issuing a false negative is like sleeping on guard duty—something was wrong, but we didn’t do anything.

Our goal is to avoid these circumstances as much as possible and focus on maximizing true positives (predicted anomalous, actually a problem) and true negatives (predicted OK, actually OK).

Accounting for False Positives and False Negatives

Anomaly detection is often framed as an unsupervised task. Unsupervised learning is a machine learning task where the optimal behavior is not knowable at training time; in other words, the data on which you’re training doesn’t come with labels attached. For this reason, you may be compelled to call anomaly detection unsupervised, since anomalies don’t come with a ground truth. Without a ground truth, you can’t get an error signal, that is, the difference between what you predicted and what you should have predicted.

While some anomaly detection tasks are best understood as unsupervised learning problems, it still makes sense to consider supervised error signal vocabulary like false negative, false positive, precision, etc. Otherwise, we cannot benchmark different detection algorithms against one another or have any metric for improvement and success.

For any given data point, an anomaly detector issues either an “anomalous” or a “not anomalous” prediction. Also, consider that there is some truth about the matter—the data point in question is either a genuine problem, or not a problem at all. Consider a measurement reflecting that your key analytics table has not updated once in the last three days. If your table should update hourly, this is a genuine problem!

When a data point is problematic and our detector calls it “anomalous,” we call this a true positive. When a data point is just fine and our detector doesn’t detect it (i.e., issues “not anomalous”), we call this a true negative. Table 4-1 illustrates this concept.

Table 4-1. Four possible anomaly detection outcomes
                   Predicted negative    Predicted positive
Actual negative    True negative         False positive
Actual positive    False negative        True positive

False negatives are cases where the data point was genuinely problematic, yet our detector did not detect it. A false negative detection is like a sleeping guard dog—your algorithm lets a problem go by undetected. False positives are cases where we detected an anomaly, but the point in question was not actually problematic. A false positive detection is like crying wolf—your algorithm issued an “anomalous” result, but the underlying data point was actually fine. False positives and false negatives are realities for even the most well-trained anomaly detection algorithms.

False positives and false negatives both sound bad. It seems like the best anomaly detection techniques ought to avoid them both. Unfortunately, for reasons to do with simple statistics, we can’t just avoid both. In fact, fewer false positives come at the expense of more false negatives—and vice versa.

To understand why, let’s think about the boy who cried wolf again—through an anomaly detector lens! The boy who cried wolf flags every data point as an anomaly. As a result, his detection is highly sensitive (not likely to let any false negatives slip by) but not at all specific (liable to produce lots of false positives). Data professionals dislike boy-who-cried-wolf detectors because their detections aren’t believable. When an anomaly detector with a high false positive rate fires, you’re likely to assume the alert isn’t genuine.

The sleeping guard dog is another kind of anomaly detector—actually, the opposite kind. This detector never considers data points anomalous. The resulting anomaly detection algorithm is highly specific (no false positives will be produced) but not at all sensitive (lots of false negatives will occur). Data professionals dislike sleeping-guard-dog detectors too, because their results aren’t dependable. Overly conservative detectors will never issue anomalous detections, meaning they’re bound to miss when things go really awry.

The trick, as it turns out, is to aim somewhere in the middle between these two detection schemes.

Improving Precision and Recall

For a given collection of data, once you’ve applied an anomaly detection algorithm, you’ll have a collection of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). We typically don’t just look at these “scores” by themselves—there are common statistical ways of combining them into meaningful metrics. We focus on precision and recall, accuracy metrics that quantify the anomaly detector’s performance.

Precision is defined as the rate of predicted anomalies that are genuine, so:

Precision = TPs / (TPs + FPs)

In other words: out of all the “positives” (predictions made), how many are correct?

Recall is defined as the rate of actual anomalies detected, so:

Recall = TPs / (TPs + FNs)

In other words: out of all the genuine anomalies, how many did we catch?

These terms are popular accuracy metrics for classification systems, and their names are semantically meaningful. A detector with high precision is “precise” in that when it predicts anomalies, it’s more often than not correct. Similarly, a detector with high recall “recalls” well—it catches a high rate of all the actual anomalies.

The problem, of course, is that you can’t have the best of both worlds. Notice that there’s an explicit trade-off between these two. How do we get perfect precision? Simple: alert for nothing—the guard dog sleeping on duty all the time—forcing us to have a false positive rate of 0%. The problem? Recall will be horrible, since our false negative rate will be huge.

Likewise, how do we get perfect recall? Also simple: alert for everything—crying wolf at every opportunity—forcing a false negative rate of 0%. The issue, as expected, is that our false positive rate will suffer, affecting precision.

Our world of data is run by quantifiable objectives, and in most cases we’ll want a singular objective to optimize, not two. We can combine both precision and recall into a single metric called an F-score. The general formula for nonnegative real β is:

Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

Fβ is called a weighted F-score, since different values of β weigh precision and recall differently in the calculation. In general, an Fβ-score says, “I consider recall to be β times as important as precision.”

When β = 1, the equation values each equally. Set β > 1, and recall will be more important for a higher score. In other words, β > 1 says, “I care more about catching all anomalies than occasionally causing a false alarm.” Likewise, set β < 1, and precision will be more important. β < 1 says, “I care more about my alarms being genuine than about catching every real issue.”
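Here is a minimal sketch of the weighted F-score for a few values of β, using illustrative precision and recall values of our own choosing:

def f_beta(precision: float, recall: float, beta: float) -> float:
    # Weighted F-score: recall counts beta times as much as precision.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 0.9, 0.6  # made-up values for illustration
for beta in (0.5, 1, 2):
    print(f"F{beta} = {f_beta(precision, recall, beta):.3f}")

For a precision-heavy detector like this one, the F0.5-score comes out highest and the F2-score lowest, which is exactly the behavior the weighting is meant to capture.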

There are many frameworks you can use to apply anomaly detection at scale without having to hand-code your algorithms in Python. See the following for a few of our favorites:

Facebook Prophet
A forecasting model built to handle daily, weekly, monthly, and yearly seasonalities in time series data at scale. Users can load baseline Prophet models and tweak human-interpretable model parameters, adding domain knowledge via feature augmentation. The package ships in both Python and R.
TensorFlow
A popular machine learning library for a variety of tasks, including natural language processing, computer vision, and time series anomaly detection. The package provides useful and well-documented implementations of more advanced anomaly detection algorithms. TensorFlow’s Keras API, for example, makes it straightforward to build an autoencoder model that can serve as a neural form of autoregression, more powerful than a basic autoregressive integrated moving average (ARIMA) model.
PyTorch
Developed at Facebook, this is another machine learning Python library fulfilling similar use cases to TensorFlow (which is developed by Google). PyTorch typically has higher uptake in academic settings, while TensorFlow enjoys greater popularity in industry.
scikit-learn
Another popular machine learning package with implementations of all sorts of algorithms. For anomaly detection, scikit-learn offers the k-nearest neighbors algorithm and the isolation forest algorithm, two popular methods for spotting outliers (see the short isolation forest sketch after this list). Like TensorFlow, scikit-learn is developed primarily for Python.
MLflow

A popular open source experiment tracking tool developed by Databricks. Experiment tracking refers to the process of managing machine learning models in development and production; MLflow is primarily experiment tracking and reproduction software. MLflow instances have shared model registries where experiments can be backed up and compared side by side. Each model belongs to a project, a packaged software environment designed to ensure model reproducibility, as depicted in Figure 4-18. An important aspect of developing anomaly detection software is the guarantee that the code runs the same on different machines: you don’t want to think you’ve solved a bug locally just for the fix to fail to apply in production, and if a colleague reports an accuracy metric for their updated model, you’d like to know that you could replicate their results yourself. Along with projects, the MLflow model registry assists with deploying models to production environments, including Azure ML and Amazon SageMaker, or to Spark clusters as an Apache Spark UDF.

Figure 4-18. MLflow’s model registry visualized in the data science workflow
Note

Experiment tracking, the process of managing machine learning model development and training, involves hyperparameter comparison, dependency checking, managing and orchestrating training jobs, saving model snapshots, and collecting logs—among other tasks! This can in principle be done using some incredibly complicated spreadsheets, though obviously there are better tools for the job.

TensorBoard

This is TensorFlow’s visualization toolkit, though you don’t need to build your models with TensorFlow to take advantage of it. With TensorBoard, as shown in Figure 4-19, you can visualize common machine learning metrics like loss per epoch of training, confusion matrices, and individual error analysis.

Figure 4-19. A standard TensorBoard view during model training. Source: Tran et al.1
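As a quick taste of what these frameworks look like in practice, here is a minimal sketch that uses scikit-learn’s isolation forest to flag anomalous daily row counts; the counts and parameters are made up for illustration and are not tied to the examples in this chapter:

import numpy as np
from sklearn.ensemble import IsolationForest

row_counts = [1023, 998, 1011, 1040, 1005, 120, 1019, 987, 1032, 2500]

X = np.array(row_counts).reshape(-1, 1)  # one feature: daily row count
clf = IsolationForest(contamination=0.2, random_state=42)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

for count, label in zip(row_counts, labels):
    print(f"{count:>6}  {'ANOMALY' if label == -1 else 'ok'}")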

These and other frameworks can take your anomaly detectors to the next level, reducing false negatives and false positives and cutting down on the model tuning needed over time.

Detecting Freshness Incidents with Data Monitoring

With our new vocabulary in hand, let’s return to the task of detecting freshness incidents in the EXOPLANETS table. We’re using a simple prediction algorithm, since we turned our query into a detector by setting one model parameter X. Our algorithm says, “Any outage longer than X days is an anomaly, and we will issue an alert for it.” Even in a case as simple as this, precision, recall, and F-scores can help us!

To showcase this, we took the freshness outages in EXOPLANETS and assigned ground truth labels encoding whether each outage is a genuine incident or not. It’s impossible to calculate a model’s accuracy without some kind of ground truth, so it’s always helpful to think about how you’d generate these labels for your use case. Recall that there are a total of six outages lasting for more than one day in the EXOPLANETS table, as highlighted in the data depicted in Example 4-40.

Example 4-40. Results from Example 4-38 query on outages lasting more than one day
DATE_ADDED,DAYS_SINCE_LAST_UPDATE
2020-02-08,8
2020-03-30,4
2020-05-14,8
2020-06-07,3
2020-06-17,5
2020-06-30,3

Let’s say, arbitrarily, that the incidents on 2020-02-08 and 2020-05-14 are genuine. Each is eight days long, so it makes sense that they’d be problematic. On the flip side, suppose that the outages on 2020-03-30 and 2020-06-07 are not actual incidents. These outages are four and three days long, respectively, so this is not outlandish. Finally, let the outages on 2020-06-17 and 2020-06-30, at five and three days, respectively, also be genuine incidents, as depicted in Example 4-41.

Example 4-41. Classifying the “true” anomalies
INCIDENT,NOT INCIDENT
2020-02-08 (8 days),2020-03-30 (4 days)
2020-05-14 (8 days),2020-06-07 (3 days)
2020-06-17 (5 days),
2020-06-30 (3 days),

Having chosen our ground truth in this way, we see that longer outages are more likely to be actual issues, but there’s no guarantee. This weak correlation makes for a model that is effective but imperfect, just as in more complex, real-world use cases. To measure and improve model accuracy, we need look no further than one of the most common tools in a data or ML engineer’s toolkit: the F-score.
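Before diving into F-scores, here is one way to encode the outages from Example 4-40 and the ground truth labels from Example 4-41 in Python; the variable names are ours, and we’ll reuse them shortly when we sweep the detection threshold:

# Outage lengths in days, keyed by DATE_ADDED (from Example 4-40).
outages = {
    "2020-02-08": 8,
    "2020-03-30": 4,
    "2020-05-14": 8,
    "2020-06-07": 3,
    "2020-06-17": 5,
    "2020-06-30": 3,
}

# Ground truth labels (True = genuine incident, from Example 4-41).
ground_truth = {
    "2020-02-08": True,
    "2020-03-30": False,
    "2020-05-14": True,
    "2020-06-07": False,
    "2020-06-17": True,
    "2020-06-30": True,
}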

F-Scores

F-scores are classification accuracy metrics designed to optimize jointly for both precision and recall. The “default” of these is the F1-score, defined (for the statisticians) as the harmonic mean between precision and recall:

F1 = 2 / (1/Precision + 1/Recall)

This means that the F1-score is designed to balance precision and recall equally, so we reward gains in one just as much as gains in the other. In some contexts, this kind of evaluation might be appropriate. In other cases, though, either recall or precision might matter a lot more.

A real-world example that drives home the point: on Saturday morning, January 13, 2018, Hawaiian islanders received text messages that a ballistic missile was inbound and that they should seek underground shelter immediately. The alert went out at 8:07 a.m. and ended ominously with “This is not a drill.”

Thirty-eight minutes later, after the Hawaiian telephone network and 911 emergency line had gone down from overuse, the Hawaiian state government announced that the alert had been a mistake. While one Hawaiian man suffered a heart attack upon hearing the news, there were no immediate fatalities from the event.

The alert had been intended as a test of the state’s emergency alerting system; the problem was that the system sent out a real alert in error. In our terms, this was anomaly detection gone wrong in the real world: a false positive. Scary as that was, consider the equivalent false negative (an actual inbound missile with no alert at all) and how much more severe the repercussions would be.

What does this mean for product design, and what can we do to mitigate it? In terms of what we’ve been discussing here: a false positive is better than a false negative for the missile detection system, meaning recall is more important than precision. If we’re examining the performance of a system such as this, we should use something other than the F1-score. In particular, a general Fβ score lets us say, “recall is β times as important as precision for my detector”:

Fβ = (1 + β²) / (β²/Recall + 1/Precision)

When β = 1, note that this equation comes out the same as the F1-score equation. It would also say “recall is one times as important as precision”—weighing them equally. However, if we were testing something like a missile alert system where recall was twice or three times as important, we might consider evaluating using an F2 or an F3.

Does Model Accuracy Matter?

In the past several pages, you may have noticed our sparing use of the word “accuracy.” Machine learning algorithms, anomaly detectors included, are supposed to be “accurate”—or so you’ve heard. Why aren’t we then leading with that vocabulary?

Here’s part of our answer (an example drawn from a Stanford professor, Mehran Sahami). Suppose you’re building a sophisticated machine learning anomaly detection system to test for acquired immunodeficiency syndrome (AIDS). Here’s how our super sophisticated system works: it just predicts “No” anytime you ask it if someone has AIDS. AIDS affects approximately 1.2 million people in the United States today. The US population hovers somewhere around 330 million. Our “accuracy,” or how correct we are on average, is 1 − (Americans with AIDS / Americans) = 1 − (1.2 million / 330 million) = 99.6%. That’s one of the best accuracies we’ve ever seen—surely publication worthy, cause for celebration, etc.

We hope this example illustrates the point: accuracy is not as simple as how correct your detector is on average, and it shouldn’t be defined the same way for different applications. After all, relying on accuracy alone in the preceding example would leave every actual case undiagnosed (more than a million people). At the end of the day, we want a good detection scheme to minimize both false positives and false negatives. In machine learning practice, it’s more common to think in terms of two related but more insightful metrics, precision and recall, as depicted in Figure 4-20.

Figure 4-20. Precision (how often your algorithm accurately detects an anomaly) and recall (how many of the total anomalies were caught)

As discussed earlier in the chapter, precision, generally, tells us how often we’re right when we issue an alert. Models with good precision output believable alerts, since their high precision guarantees that they cry wolf very infrequently.

Recall, generally, tells us how many issues we actually alert for. Models with good recall are dependable, since their high recall guarantees that they rarely sleep on the job.

Extending our metaphor, a model with good precision is a model that rarely cries wolf—when it issues an alert, you had better believe it. Likewise, a model with good recall is like a good guard dog—you can rest assured that this model will catch all genuine problems.

Now, suppose we begin by setting our threshold to three days—in words, “every outage longer than three days is an anomaly.” This means we correctly detect the anomalies on 2020-02-08, 2020-05-14, and 2020-06-17, so we have three true positives. But we unfortunately flagged 2020-03-30 as an incident when it isn’t one, so we have one false positive. Three true positives / (three true positives + one false positive) means our precision is 0.75. Also, we failed to detect 2020-06-30 as an incident, meaning we have one false negative. Three true positives / (three true positives + one false negative) means our recall is also 0.75. Our F1-score is given by the formula:

F1 = TP / (TP + ½ · (FP + FN))

Plugging in these values, we get an F1-score of 0.75. Not bad!

Now, let’s assume we set the threshold higher, at five days. Now, we detect only 2020-02-08 and 2020-05-14, the longest outages. These turn out to both be genuine incidents, so we have no false positives, meaning our precision is 1—perfect! But note that we fail to detect other genuine anomalies, 2020-06-17 and 2020-06-30, meaning we have two false negatives. Two true positives / (two true positives + two false negatives) means our recall is 0.5, worse than before. It makes sense that our recall suffered, because we chose a more conservative classifier with a higher threshold. Our F1-score can again be calculated with the preceding formula, and turns out to be 0.667.

If we plot our precision, recall, and F1-score in terms of the threshold we set, we see some important patterns. First, aggressive detectors with low thresholds have the best recall, since they’re quicker to alert and thus catch more genuine issues. On the other hand, more passive detectors have better precision, since they alert only for the worst anomalies, which are more likely to be genuine. The F1-score peaks somewhere between these two extremes—in this case, at a threshold of four days. Finding this sweet spot is key to fine-tuning our detectors, as depicted in Figure 4-21.
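Here is a minimal sketch of that threshold sweep, reusing the outages and ground_truth dictionaries from earlier and the alternative F1 formula above; treating precision as 1.0 when no alerts fire is a convention of ours, not a universal rule:

for threshold in range(1, 9):
    predictions = {date: days > threshold for date, days in outages.items()}
    tp = sum(predictions[d] and ground_truth[d] for d in outages)
    fp = sum(predictions[d] and not ground_truth[d] for d in outages)
    fn = sum(not predictions[d] and ground_truth[d] for d in outages)

    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))

    print(f"threshold={threshold}d  precision={precision:.2f}  "
          f"recall={recall:.2f}  F1={f1:.2f}")

Run over thresholds from one to eight days, this prints the same pattern as Figure 4-21: recall falls and precision rises as the threshold grows, with the F1-score peaking at a threshold of four days.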

Finally, let’s look at one last comparison (Figure 4-22). Notice that we’ve looked only at the F1-score, which weighs precision and recall equally. What happens when we look at other values of beta?

Figure 4-21. Calculating precision, recall, and F1-score and plotting the results to determine how to tune anomaly detectors
Figure 4-22. Calculating F-score with different values of β

Recall that a general Fβ says “recall is β times as important as precision.” Thus, we should expect that F2 is higher than F1 when recall is prioritized—which is exactly what we see at thresholds less than 4, as depicted in Figure 4-22. At the same time, the F0.5-score is higher for larger thresholds, showing more allowance for conservative classifiers with greater precision.

With this F-score in tow and a better-tuned algorithm, you’re ready to detect issues across the five pillars of data observability: freshness, volume, distribution, schema, and lineage.

Beyond the Surface: Other Useful Anomaly Detection Approaches

The best anomaly detection algorithms do three things: detect issues in near real time, alert those who need to know, and give you information to help prevent future downtime. In this chapter, we walked through common approaches and key elements of basic anomaly detection algorithms, but our example only scratches the surface. There are several other best practices, algorithm components, and methodologies that can yield similar, or even more accurate, results depending on the tooling you use:

Rule definitions or hard thresholding
Rule definitions set explicit cutoffs for certain metric values and determine anomalies relative to the threshold. While technically detection, this approach can only properly be called “anomaly” detection if most of the data points lie within the threshold. Rule definitions are incredibly scalable and might work for extremely well-defined SLAs, data uptime guarantees, and so forth.
Autoregressive models
Autoregression applies to time series anomaly detection, where data points are ordered by timestamp. Autoregressive models take data from previous timesteps, feed them into a (linear) regression model, and use the output to form a prediction for where the next timestamp’s data will be. Data points veering too far from the autoregressive prediction are marked anomalous. Combined with a moving average component, autoregression gives us the autoregressive moving average (ARMA) and ARIMA detection algorithms (a minimal sketch of the autoregressive idea follows this list). If we had taken our exoplanet example a step further and layered in autoregression, this data set would have worked quite well.
Exponential smoothing
Exponential smoothing methods exist to remove trend and seasonality from time series so that more naive approaches (e.g., ARIMA) can take over. Holt-Winters is a famous seasonal model for time series forecasting, and there is, again, a rich taxonomy (additive, multiplicative, damped, nondamped, and so on).
Clustering
Clustering techniques, like the k-nearest neighbor algorithm or the isolation forest algorithm, find anomalies by putting similar data points into buckets and alerting you to the “odd ones out,” e.g., the data points that fall into small or even one-off buckets.
Hyperparameter tuning
Machine learning models have lots of parameters, which are numerical representations of the data used by the prediction algorithm. Some parameters are learned using the data and training process. For example, with a z-scoring model, μ and σ are parameters set automatically from the input data’s distribution. Other parameters, called hyperparameters, are not set by the learning process but instead dictate the learning and inference processes in certain ways. Some hyperparameters affect the model architecture, for example the size of a neural network, the size of embedding and hidden state matrices, and so on. These are called model hyperparameters. Another class, algorithm hyperparameters, affects the way training is done, for example the learning rate, number of epochs, or number of data points per training batch.
Ensemble model framework
An ensemble model framework takes the best of each method—a bit of clustering, exponential smoothing, and autoregressive nodes combined into a neural feed-forward network—and combines their predictions using a majority-voting ensemble algorithm.
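To make the autoregressive idea in the list above concrete, here is a minimal sketch (with a made-up series and a hand-picked cutoff) that forecasts each point from the previous one and flags large deviations; it is a toy illustration, not a production ARIMA model:

import numpy as np

# Made-up daily metric with one obvious spike at t=5.
series = np.array([100, 102, 99, 101, 103, 250, 104, 102, 100, 98], dtype=float)

# Naive autoregressive forecast: predict each point as the previous value.
forecasts = series[:-1]
residuals = np.abs(series[1:] - forecasts)

THRESHOLD = 50  # illustrative cutoff; in practice you'd tune or learn this
for t, resid in enumerate(residuals, start=1):
    if resid > THRESHOLD:
        print(f"t={t}: deviation of {resid:.0f} from the forecast looks anomalous")

Note that both the spike itself and the return to normal get flagged, a quirk of this naive forecast that fuller autoregressive models (and the smoothing methods above) handle more gracefully.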

While important, such approaches are outside the scope of this book—for more on building great anomaly detection algorithms, we suggest you check out Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O’Reilly) by Aurélien Géron.

Designing Data Quality Monitors for Warehouses Versus Lakes

When it comes to building data quality monitors for your data system, it’s important to distinguish whether you’re working with structured, monolithic data from a warehouse or entering the wild west of the modern data lake ecosystem.

The primary differences between designing anomaly detection algorithms for warehouses and lakes boil down to:

  • The number of entry points you have to account for

  • How the metadata is collected and stored

  • How you can access that metadata

First, data lake systems tend to have a high number of entry points, meaning you should assume high heterogeneity in data entering from different sources. In monitoring, say, null rates in tabular data entering from Postgres, application logs, and a web API, a data scientist might notice clusters of table behavior corresponding to the different endpoints. In these cases, be wary of a “one-size-fits-all” modeling approach. More likely than not, different model architectures (e.g., different hyperparameters) will work better for each format. One way to handle this is to condition on the endpoint of the data itself, forming a new feature for input into the machine learning model. Another is to use an ensemble model architecture, or simply to have separate models for each of your use cases.
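As a minimal sketch of that conditioning trick, here is one way to one-hot encode the source endpoint and feed it to a detector alongside a numeric metric; the column names, values, and parameters are made up for illustration:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up per-table metrics arriving from three different entry points.
df = pd.DataFrame({
    "null_rate": [0.01, 0.02, 0.45, 0.01, 0.03, 0.50],
    "source": ["postgres", "postgres", "app_logs", "web_api", "web_api", "app_logs"],
})

# One-hot encode the source so the detector can condition on it.
features = pd.get_dummies(df, columns=["source"])
df["is_anomaly"] = IsolationForest(contamination=0.3, random_state=0).fit_predict(features) == -1
print(df)

With real data volumes you might instead train a separate model per source, as noted above; this sketch only shows the mechanics of adding the endpoint as a feature.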

Second, metadata collected straight into a data lake may need varying levels of preprocessing before you can expect an anomaly detection algorithm to derive anything of value from it. Types may need coercion, schemas may need alignment, and you may find yourself deriving entirely new augmented features in the data before running the detector’s training task.

This is fine to do immediately before model training, provided you aren’t bottlenecking your compute resources by applying “transformations” on large batches of input data. In some cases, it may be advantageous to devise some ELT steps in between the lake data and the machine learning algorithm. “Cleaning Data” provides some insight into why this may be valuable.

Summary

In this chapter, we’ve taken a quick safari through monitoring and anomaly detection as it relates to basic data quality checks. Now, how can these concepts help us apply detectors to our production environments in data warehouses and lakes?

The key lies in understanding that there’s no perfect classifier for any anomaly detection problem. There is always a trade-off between false positives and false negatives, or, equivalently, between precision and recall. You have to ask yourself, “How do I weigh the trade-off between these two? What determines the ‘sweet spot’ for my model parameters?” Choosing an Fβ score to optimize implicitly decides how you weigh these occurrences, and thereby what matters most in your classification problem.

Also, remember that any discussion of model accuracy isn’t complete without some sort of ground truth to compare with the model’s predictions. You need to know what makes a good classification before you know that you have one. In Chapter 5, we’ll discuss how to apply the technologies highlighted in Chapters 2, 3, and 4 to architecting more reliable data systems, as well as discuss new processes, like SLAs, SLIs, and SLOs, to help them scale.

1 Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei, “Edward: A Library for Probabilistic Modeling, Inference, and Criticism,” arXiv preprint arXiv:1610.09787, 2016, https://oreil.ly/CvuKL.
