Data Quality Fundamentals

by Barr Moses, Lior Gavish, Molly Vorwerck

September 2022

Beginner to intermediate

308 pages

8h 43m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Why Data Quality Deserves Attention—Now
What Is Data Quality?; Framing the Current Moment; Understanding the “Rise of Data Downtime”; Other Industry Trends Contributing to the Current Moment; Summary
2. Assembling the Building Blocks of a Reliable Data System
Understanding the Difference Between Operational and Analytical Data; What Makes Them Different?; Data Warehouses Versus Data Lakes; Data Warehouses: Table Types at the Schema Level; Data Lakes: Manipulations at the File Level; What About the Data Lakehouse?; Syncing Data Between Warehouses and Lakes; Collecting Data Quality Metrics; What Are Data Quality Metrics?; How to Pull Data Quality Metrics; Using Query Logs to Understand Data Quality in the Warehouse; Using Query Logs to Understand Data Quality in the Lake; Designing a Data Catalog; Building a Data Catalog; Summary
3. Collecting, Cleaning, Transforming, and Testing Data
Collecting Data; Application Log Data; API Responses; Sensor Data; Cleaning Data; Batch Versus Stream Processing; Data Quality for Stream Processing; Normalizing Data; Handling Heterogeneous Data Sources; Schema Checking and Type Coercion; Syntactic Versus Semantic Ambiguity in Data; Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka; Running Analytical Data Transformations; Ensuring Data Quality During ETL; Ensuring Data Quality During Transformation; Alerting and Testing; dbt Unit Testing; Great Expectations Unit Testing; Deequ Unit Testing; Managing Data Quality with Apache Airflow; Scheduler SLAs; Installing Circuit Breakers with Apache Airflow; SQL Check Operators; Summary
4. Monitoring and Anomaly Detection for Your Data Pipelines
Knowing Your Known Unknowns and Unknown Unknowns; Building an Anomaly Detection Algorithm; Monitoring for Freshness; Understanding Distribution; Building Monitors for Schema and Lineage; Anomaly Detection for Schema Changes and Lineage; Visualizing Lineage; Investigating a Data Anomaly; Scaling Anomaly Detection with Python and Machine Learning; Improving Data Monitoring Alerting with Machine Learning; Accounting for False Positives and False Negatives; Improving Precision and Recall; Detecting Freshness Incidents with Data Monitoring; F-Scores; Does Model Accuracy Matter?; Beyond the Surface: Other Useful Anomaly Detection Approaches; Designing Data Quality Monitors for Warehouses Versus Lakes; Summary
5. Architecting for Data Reliability
Measuring and Maintaining High Data Reliability at Ingestion; Measuring and Maintaining Data Quality in the Pipeline; Understanding Data Quality Downstream; Building Your Data Platform; Data Ingestion; Data Storage and Processing; Data Transformation and Modeling; Business Intelligence and Analytics; Data Discovery and Governance; Developing Trust in Your Data; Data Observability; Measuring the ROI on Data Quality; How to Set SLAs, SLOs, and SLIs for Your Data; Case Study: Blinkist; Summary
6. Fixing Data Quality Issues at Scale
Fixing Quality Issues in Software Development; Data Incident Management; Incident Detection; Response; Root Cause Analysis; Resolution; Blameless Postmortem; Incident Response and Mitigation; Establishing a Routine of Incident Management; Why Data Incident Commanders Matter; Case Study: Data Incident Management at PagerDuty; The DataOps Landscape at PagerDuty; Data Challenges at PagerDuty; Using DevOps Best Practices to Scale Data Incident Management; Summary
7. Building End-to-End Lineage
Building End-to-End Field-Level Lineage for Modern Data Systems; Basic Lineage Requirements; Data Lineage Design; Parsing the Data; Building the User Interface; Case Study: Architecting for Data Reliability at Fox; Exercise “Controlled Freedom” When Dealing with Stakeholders; Invest in a Decentralized Data Team; Avoid Shiny New Toys in Favor of Problem-Solving Tech; To Make Analytics Self-Serve, Invest in Data Trust; Summary
8. Democratizing Data Quality
Treating Your “Data” Like a Product; Perspectives on Treating Data Like a Product; Convoy Case Study: Data as a Service or Output; Uber Case Study: The Rise of the Data Product Manager; Applying the Data-as-a-Product Approach; Building Trust in Your Data Platform; Align Your Product’s Goals with the Goals of the Business; Gain Feedback and Buy-in from the Right Stakeholders; Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains; Sign Off on Baseline Metrics for Your Data and How You Measure Them; Know When to Build Versus Buy; Assigning Ownership for Data Quality; Chief Data Officer; Business Intelligence Analyst; Analytics Engineer; Data Scientist; Data Governance Lead; Data Engineer; Data Product Manager; Who Is Responsible for Data Reliability?; Creating Accountability for Data Quality; Balancing Data Accessibility with Trust; Certifying Your Data; Seven Steps to Implementing a Data Certification Program; Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team; In the Beginning: When a Small Team Struggles to Meet Data Demands; Supporting Hypergrowth as a Decentralized Data Operation; Regrouping, Recentralizing, and Refocusing on Data Trust; Considerations When Scaling Your Data Team; Increasing Data Literacy; Prioritizing Data Governance and Compliance; Prioritizing a Data Catalog; Beyond Catalogs: Enforcing Data Governance; Building a Data Quality Strategy; Make Leadership Accountable for Data Quality; Set Data Quality KPIs; Spearhead a Data Governance Program; Automate Your Lineage and Data Governance Tooling; Create a Communications Plan; Summary
9. Data Quality in the Real World: Conversations and Case Studies
Building a Data Mesh for Greater Data Quality; Domain-Oriented Data Owners and Pipelines; Self-Serve Functionality; Interoperability and Standardization of Communications; Why Implement a Data Mesh?; To Mesh or Not to Mesh? That Is the Question; Calculating Your Data Mesh Score; A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh; Can You Build a Data Mesh from a Single Solution?; Is Data Mesh Another Word for Data Virtualization?; Does Each Data Product Team Manage Their Own Separate Data Stores?; Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?; Is the Data Mesh Right for All Data Teams?; Does One Person on Your Team “Own” the Data Mesh?; Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?; Case Study: Kolibri Games’ Data Stack Journey; First Data Needs; Pursuing Performance Marketing; 2018: Professionalize and Centralize; Getting Data-Oriented; Getting Data-Driven; Building a Data Mesh; Five Key Takeaways from a Five-Year Data Evolution; Making Metadata Work for the Business; Unlocking the Value of Metadata with Data Discovery; Data Warehouse and Lake Considerations; Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh; Moving from Traditional Data Catalogs to Modern Data Discovery; Deciding When to Get Started with Data Quality at Your Company; You’ve Recently Migrated to the Cloud; Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity; Your Data Team Is Growing; Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues; Your Team Has More Data Consumers Than They Did One Year Ago; Your Company Is Moving to a Self-Service Analytics Model; Data Is a Key Part of the Customer Value Proposition; Data Quality Starts with Trust; Summary

10. Pioneering the Future of Reliable Data Systems
Be Proactive, Not Reactive; Predictions for the Future of Data Quality and Reliability; Data Warehouses and Lakes Will Merge; Emergence of New Roles on the Data Team; Rise of Automation; More Distributed Environments and the Rise of Data Domains; So Where Do We Go from Here?
Index
About the Authors

Content preview from Data Quality Fundamentals

Chapter 7. Building End-to-End Lineage

On July 27, 2004, a five-year-old startup by the name of Google was faced with a serious problem: their application was down.

For several hours, users across the United States, France, and Great Britain were unable to access the popular search engine. The then-700-person company and their millions of users were left in the dark as engineers struggled to fix the problem and discover the root cause of the issue. By midday, a tedious and intensive process conducted by a few panicked engineers determined that the MyDoom virus was to blame.

In 2021, an outage of that length and scale would be considered rather anomalous, but 15 years ago, these types of software outages weren’t uncommon. After leading teams through several of these experiences over the years, Benjamin Treynor Sloss, a Google engineering manager at the time, determined there had to be a better way to manage and prevent these dizzying fire drills, not just at Google but across the industry.

Inspired by his early career building data and IT infrastructure, Sloss codified his learnings as an entirely new discipline—site reliability engineering (SRE)—dedicated to optimizing the maintenance and operations of software systems (like Google’s search engine) with reliability in mind.

According to Sloss and others paving the way forward for the discipline, SRE was about automating away the need to worry about edge cases and unknown unknowns (like buggy code, server failures, and viruses). Ultimately, ...

Publisher Resources

ISBN: 9781098112035
