Data Quality Fundamentals

by Barr Moses, Lior Gavish, Molly Vorwerck

September 2022

Beginner to intermediate

308 pages

8h 43m

English

O'Reilly Media, Inc.


Content preview from Data Quality Fundamentals

Chapter 5. Architecting for Data Reliability

Airbnb, the global online vacation marketplace, wrote in a 2020 post on its engineering blog that “leadership [set] high expectations for data timeliness and quality,” which required significant investment in the company’s data quality and governance efforts. Meanwhile, Krishna Puttaswamy and Suresh Srinivas, former engineers at Uber, wrote in a 2021 Uber Engineering blog article that high-quality big data is “at the heart of this massive transformation platform.”

It’s no secret that data quality is top of mind for some of the best data teams. Still, writing about data quality is one thing; how do we actually achieve it in practice?

Data reliability—an organization’s ability to deliver high data availability and health throughout the entire data life cycle—is the outcome of high data quality. As companies ingest more operational and third-party data than ever before, with employees from across the organization interacting with that data at all stages of its life cycle, it’s become increasingly important for that data to be reliable.

Data reliability has to be intentionally built into every level of your organization, from the processes and technologies you leverage to build and manage your data stack to the way you communicate and triage data issues further downstream. In this chapter, we’ll explore how to architect for data reliability at each stage of the pipeline and of the data engineering experience.