Chapter 10. Pioneering the Future of Reliable Data Systems

If Data Quality Fundamentals taught you anything about the larger state of analytics and data engineering, it’s likely that the data industry is going through a massive, irreversible sea change.

Only five years ago, it wasn’t uncommon for data to live in silos, accessed only by functional teams on an ad hoc basis for discrete tasks, such as understanding how internal systems were being used or querying application usage over time. Now, analytical data is becoming the modern business’s most critical and competitive form of currency. It’s no longer a matter of whether your company relies on data, but how much and for what use cases.

Still, it’s simply not enough to collect more data; you also have to trust it. Cloud data warehouses and lakes, data catalogs, open source testing frameworks, and data observability platforms are all building out features and functionality that bring data reliability to the center of the conversation. Warehouses like Snowflake and Redshift make it easy to pull data quality metrics for freshness and volume, while open source tools like dbt and Great Expectations enable practitioners to quickly unit test their most critical data sets. Even catalogs like Alation and Collibra can provide some insight into data integrity and discovery at static points in time.
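
To make the warehouse side of this concrete, the following is a minimal sketch of pulling freshness and volume signals from Snowflake’s INFORMATION_SCHEMA, using the snowflake-connector-python package; the credentials, database, and schema names here are placeholders for illustration, not examples from this book.

```python
# Minimal sketch: pull table-level freshness and volume signals from Snowflake.
# Assumes snowflake-connector-python is installed; connection details, the
# ANALYTICS database, and the PROD schema are placeholder names.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],   # hypothetical environment variables
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="ANALYTICS",                      # placeholder database
)

# ROW_COUNT and LAST_ALTERED act as rough proxies for volume and freshness.
query = """
    SELECT table_name, row_count, last_altered
    FROM information_schema.tables
    WHERE table_schema = 'PROD'                -- placeholder schema
    ORDER BY last_altered DESC
"""

cur = conn.cursor()
try:
    for table_name, row_count, last_altered in cur.execute(query):
        print(f"{table_name}: {row_count} rows, last updated {last_altered}")
finally:
    cur.close()
    conn.close()
```

A scheduled check like this is no substitute for end-to-end observability, but it illustrates how little code is needed to start tracking freshness and volume at the table level.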

While these exciting new technologies have given data engineering teams more leverage ...
