Chapter 2. Assembling the Building Blocks of a Reliable Data System

While solving data quality issues in production is a critical skill set for any data practitioner, data downtime can often be prevented almost entirely with the right systems and processes in place.

Like software, data is subject to any number of operational, programmatic, or even data-related influences at various stages of the pipeline, and all it takes is one schema change or code push to send a downstream report into disarray.
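To make the schema-change failure mode concrete, here is a minimal sketch of a defensive check a pipeline might run before loading records into a downstream table. The schema, column names, and types are purely illustrative assumptions, not anything prescribed by this book:

```python
# Hypothetical sketch: guard a downstream report against upstream schema
# changes by validating each incoming record against an expected schema
# before loading. The columns and types below are illustrative only.

EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def validate_schema(record: dict) -> list:
    """Return a list of human-readable schema violations for one record."""
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            problems.append("missing column: " + column)
        elif not isinstance(record[column], expected_type):
            problems.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    # Flag columns the upstream source added that downstream code won't expect.
    for column in record:
        if column not in EXPECTED_SCHEMA:
            problems.append("unexpected column: " + column)
    return problems
```

A pipeline could quarantine or alert on any record for which `validate_schema` returns a non-empty list, catching a renamed or dropped column before it reaches a report.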

As we’ll discuss in Chapter 8, solving for data quality and building more reliable pipelines breaks down into three key components: processes, technologies, and people. In this chapter, we’ll tackle the technology component of this equation, mapping the disparate pieces of the data pipeline and examining what it takes to measure, fix, and prevent data downtime at each step.

Data systems are ridiculously complex, with various stages of the data pipeline contributing to this chaos. And as companies increasingly invest in data and analytics, the push to build at scale puts serious pressure on data engineers to account for quality before data even enters the pipeline.

In this chapter, we’ll highlight the various metadata-powered building blocks, from data catalogs to data warehouses and lakes, that set your data infrastructure up for success in delivering high-quality data at each stage of the pipeline.

Understanding the Difference Between Operational ...
