Chapter 3. Collecting, Cleaning, Transforming, and Testing Data

Now that we have a better understanding of the various tools necessary to prioritize data reliability, let’s discuss how to ready your data for production use cases with data quality in mind.

In Chapter 2, we discussed some of the domain terminology and walked through a taxonomy of where data quality nuggets (mostly metadata) can be found. Still, to get a thorough sense of data quality in your pipeline, you need to look end to end at the entire life cycle of data as it lives within your organization.

In this chapter, we’ll walk through how to manage data before and while it’s in the pipeline through four key steps that impact overall data quality: data collection, cleaning, transformation, and testing. While data collection and cleaning concern the earliest stages of the production pipeline, transformation and testing address data quality midway through the data’s journey to becoming actionable analytics.

Collecting Data

When it comes to collecting data, perhaps no aspect is as important as the entrypoint: the most upstream location in any data pipeline. We define an entrypoint as the initial point of contact where data from the outside world enters your pipeline. If you’ve worked with Docker containerization, you might be familiar with the ENTRYPOINT keyword, which specifies the initial command run whenever we start a container. Likewise, “entrypoint” in software engineering parlance ...
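To make the parallel concrete, the following is a minimal sketch of what an entrypoint might look like in a Python ingestion script. The function name, required fields, and checks are hypothetical, not a prescribed implementation; the point is that because this is the first place external records touch the pipeline, it is also the natural place to attach basic quality checks.

    # A hypothetical pipeline entrypoint: the first function external
    # records pass through before anything else in the pipeline sees them.
    from datetime import datetime, timezone
    from typing import Any

    REQUIRED_FIELDS = {"event_id", "user_id", "timestamp"}  # illustrative schema

    def ingest_event(raw_record: dict[str, Any]) -> dict[str, Any]:
        """Validate and annotate a record at the pipeline's entrypoint."""
        # Reject records missing required fields at the door, rather than
        # letting nulls propagate into downstream tables.
        missing = REQUIRED_FIELDS - raw_record.keys()
        if missing:
            raise ValueError(f"Record rejected at entrypoint; missing fields: {missing}")

        # Stamp each record with its arrival time so downstream consumers
        # can reason about freshness.
        record = dict(raw_record)
        record["ingested_at"] = datetime.now(timezone.utc).isoformat()
        return record

What matters here is less the specific checks than their placement: anything caught at the entrypoint never gets the chance to corrupt the data downstream.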
