Chapter 3. Collecting, Cleaning, Transforming, and Testing Data
Now that we have a better understanding of the various tools necessary to prioritize data reliability, let’s discuss how to ready your data for production use cases with data quality in mind.
In Chapter 2, we discussed some of the domain terminology and walked through a taxonomy of where data quality nuggets (mostly metadata) are to be found. Still, to get a thorough sense of data quality in your data pipeline, you need to look end to end, at the entire life cycle of data as it persists at your organization.
In this chapter, we’ll walk through how to manage data before and while it’s in the pipeline through four key steps that impact overall data quality: data collection, cleaning, transformation, and testing. While data collection and cleaning concern the first step of the production pipeline, transformation and testing tackle data quality while it’s midway through its journey to becoming actionable analytics.
When it comes to collecting data, perhaps no aspect of the pipeline is as important as the entrypoint, the most upstream location in any data pipeline. We define an entrypoint as an initial point of contact where data from the outside world enters your pipeline. If you’re familiar with Docker containerization, you may recognize the ENTRYPOINT keyword. It specifies the initial command that runs whenever we start a container. Likewise, “entrypoint” in software engineering ...