Chapter 5: Cleaning, Transforming, and Enriching Data
In the previous two chapters, you learned how to build data pipelines that read from and write to files and databases. In many cases, these skills alone will be enough to build production data pipelines. For example, you might read files from a data lake and insert their contents into a database, and you now have the skills to do so. Sometimes, however, you will need to do something with the data after extraction but before loading: you will need to clean it. Cleaning is a vague term. More specifically, you will need to check the validity of the data and answer questions such as the following: Is it complete? Are the values within the proper ranges? Are the columns the ...
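As a rough illustration of the first two questions (completeness and value ranges), here is a minimal sketch using only the Python standard library. The sample records, the field names, the helper functions `find_incomplete` and `find_out_of_range`, and the 0 to 120 age bounds are all hypothetical and chosen only for demonstration; they are not taken from the chapter.

```python
# Hypothetical records, standing in for rows extracted from a file or database.
records = [
    {"id": 1, "age": 34, "city": "Albuquerque"},
    {"id": 2, "age": None, "city": "Santa Fe"},   # missing age
    {"id": 3, "age": 212, "city": "Taos"},        # implausible age
]


def find_incomplete(rows, required):
    """Return the rows in which any required field is absent or None."""
    return [r for r in rows if any(r.get(f) is None for f in required)]


def find_out_of_range(rows, field, low, high):
    """Return the rows whose field is present but outside [low, high]."""
    return [
        r for r in rows
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]


# Is it complete? Every record is assumed to need all three fields.
incomplete = find_incomplete(records, ["id", "age", "city"])

# Are the values within the proper ranges? The bounds are an assumption.
out_of_range = find_out_of_range(records, "age", 0, 120)

print("Incomplete ids:", [r["id"] for r in incomplete])
print("Out-of-range ids:", [r["id"] for r in out_of_range])
```

Missing values are excluded from the range check so that each record is reported under the question it actually fails. What to do with flagged records (drop, fill, or correct them) is a separate decision that depends on the pipeline.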