Chapter 3: Data Cleansing and Integration
In the previous chapter, you were introduced to the first step of the data analytics process: ingesting raw, transactional data from various source systems into a cloud-based data lake. Once the raw data is available, we need to process, clean, and transform it into a format from which meaningful, actionable business insights can be extracted. This process of cleaning, processing, and transforming raw data is known as data cleansing and integration, and it is the subject of this chapter.
Data sourced from operational systems is not conducive to analytics in its raw format. In this chapter, you will learn about various data integration techniques, which are useful in ...