6

Understanding Data Transformation

One of a data engineer's main jobs is to transform data so that it is usable by Business Intelligence (BI) applications, data scientists, and analysts. In Chapter 3, you learned the basics of a Spark application and how to ingest data.

In this chapter, we are going to dive deeper and look at the following advanced topics, which are essential for any data engineer using Spark to build data pipelines:

  • Understanding the difference between transformations and actions
  • Learning how to aggregate, group, and join data
  • Leveraging advanced window functions
  • Working with complex dataset types

Technical requirements

All of the code and data for this chapter ...
