5 Data frame gymnastics: Joining and grouping

This chapter covers

  • Joining two data frames together
  • Selecting the right type of join for your use case
  • Grouping data and understanding the GroupedData transitional object
  • Breaking the GroupedData with an aggregation method
  • Filling null values in your data frame

In chapter 4, we looked at how to transform a data frame by selecting, dropping, creating, renaming, reordering, and summarizing columns. Those operations constitute the foundation for working with a data frame in PySpark. In this chapter, I will complete the review of the most common operations you will perform on a data frame: linking or joining data frames, as well as grouping data (and performing operations on the
