In this section, we will use the keyBy() operation to reduce shuffling. We will cover the following topics:
- Loading randomly partitioned data
- Trying to pre-partition data in a meaningful way
- Leveraging the keyBy() function
We will load randomly partitioned data, but this time using the RDD API. We will repartition the data in a meaningful way and examine what is going on underneath, as we did with the DataFrame and Dataset APIs. We will then learn how to leverage the keyBy() function to give our data some structure and trigger pre-partitioning in the RDD API.
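In the RDD API, keyBy(f) turns each element x into a pair (f(x), x); once records carry an explicit key, Spark can hash-partition them consistently so that later joins or aggregations on that key avoid a full shuffle. The following is a rough, plain-Python sketch of that idea (not Spark itself; the record shape and key function are made up for illustration):

```python
# Conceptual sketch of RDD.keyBy() plus hash partitioning (plain Python,
# not Spark): keyBy pairs every record with a derived key, and a hash
# partitioner then routes each pair to a partition based on that key.

def key_by(records, key_func):
    """Mimic rdd.keyBy(key_func): emit (key, record) pairs."""
    return [(key_func(record), record) for record in records]

def hash_partition(pairs, num_partitions):
    """Mimic a hash partitioner: assign each (key, record) pair by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, record in pairs:
        partitions[hash(key) % num_partitions].append((key, record))
    return partitions

# Hypothetical user records: (user_id, action)
records = [("user_1", "click"), ("user_2", "view"), ("user_1", "buy")]

keyed = key_by(records, key_func=lambda r: r[0])   # key by user_id
parts = hash_partition(keyed, num_partitions=4)

# All pairs with the same key land in the same partition, so a later
# aggregation by user_id needs no cross-partition data movement.
```

The point of the sketch is the invariant, not the mechanics: because the partition is a pure function of the key, any two records that agree on the key are guaranteed to be co-located, which is exactly what pre-partitioning with keyBy() buys us in Spark.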
Here is the test we will be using in this section. We are creating two random input records. The first record has a random user ...