March 2019
Beginner to intermediate
182 pages
4h 6m
English
In this section, we will test the operations that cause a shuffle in Apache Spark. We will cover the following topics:
A join is one operation that causes a shuffle, and we will use it to join our two DataFrames. We will first check whether the join causes a shuffle, and then look at how to avoid it. To understand this, we will use two DataFrames that are partitioned differently and examine what happens when we join two datasets or DataFrames that are not partitioned, or are partitioned randomly. It will cause a shuffle because there is no way to join two datasets with the ...
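To make the idea of differently partitioned inputs concrete, here is a minimal sketch in plain Python. It does not use the Spark API; it only simulates where records would be placed. The helper names `partition_by` and `records_to_move`, the sample `users` and `orders` data, and the two placement rules are all assumptions made for illustration.

```python
def partition_by(records, num_partitions, partitioner):
    """Place each (key, value) record in the partition chosen by `partitioner`."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partitioner(key) % num_partitions].append((key, value))
    return partitions


def records_to_move(left_parts, right_parts):
    """Count right-side records whose key lives in a different partition
    index than the matching key on the left side."""
    left_location = {
        key: index
        for index, part in enumerate(left_parts)
        for key, _ in part
    }
    moved = 0
    for index, part in enumerate(right_parts):
        for key, _ in part:
            if key in left_location and left_location[key] != index:
                moved += 1
    return moved


NUM_PARTITIONS = 4

# Hypothetical sample data: eight users and one order per user.
users = [(user_id, f"user-{user_id}") for user_id in range(8)]
orders = [(user_id, f"order-{user_id}") for user_id in range(8)]


def modulo_placement(key):
    # Stand-in for hash partitioning on an integer key.
    return key


def range_placement(key):
    # Stand-in for a different layout: consecutive keys share a partition.
    return key // 2


users_by_modulo = partition_by(users, NUM_PARTITIONS, modulo_placement)
orders_by_range = partition_by(orders, NUM_PARTITIONS, range_placement)
orders_by_modulo = partition_by(orders, NUM_PARTITIONS, modulo_placement)

# Different layouts: most matching keys sit in different partitions,
# so records would have to move before they could be joined.
mismatched_moves = records_to_move(users_by_modulo, orders_by_range)

# Same layout on the join key: matching keys are already co-located.
aligned_moves = records_to_move(users_by_modulo, orders_by_modulo)

print("records to move with different layouts:", mismatched_moves)
print("records to move with the same layout:", aligned_moves)
```

In this simulation, a non-zero count for the mismatched layouts stands in for the data movement that a shuffle performs, while the aligned layouts need no movement at all. Real Spark partitioners and join planning are more involved than this sketch.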