Shuffling
Whatever the partitioner used, many operations will cause a repartitioning of data across the partitions of an RDD. New partitions can be created or several partitions can be collapsed/coalesced. All the data movement necessary for the repartitioning is called shuffling, and this is an important concept to understand when writing a Spark Job. The shuffling can cause a lot of performance lag as the computations are no longer in memory on the same executor but rather the executors are exchanging data over the wire.
A good example is the example of groupByKey(), we saw earlier in the Aggregations section. Obviously, lot of data was flowing between executors to make sure all values for a key are collected onto the same executor to perform ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access