January 2019
Beginner to intermediate
154 pages
4h 31m
English
Partitions of an existing RDD can be changed using repartition() or coalesce(). These operations can redistribute the RDD based on the number of partitions provided. The repartition() can be used to increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. On the other hand, coalesce() can be used only to decrease the number of partitions. In most of the cases, coalesce() does not trigger a shuffle. The coalesce() can be used soon after heavy filtering to optimize the execution time. It is important to notice that coalesce() does not always avoid shuffling. If the number of partitions provided is much smaller than the number of available nodes in the cluster then ...
Read now
Unlock full access