July 2017
Intermediate to advanced
796 pages
18h 55m
English
groupByKey involves a lot of shuffling, because every element of the pair RDD is sent across the network. reduceByKey tends to perform better because it does not shuffle every element: it first uses a local combiner to aggregate the values for each key within each partition, and only then sends the partially aggregated results through the shuffle, as groupByKey does with the raw elements. This greatly reduces the amount of data transferred, since we don't need to send everything over. reduceByKey works by merging the values for each key using an associative and commutative reduce function. The merge is applied locally on each mapper first, and the same function is then applied again on the reducer to the results it receives.
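The effect of the local combiner can be illustrated with a small sketch. The code below is not Spark itself: it is a plain-Scala simulation, under the assumption that each inner `Seq` stands for one mapper partition. The names `CombinerSketch`, `shuffleAll`, `shuffleCombined`, and `reduceSide` are hypothetical and exist only for this illustration, as is the sample data. In Spark, the corresponding calls would be `rdd.groupByKey()` and `rdd.reduceByKey(_ + _)`.

```scala
// Plain-Scala simulation of the shuffle behaviour described above; no Spark dependency.
object CombinerSketch {
  type Pair = (String, Int)

  // Hypothetical input: two "mapper" partitions of (key, value) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1),
    Seq("b" -> 1, "a" -> 1, "b" -> 1)
  )

  // groupByKey style: every pair crosses the shuffle unchanged.
  def shuffleAll(parts: Seq[Seq[Pair]]): Seq[Pair] = parts.flatten

  // reduceByKey style: each partition first merges its own values per key,
  // so at most one pair per key leaves each partition.
  def shuffleCombined(parts: Seq[Seq[Pair]], f: (Int, Int) => Int): Seq[Pair] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }.toSeq
    }

  // Reducer side: merge whatever arrived for each key with the same function.
  def reduceSide(shuffled: Seq[Pair], f: (Int, Int) => Int): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val all = shuffleAll(partitions)
    val combined = shuffleCombined(partitions, _ + _)
    println(s"pairs shuffled without combiner: ${all.size}") // 7
    println(s"pairs shuffled with combiner: ${combined.size}") // 4
    // Both routes give the same totals because + is associative and commutative.
    println(reduceSide(all, _ + _) == reduceSide(combined, _ + _)) // true
  }
}
```

With this sample data, seven pairs cross the simulated shuffle without a combiner and four with one, while the final per-key totals are identical. The saving grows with the number of repeated keys per partition; if every key occurs only once per partition, combining locally sends just as many pairs.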
reduceByKey can be invoked either using a custom partitioner ...