July 2017
Intermediate to advanced
796 pages
18h 55m
English
ShuffledRDD shuffles the RDD elements by key so that all values for the same key end up on the same executor, where aggregation or combining logic can be applied. A good example is what happens when reduceByKey() is called on a pair RDD:
class ShuffledRDD[K, V, C] extends RDD[(K, C)]
The following reduceByKey operation on pairRDD aggregates the records by state:
scala> val pairRDD = statesPopulationRDD.map(record => (record.split(",")(0), 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[82] at map at <console>:27

scala> pairRDD.take(5)
res101: Array[(String, Int)] = Array((State,1), (Alabama,1), (Alaska,1), (Arizona,1), (Arkansas,1))

scala> val shuffledRDD = pairRDD.reduceByKey(_+_)
shuffledRDD: ...
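To make the shuffle step concrete, the following is a minimal plain-Scala sketch (no Spark dependency) of the kind of work a reduceByKey shuffle performs: combine values per key inside each input partition, route each key to an output partition by hash, then merge the partial results. The names ShuffleSketch, partitionFor, and the local reduceByKey are hypothetical and exist only for illustration; this is not Spark's implementation, although the non-negative modulo of the key's hash code mirrors how hash partitioning is commonly described.

```scala
// Illustrative sketch only: simulates a reduceByKey-style shuffle on in-memory
// collections. All names here are hypothetical, not Spark APIs.
object ShuffleSketch {

  // Pick an output partition for a key. The modulo is made non-negative so
  // keys with negative hash codes still map to a valid partition index.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def reduceByKey[K, V](
      inputPartitions: Seq[Seq[(K, V)]],
      numPartitions: Int)(f: (V, V) => V): Seq[Map[K, V]] = {

    // Map side: combine values per key within each input partition,
    // so less data has to move during the shuffle.
    val combined: Seq[Map[K, V]] = inputPartitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
    }

    // Shuffle: send every (key, partialValue) pair to the output partition
    // chosen by the key's hash, so equal keys always land together.
    val routed: Map[Int, Seq[(K, V)]] =
      combined.flatten.groupBy { case (k, _) => partitionFor(k, numPartitions) }

    // Reduce side: merge the partial values for each key in each partition.
    (0 until numPartitions).map { p =>
      routed.getOrElse(p, Seq.empty[(K, V)])
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
    }
  }
}
```

Calling ShuffleSketch.reduceByKey(partitions, 2)(_ + _) on (state, 1) pairs spread across several input partitions yields one count per state, with each state present in exactly one output partition. That co-location by key is what allows the aggregation to run independently on each partition.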