
Introducing Spark and Kafka | 129
S. No. RDD Transformations and Meaning
9. distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source dataset.
Example:
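// Assumes an existing SparkSession named spark.
val rdd1 = spark.sparkContext.parallelize(
  Seq((1, "jan", 2016), (3, "nov", 2014), (16, "feb", 2014), (3, "nov", 2014)))
// The duplicate tuple (3, "nov", 2014) appears only once in the result.
val result = rdd1.distinct()
println(result.collect().mkString(", "))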
Transformations on Pair RDD:
S. No. Pair RDD Transformations and Meaning
1. groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
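Example:
The following is a minimal sketch of groupByKey, not taken from the original text. The object name GroupByKeyExample, the local[*] master setting, and the (month, amount) sample pairs are assumptions chosen only for illustration. It groups the values that share a key into one Iterable per key.

```scala
import org.apache.spark.sql.SparkSession

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("GroupByKeyExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical (month, amount) pairs; "jan" and "nov" each occur twice.
    val pairs = spark.sparkContext.parallelize(
      Seq(("jan", 10), ("nov", 5), ("jan", 7), ("feb", 3), ("nov", 1)))

    // groupByKey returns an RDD[(String, Iterable[Int])] with one entry per key.
    val grouped = pairs.groupByKey()

    // Convert each Iterable to a List so it prints readably.
    // The order of keys, and of values within a key, is not guaranteed.
    grouped.mapValues(_.toList).collect().foreach(println)

    spark.stop()
  }
}
```

With this sample data the result has three entries, one each for "jan", "nov", and "feb", with the values 10 and 7 grouped under "jan" and 5 and 1 grouped under "nov".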
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using ...