Word count on RDD

Let's run a word count problem on stringRDD. Word count is the HelloWorld of the big data world. Word count means that we will count the occurrence of each word in the RDD:

So first we will create pairRDD as follows:

scala>valpairRDD=stringRdd.map( s => (s,1))pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26

The pairRDD consists of pairs of the word and one (integer) where word represents strings of stringRDD.

Now, we will run the reduceByKey operation on this RDD to count the occurrence of each word as follows:

scala>valwordCountRDD=pairRDD.reduceByKey((x,y) =>x+y)wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28

Now, let's ...

Get Apache Spark 2.x for Java Developers now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.