Apache Spark for Data Science Cookbook by Padma Priya Chitturi

Working with pair RDDs

This recipe shows how to work with RDDs of key/value pairs, known as pair RDDs. Pair RDDs are widely used to perform aggregations. We'll do some initial ETL to get the data into a key/value format, and then see how to apply transformations on a single pair RDD and across two pair RDDs.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the other distributed modes, that is, standalone, YARN, or Mesos. It can also be run in local mode.
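For example, one simple way to follow along, assuming Spark's bin directory is on your PATH, is to start a local Spark shell using all available cores:

     spark-shell --master local[*]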

How to do it…

  1. We can create a pair RDD from a collection of strings in the following way (a fuller sketch follows the snippet):

     val baseRdd = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "john,is, in,town,hall"))
     val inputRdd ...
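     The snippet above is truncated in the source. As a minimal sketch of the idea, assuming the intent is to key each line by its first comma-separated token, a pair RDD could be built and then transformed with a single-pair-RDD operation (reduceByKey) and a two-pair-RDD operation (join). The names inputRdd, countsByKey, otherRdd, and joined below are illustrative, not the book's own code:

     val baseRdd = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "john,is, in,town,hall"))

     // Assumption: key each line by its first comma-separated token;
     // trim tokens to remove stray whitespace in the input strings.
     val inputRdd = baseRdd.map { line =>
       val tokens = line.split(",").map(_.trim)
       (tokens.head, tokens.tail.mkString(","))
     }

     // Single-pair-RDD transformation: count how many lines share each key.
     val countsByKey = inputRdd.mapValues(_ => 1).reduceByKey(_ + _)
     countsByKey.collect().foreach(println)

     // Two-pair-RDD transformation: join with another pair RDD on the key.
     val otherRdd = sc.parallelize(Array(("this", "article"), ("it", "pronoun")))
     val joined = inputRdd.join(otherRdd)
     joined.collect().foreach(println)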
