Programming using RDDs

An RDD can be created in four ways:

  • Parallelize a collection: This is one of the easiest ways to create an RDD. You can take an existing collection from your program, such as a List, Array, or Set, and ask Spark to distribute it across the cluster so it can be processed in parallel. A collection is distributed with the help of parallelize(), as shown here:
```python
# Python
numberRDD = spark.sparkContext.parallelize(range(1, 11))
numberRDD.collect()
Out[4]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

 The following code performs the same operation in Scala:

```scala
// Scala
val numberRDD = spark.sparkContext.parallelize(1 to 10)
numberRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
```
  • From an external dataset ...
