Programming using RDDs

An RDD can be created in four ways:

  • Parallelize a collection: This is one of the easiest ways to create an RDD. You can take an existing collection from your program, such as a List, Array, or Set, and ask Spark to distribute it across the cluster so it can be processed in parallel. A collection is distributed with parallelize(), as shown here:
# Python
numberRDD = spark.sparkContext.parallelize(range(1, 10))
numberRDD.collect()
Out[4]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

 The following code performs the same operation in Scala:

// Scala
val numberRDD = spark.sparkContext.parallelize(1 to 10)
numberRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  • From an external dataset ...
