Resilient distributed datasets

Spark expresses all computations as a sequence of transformations and actions on distributed collections called Resilient Distributed Datasets (RDDs). Let's explore how RDDs work with the Spark shell. Navigate to the examples directory and open a Spark shell as follows:

$ spark-shell

Let's start by loading an email in an RDD:

scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile

email is an RDD, with each element corresponding to a line in the input file. Notice how we created the RDD by calling the textFile method on an object called sc:

scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c

sc is a SparkContext instance, an object representing ...
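
To make the distinction between transformations and actions more concrete, here is a minimal sketch written as a standalone program rather than a shell session. The object name RddSketch, the application name, the local[*] master setting, and the sample lines are all assumptions made for illustration. In the Spark shell, sc is already provided, so you would not construct it yourself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all available cores.
    // The Spark shell creates sc for you, so this setup is only needed standalone.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the lines of an email file;
      // sc.textFile would produce an RDD[String] of the same shape.
      val lines = sc.parallelize(Seq(
        "Subject: meeting",
        "",
        "See you at noon",
        "Bring the report"
      ))

      // Transformations are lazy: they describe new RDDs but compute nothing yet.
      val nonEmpty = lines.filter(_.nonEmpty)
      val words = nonEmpty.flatMap(_.split("\\s+"))

      // Actions trigger the computation and return results to the driver.
      println(s"non-empty lines: ${nonEmpty.count()}")
      println(s"words: ${words.count()}")
      words.take(3).foreach(println)
    } finally {
      // Release the resources held by the context.
      sc.stop()
    }
  }
}
```

In the shell, the same filter, flatMap, count, and take calls could be applied directly to the email RDD loaded earlier.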
