Resilient distributed datasets

Spark expresses all computations as a sequence of transformations and actions on distributed collections, called Resilient Distributed Datasets (RDD). Let's explore how RDDs work with the Spark shell. Navigate to the examples directory and open a Spark shell as follows:

$ spark-shell
scala> 

Let's start by loading an email in an RDD:

scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile

email is an RDD, with each element corresponding to a line in the input file. Notice how we created the RDD by calling the textFile method on an object called sc:

scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c

sc is a SparkContext instance, an object representing ...

Get Scala: Guide for Data Science Professionals now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.