Resilient distributed datasets
Spark expresses all computations as a sequence of transformations and actions on distributed collections, called Resilient Distributed Datasets (RDDs). Let's explore how RDDs work with the Spark shell. Navigate to the examples directory and open a Spark shell as follows:
$ spark-shell
scala>
Let's start by loading an email in an RDD:
scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile
email is an RDD, with each element corresponding to a line in the input file. Notice how we created the RDD by calling the textFile method on an object called sc:
scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c
sc is a SparkContext instance, an object representing ...
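To make the split between transformations and actions concrete, here is a minimal sketch written as a standalone program rather than a shell session. It assumes Spark is on the classpath and runs in local mode. The object name RddSketch, the helper nonEmptyLineLengths, and the in-memory sample lines are hypothetical stand-ins introduced for illustration; they are not part of the example files used above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {
  // Transformations: filter and map describe a new RDD from an existing one.
  // Nothing is computed until an action is called on the result.
  def nonEmptyLineLengths(lines: RDD[String]): RDD[Int] =
    lines.filter(_.nonEmpty).map(_.length)

  def main(args: Array[String]): Unit = {
    // In spark-shell the context already exists as `sc`;
    // a standalone program has to build its own.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical in-memory stand-in for the lines of an email file.
      val email = sc.parallelize(Seq("Subject: hello", "", "See you at noon."))
      val lengths = nonEmptyLineLengths(email)

      // Actions: count and collect trigger the computation and
      // return plain Scala values to the driver program.
      println(s"number of lines: ${email.count()}")
      println(s"non-empty line lengths: ${lengths.collect().mkString(", ")}")
    } finally {
      sc.stop() // release the local Spark resources
    }
  }
}
```

In the Spark shell the same calls would work on the RDD returned by sc.textFile, since the shell supplies sc for you.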