Resilient distributed datasets
Spark expresses all computations as a sequence of transformations and actions on distributed collections, called Resilient Distributed Datasets (RDDs). Let's explore how RDDs work with the Spark shell. Navigate to the examples directory and open a Spark shell as follows:
$ spark-shell
scala>
Let's start by loading an email in an RDD:
scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile
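All work on an RDD follows the pattern described above: transformations lazily define new RDDs, and actions trigger the actual computation. As a quick sketch (the output is omitted here, since it depends on the file's contents), we could filter the email's lines and count the result:
scala> val nonEmpty = email.filter(line => line.nonEmpty) // transformation: lazily defines a new RDD
scala> nonEmpty.count() // action: runs the computation and returns the number of non-empty lines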
email is an RDD, with each element corresponding to a line in the input file. Notice how we created the RDD by calling the textFile method on an object called sc:
scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c
sc is a SparkContext instance, an object representing the connection to the Spark cluster.
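The Spark shell creates sc for us; in a standalone application we would build one ourselves. Here is a minimal sketch, assuming a local master and an illustrative application name (EmailApp is hypothetical, not from the book's examples):

import org.apache.spark.{SparkConf, SparkContext}

object EmailApp {  // hypothetical application name, for illustration
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark on all local cores; replace with a cluster URL as needed
    val conf = new SparkConf().setAppName("EmailApp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val email = sc.textFile("ham/9-463msg1.txt")
    println(email.count())  // same action as in the shell session above
    sc.stop()               // release the context's resources
  }
}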