Apache Spark is a “fast and general-purpose cluster computing system.” Its main abstraction is a distributed collection of items (such as log records, FASTQ sequences, or employee records) called a resilient distributed dataset (RDD). We can create RDDs from Hadoop InputFormats (such as HDFS files), by transforming other RDDs, or from Java/Scala “collection” data structures (such as Maps and Lists) and other persistent data stores. The main purpose of an RDD is to support higher-level, parallel operations on data through a simple API.
This appendix will introduce Spark RDDs through simple Java examples. Its purpose is not to dive into the architectural details of RDDs, but merely to show you how to utilize RDDs in MapReduce and other general-purpose programs (which Spark expresses as directed acyclic graphs, or DAGs). You can consider an RDD of type T a handle for a collection of items of type T, which are the result of some computation. T can be any standard Java data type (such as List) or any custom object (such as Mutation). A Spark RDD supports actions (such as saveAsTextFile()) and transformations (such as reduceByKey()), which can be chained to express more complex computations. All the examples provided here are based on Spark 1.1.0.
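As a minimal sketch of how these pieces fit together, the following word-count example creates an RDD from an in-memory Java collection, applies transformations (flatMap(), mapToPair(), reduceByKey()), and triggers an action (collect()). It assumes the spark-core 1.1.0 artifact is on the classpath and Java 8 for the lambda syntax; the class name and the local master URL are illustrative, not part of the Spark API.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Illustrative class name; requires spark-core 1.1.0 on the classpath.
public class RDDWordCountSketch {
    public static void main(String[] args) {
        // "local" runs Spark in-process with one thread; in production the
        // master URL would point at a cluster.
        SparkConf conf = new SparkConf()
            .setAppName("rdd-word-count-sketch")
            .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD<String> from an in-memory Java collection.
        List<String> lines = Arrays.asList("a b", "b c", "a a");
        JavaRDD<String> linesRDD = sc.parallelize(lines);

        // Transformations (lazy): split lines into words, map each word to
        // (word, 1), then sum the counts per word with reduceByKey().
        JavaPairRDD<String, Integer> counts = linesRDD
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((x, y) -> x + y);

        // Action: collect() materializes the result on the driver;
        // saveAsTextFile(path) would instead write it to storage.
        System.out.println(counts.collect());

        sc.stop();
    }
}
```

Note that transformations are lazy: Spark records them in the DAG but runs nothing until an action such as collect() or saveAsTextFile() is invoked.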
Typically, Spark programs can run faster ...