Appendix B. Spark RDDs

Apache Spark is a “fast and general-purpose cluster computing system.” Its main abstraction is a distributed collection of items (such as log records, FASTQ sequences, or employee records) called a Resilient Distributed Dataset (RDD). We can create RDDs from Hadoop InputFormats (such as HDFS files), by transforming other RDDs, or by parallelizing Java/Scala collection objects (such as Lists and Maps); RDDs can also be built from other persistent data stores. The main purpose of an RDD is to support higher-level, parallel operations on data through a simple API (such as JavaRDD and JavaPairRDD).
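As a minimal sketch of these creation methods (the input path, application name, and class name below are placeholders, not examples from this book), the following Java program builds a JavaRDD from a text file, from an in-memory collection via parallelize(), and by transforming an existing RDD:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class CreateRDDExamples {
   public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("create-rdd-examples");
      JavaSparkContext ctx = new JavaSparkContext(conf);

      // 1. RDD from a Hadoop InputFormat (here, a text file on HDFS
      //    or the local filesystem); the path is a placeholder
      JavaRDD<String> records = ctx.textFile("/tmp/input/sample.txt");

      // 2. RDD from an in-memory Java collection
      List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
      JavaRDD<Integer> numbersRDD = ctx.parallelize(numbers);

      // 3. RDD created by transforming another RDD
      JavaRDD<String> upperCased = records.map(
         new Function<String, String>() {
            public String call(String record) {
               return record.toUpperCase();
            }
         });

      System.out.println("records: " + upperCased.count());
      System.out.println("numbers: " + numbersRDD.count());
      ctx.stop();
   }
}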

This appendix introduces Spark RDDs through simple Java examples. Its purpose is not to dive into the architectural details of RDDs, but merely to show you how to use RDDs in MapReduce or general-purpose programs (which Spark executes as directed acyclic graphs, or DAGs). You can think of an RDD as a handle for a collection of items of type T, which are the result of some computation. Type T can be any standard Java data type (such as String, Integer, Map, or List) or any custom object (such as Employee or Mutation). An RDD supports two kinds of operations: transformations (such as map(), filter(), union(), groupByKey(), and reduceByKey()), which produce new RDDs, and actions (such as reduce(), collect(), count(), and saveAsTextFile()), which return results to the driver program; combining them lets you express more complex computations. All the examples provided here are based on Spark 1.1.0.
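To make the distinction concrete, here is a small word-count-style sketch (the sample input lines and class name are illustrative assumptions) that chains several transformations and then triggers them with actions, using the Spark 1.1.0 Java API:

import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public class TransformationsAndActions {
   public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("rdd-demo");
      JavaSparkContext ctx = new JavaSparkContext(conf);

      // sample input; illustrative only
      JavaRDD<String> lines = ctx.parallelize(
         Arrays.asList("a fox jumped", "a fox", "jumped high"));

      // transformation: split each line into words
      JavaRDD<String> words = lines.flatMap(
         new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
               return Arrays.asList(line.split(" "));
            }
         });

      // transformation: map each word to a (word, 1) pair
      JavaPairRDD<String, Integer> ones = words.mapToPair(
         new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
               return new Tuple2<String, Integer>(word, 1);
            }
         });

      // transformation: sum the counts for each word
      JavaPairRDD<String, Integer> counts = ones.reduceByKey(
         new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
               return a + b;
            }
         });

      // actions: materialize results on the driver
      long totalWords = words.count();
      List<Tuple2<String, Integer>> output = counts.collect();
      for (Tuple2<String, Integer> pair : output) {
         System.out.println(pair._1 + ": " + pair._2);
      }
      System.out.println("total words: " + totalWords);

      ctx.stop();
   }
}

Note that transformations are lazy: nothing is computed until an action such as count() or collect() is invoked.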

Typically, Spark programs can run faster than their equivalent Hadoop MapReduce programs, because Spark performs most of its computation in memory and avoids writing intermediate results to disk between stages.
