Data Algorithms by Mahmoud Parsian

Appendix B. Spark RDDs

Apache Spark is a “fast and general-purpose cluster computing system.” Its main abstraction is a distributed collection of items (such as log records, FASTQ sequences, or employee records) called a resilient distributed dataset (RDD). We can create RDDs from Hadoop InputFormats (such as HDFS files), by transforming other RDDs, or from in-memory collections (such as Java/Scala Lists and Maps) and other persistent data stores. The main purpose of an RDD is to support higher-level, parallel operations on data through a simple API (such as JavaRDD and JavaPairRDD).
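As a minimal sketch of these two creation paths (the application name, local master, and input path below are placeholders, not values from this book), the following Java program builds one RDD from a text file and another from an in-memory collection:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDCreationExample {
  public static void main(String[] args) {
    // Placeholder application name and a local master; on a real cluster
    // the master is normally supplied by spark-submit instead.
    SparkConf conf = new SparkConf()
        .setAppName("rdd-creation-example")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // 1. Create an RDD from a Hadoop InputFormat (here a text file;
    //    the path is a placeholder and may point to a local or HDFS file).
    JavaRDD<String> lines = sc.textFile("/tmp/sample-input.txt");

    // 2. Create an RDD from an in-memory Java collection.
    List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);

    // An RDD is only a handle; nothing is read or materialized until an
    // action such as count() is invoked.
    System.out.println("numbers: " + numbersRDD.count());

    sc.stop();
  }
}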

This appendix introduces Spark RDDs through simple Java examples. Its purpose is not to dive into the architectural[35] details of RDDs, but merely to show you how to use RDDs in MapReduce or general-purpose programs (expressed as directed acyclic graphs, or DAGs). You can think of an RDD as a handle for a collection of items of type T, which are the result of some computation. Type T can be any standard Java data type (such as String, Integer, Map, or List) or any custom object (such as Employee or Mutation). An RDD supports two kinds of operations: transformations (such as map(), filter(), union(), groupByKey(), and reduceByKey()), which produce new RDDs, and actions (such as reduce(), collect(), count(), and saveAsTextFile()), which return a value to the driver or write data out; these can be chained for more complex computations. All the examples provided here are based on Spark-1.1.0.
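To make the distinction concrete, here is a small word-count sketch (the sample words and application name are made up for illustration): the transformations mapToPair() and reduceByKey() build per-word counts, and the actions collect() and count() pull results back to the driver. mapToPair() is simply the Java-API form of map() that yields key/value pairs.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class TransformationsAndActions {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("rdd-ops-example").setMaster("local[*]"));

    JavaRDD<String> words = sc.parallelize(
        Arrays.asList("fox", "dog", "fox", "cat", "dog", "fox"));

    // Transformation: map each word to a (word, 1) pair.
    JavaPairRDD<String, Integer> ones = words.mapToPair(
        new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String w) {
            return new Tuple2<String, Integer>(w, 1);
          }
        });

    // Transformation: sum the counts for each word.
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer a, Integer b) {
            return a + b;
          }
        });

    // Action: bring the results back to the driver and print them.
    for (Tuple2<String, Integer> t : counts.collect()) {
      System.out.println(t._1() + " -> " + t._2());
    }

    // Action: count the number of distinct words.
    System.out.println("distinct words: " + counts.count());

    sc.stop();
  }
}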

Typically, Spark programs can run faster ...
