Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced
778 pages
English
O'Reilly Media, Inc.

Appendix B. Spark RDDs

Apache Spark is a “fast and general-purpose cluster computing system.” Its main abstraction is a distributed collection of items (such as log records, FASTQ sequences, or employee records) called a resilient distributed dataset (RDD). We can create RDDs from Hadoop InputFormats (such as HDFS files), from other persistent data stores, by transforming other RDDs, or from Java/Scala collection objects (such as Lists and Maps). The main purpose of an RDD is to support higher-level, parallel operations on data through a simple API (in Java, the JavaRDD and JavaPairRDD classes).

This appendix introduces Spark RDDs through simple Java examples. Its purpose is not to dive into the architectural[35] details of RDDs, but to show you how to use RDDs in MapReduce or general-purpose programs (expressed as directed acyclic graphs, or DAGs). You can think of an RDD as a handle for a collection of items of type T that results from some computation. Type T can be any standard Java data type (such as String, Integer, Map, or List) or any custom object type (such as Employee or Mutation). An RDD supports actions (such as reduce(), collect(), count(), and saveAsTextFile()), which return a result or write output, and transformations (such as map(), filter(), union(), groupByKey(), and reduceByKey()), which produce new RDDs; the two can be combined for more complex computations. All the examples provided here are based on Spark 1.1.0.
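
To make the shape of a transformation chain followed by an action more concrete, here is a minimal local sketch. It is an analogy only: it uses plain `java.util.stream` on an in-memory List rather than Spark itself, so it runs without a cluster or the Spark libraries. The class name `RddAnalogy`, its two methods, and the sample log records are hypothetical and do not come from the book; the comments name the RDD operation each step loosely corresponds to. It also assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local, single-JVM analogy only: no Spark classes are used here.
// RddAnalogy and its methods are hypothetical names for illustration.
class RddAnalogy {

    // Roughly mirrors: filter() -> map to (key, 1) pairs -> reduceByKey(sum) -> collect().
    static Map<String, Integer> countByLevel(List<String> logRecords) {
        return logRecords.stream()
                .filter(line -> !line.isEmpty())      // like filter(): drop empty records
                .map(line -> line.split(" ")[0])      // like map(): keep the first token as the key
                .collect(Collectors.toMap(
                        key -> key,                   // key of each (key, value) pair
                        key -> 1,                     // each record contributes a count of 1
                        Integer::sum,                 // like reduceByKey((a, b) -> a + b)
                        TreeMap::new));               // sorted map, so printed order is stable
    }

    // Roughly mirrors: filter() followed by the count() action.
    static long countMatching(List<String> logRecords, String level) {
        return logRecords.stream()
                .filter(line -> line.startsWith(level + " "))
                .count();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "ERROR disk full", "INFO started", "", "ERROR timeout", "WARN slow");
        System.out.println(countByLevel(records));
        System.out.println(countMatching(records, "ERROR"));
    }
}
```

In a Spark program the same pipeline would be written against JavaRDD and JavaPairRDD, and the data would be partitioned across the cluster instead of held in a single List.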

Typically, Spark programs can run faster ...

ISBN: 9781491906170