Apache Spark Quick Start Guide

by Shrey Mehrotra, Akash Grade
January 2019
Beginner to intermediate
154 pages
4h 31m
English
Packt Publishing
Content preview from Apache Spark Quick Start Guide

Programming using RDDs

An RDD can be created in four ways:

  • Parallelize a collection: This is one of the easiest ways to create an RDD. You can take an existing collection from your program, such as a List, an Array, or a Set, and ask Spark to distribute it across the cluster so that it can be processed in parallel. A collection is distributed with the help of parallelize(), as shown here (a note on controlling the number of partitions follows this list):
# Python
numberRDD = spark.sparkContext.parallelize(range(1, 10))
numberRDD.collect()

Out[4]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

 The following code performs the equivalent operation in Scala (note that Scala's 1 to 10 includes 10, whereas Python's range(1, 10) stops at 9, which is why the outputs differ):

// Scala
val numberRDD = spark.sparkContext.parallelize(1 to 10)
numberRDD.collect()

res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  • From an external dataset ...
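The preview cuts off at this second item. As a hedged, minimal sketch of what creating an RDD from an external dataset typically looks like (textFile() is standard Spark API; the file path below is hypothetical):

# Python -- assumes an active SparkSession bound to `spark`
linesRDD = spark.sparkContext.textFile("data/sample.txt")  # hypothetical path
linesRDD.count()  # each element of the RDD is one line of the file

textFile() can read from any Hadoop-supported storage (local filesystem, HDFS, S3, and so on) and returns an RDD of strings, one per line.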
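Returning to the first item: parallelize() also accepts an optional second argument that controls how many partitions the collection is split into. A minimal sketch, assuming an active SparkSession bound to spark as in the examples above:

# Python
numberRDD = spark.sparkContext.parallelize(range(1, 10), 4)  # split into 4 partitions
numberRDD.getNumPartitions()  # 4
numberRDD.sum()  # 45, computed in parallel across the partitions

The partition count determines how many tasks Spark can run in parallel over this RDD; if it is omitted, Spark derives a default from the cluster configuration.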


ISBN: 9781789349108