Chapter 2. Downloading Apache Spark and Getting Started

In this chapter, we will get you set up with Spark and walk through three simple steps you can take to get started writing your first standalone application.

We will use local mode, where all the processing is done on a single machine in a Spark shell. This is an easy way to learn the framework, offering a quick feedback loop for iteratively trying out Spark operations. Using a Spark shell, you can prototype Spark operations with small data sets before writing a complex Spark application. For large data sets, or for real work where you want to reap the benefits of distributed execution, local mode is not suitable; you'll want to use the YARN or Kubernetes deployment modes instead.
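For instance, once Spark is installed (Step 1 below), local mode is selected with the `--master` option when launching a shell. This is a sketch, not a command you need to run yet; the thread counts are illustrative:

```shell
# Launch the Scala shell on a single machine.
# local[N] runs Spark with N worker threads; local[*] uses every available core.
./bin/spark-shell --master "local[*]"

# The same flag works for the Python and R shells:
./bin/pyspark --master "local[2]"   # two worker threads
./bin/sparkR  --master "local"      # a single thread
```

If you omit `--master`, the shells default to local mode anyway, so for quick experimentation simply running `./bin/spark-shell` is enough.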

While the Spark shell supports only Scala, Python, and R, you can write a Spark application in any of the supported languages (including Java) and issue queries in Spark SQL. We do expect you to have some familiarity with the language of your choice.

Step 1: Downloading Apache Spark

To get started, go to the Spark download page, select “Pre-built for Apache Hadoop 2.7” from the drop-down menu in step 2, and click the “Download Spark” link in step 3 (Figure 2-1).

Figure 2-1. The Apache Spark download page

This will download the tarball spark-3.0.0-preview2-bin-hadoop2.7.tgz, which contains all the Hadoop-related binaries you will need to run Spark ...
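Once the download finishes, unpack the tarball and move into the resulting directory. A minimal sketch, assuming you run it from the directory where the file was saved:

```shell
# Extract the gzipped tarball (-x extract, -z gunzip, -f file)
tar -xzf spark-3.0.0-preview2-bin-hadoop2.7.tgz
cd spark-3.0.0-preview2-bin-hadoop2.7

# Sanity check: the bin/ directory holds the launch scripts
# (spark-shell, pyspark, sparkR, spark-submit, spark-sql)
ls bin
```

Everything in the rest of this chapter is run from inside this unpacked directory, so you may also want to add its `bin` subdirectory to your `PATH`.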
