Chapter 7. Running on a Cluster

Introduction

Up to now, we’ve focused on learning Spark by using the Spark shell and examples that run in Spark’s local mode. One benefit of writing applications on Spark is the ability to scale computation by adding more machines and running in cluster mode. The good news is that writing applications for parallel cluster execution uses the same API you’ve already learned in this book. The examples and applications you’ve written so far will run on a cluster “out of the box.” This is one of the benefits of Spark’s higher-level API: users can rapidly prototype applications locally on smaller datasets, then run the same, unmodified code even on very large clusters.
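To make this concrete, here is a minimal word-count application written as a sketch; the object name WordCount and the convention of passing input and output paths as command-line arguments are our choices for illustration, not fixed by Spark. Notice that nothing in the code refers to a particular master or cluster:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // No master URL is hard-coded here; it is supplied at launch time,
    // so the same build runs in local mode or on a cluster unchanged.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))      // input path from the command line
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))         // output directory from the command line
    sc.stop()
  }
}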

This chapter first explains the runtime architecture of a distributed Spark application, then discusses options for running Spark in distributed clusters. Spark can run on a wide variety of cluster managers (Hadoop YARN, Apache Mesos, and Spark’s own built-in Standalone cluster manager) in both on-premises and cloud deployments. We’ll discuss the trade-offs and configurations required for running in each case. Along the way we’ll also cover the “nuts and bolts” of scheduling, deploying, and configuring a Spark application. After reading this chapter, you’ll have everything you need to run a distributed Spark program. The following chapter will cover tuning and debugging applications.
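As a preview of how this works in practice, the same application JAR can be launched locally or against a cluster simply by changing the --master flag passed to spark-submit. The sketch below assumes the WordCount application shown earlier; the JAR name, host name, and file paths are placeholders:

# Local mode: driver and executors run in a single JVM,
# with as many worker threads as there are cores.
bin/spark-submit --class WordCount --master "local[*]" \
  wordcount.jar input.txt output/

# Standalone mode: submit the identical JAR to a Spark Standalone
# master (hypothetical host, default port 7077).
bin/spark-submit --class WordCount --master spark://masternode:7077 \
  wordcount.jar hdfs://namenode/input.txt hdfs://namenode/output/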

Spark Runtime Architecture

Before we dive into the specifics of running Spark on a cluster, it’s helpful to understand the architecture of Spark in distributed mode.
