Spark by Brennon York, Kai Sasaki, Ema Orhian, Ilya Ganelin

CHAPTER 2 Cluster Management

In Chapter 1, we learned about some of the history of distributed computing and how Spark developed within the broader data ecosystem. Much of the challenge of working with distributed systems comes from managing the resources that these systems need in order to run. The fundamental problem is one of scarcity: there is a finite pool of scarce resources, and a multitude of applications demand those resources in order to execute. Distributed computing would not be possible without a system that manages these resources and schedules their distribution.

The power of Spark is that it handles much of the heavy lifting of creating parallelized applications, which can easily migrate from running on a single machine to running in a distributed environment. Cluster management tools play a critical role in this abstraction. The cluster manager hides the complex dance of resource scheduling and distribution from the application, in our case Spark. This makes it possible for Spark to use the resources of one machine, ten machines, or a thousand machines without fundamentally changing its underlying implementation.
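The separation described above can be illustrated with a small analogy in plain Python. This is only a sketch and does not use Spark or any real cluster manager: `count_words` and `run_job` are hypothetical names, and a standard-library thread pool stands in for the cluster manager. The point it tries to show is that the application logic is written once, while the number of workers is decided elsewhere.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(line: str) -> int:
    """Application logic: it has no idea how many workers exist."""
    return len(line.split())


def run_job(lines, num_workers):
    """Hypothetical stand-in for a cluster manager.

    It decides how the per-line tasks are spread over the available
    workers; the application function above is never modified.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(count_words, lines))


lines = [
    "spark abstracts the cluster",
    "one machine or many",
    "same code",
]

# Run the identical job with different amounts of "resources".
totals = {n: run_job(lines, n) for n in (1, 2, 4)}
print(totals)
```

Each entry in `totals` holds the same word count, since only the scheduling changes between runs. A real cluster manager also has to deal with multiple machines, failures, and competing applications, none of which this toy models.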

Spark is built on layers of abstractions that ultimately resolve into a simple and intuitive interface for its users. The core abstraction of the RDD, or DataFrame, transforms what would otherwise be a multitude of data stored in a distributed environment into a single object that masks the distributed nature of ...
