CHAPTER 2 Cluster Management

In Chapter 1, we learned about some of the history of distributed computing and how Spark developed within the broader data ecosystem. Much of the challenge of working with distributed systems comes simply from managing the resources those systems need in order to run. The fundamental problem is one of scarcity: there is a finite pool of resources, and a multitude of applications demand those resources in order to execute. Distributed computing would not be possible without a system that manages these resources and schedules their distribution.

The power of Spark is that it abstracts away much of the heavy lifting of creating parallelized applications, which can then migrate easily from running on a single machine to running in a distributed environment. Cluster management tools play a critical role in this abstraction. The cluster manager hides the complex dance of resource scheduling and distribution from the application, which in our case is Spark. This makes it possible for Spark to readily use the resources of one machine, ten machines, or a thousand machines without fundamentally changing its underlying implementation.
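To make this concrete, here is a minimal sketch of that idea (the application name, input path, and word-count logic are illustrative, not drawn from this book). The application code is identical whether the master URL points at a single local machine or at a cluster manager; in practice the master is usually supplied at submit time with spark-submit's --master flag rather than hardcoded:

```scala
import org.apache.spark.sql.SparkSession

// The same application runs unchanged on one machine or a thousand.
// Only the master URL, normally passed via spark-submit --master,
// tells Spark which cluster manager (if any) to request resources from.
val spark = SparkSession.builder()
  .appName("WordCount")
  // .master("local[*]")          // all cores of a single machine
  // .master("spark://host:7077") // Spark standalone cluster manager
  // .master("yarn")              // Hadoop YARN
  .getOrCreate()

// Count word occurrences in a (hypothetical) text file.
val counts = spark.read.textFile("input.txt")
  .selectExpr("explode(split(value, ' ')) AS word")
  .groupBy("word")
  .count()

counts.show()
```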

Spark is built on layers of abstractions that ultimately resolve into a simple and intuitive interface for its users. The core abstraction of the RDD, or DataFrame, transforms what would otherwise be data scattered across a distributed environment into a single object that masks the distributed nature of the data it represents.
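As a sketch of what that single-object view looks like in practice (using the RDD API, with a small illustrative range standing in for genuinely distributed data), the code below reads like ordinary operations on a local Scala collection, yet each transformation executes in parallel across however many partitions the cluster provides:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RDDSketch").getOrCreate()
val sc = spark.sparkContext

// One logical collection, transparently split into partitions that
// may live on many different machines.
val numbers = sc.parallelize(1 to 1000000)

// Written as if operating on a single local collection; Spark applies
// the map and reduce in parallel, one task per partition.
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(sumOfSquares)
```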
