Chapter 6. Clusters

I have a very large army and very large dragons.

—Daenerys Targaryen

Previous chapters focused on using Spark on a single computing instance: your personal computer. In this chapter, we introduce techniques for running Spark across multiple computing instances, also known as a computing cluster. This chapter and subsequent ones introduce and use concepts that apply to computing clusters; however, you do not need a cluster to follow along, so you can keep using your personal computer. Although previous chapters focused on a single computing instance, all the data analysis and modeling techniques we presented also work in a computing cluster without any code changes.

If you already have a Spark cluster in your organization, consider skipping to Chapter 7, which teaches you how to connect to an existing cluster. If you don’t have a cluster, or you are considering improvements to your existing infrastructure, this chapter introduces the cluster trends, managers, and providers available today.

Overview

There are three major trends in cluster computing worth discussing: on-premises, cloud computing, and Kubernetes. Framing these trends over time helps us understand how they came to be, what they are, and what their future might be. To illustrate this, Figure 6-1 plots these trends over time using data from Google Trends.

For on-premises clusters, you or someone in your organization purchased ...
