Chapter 2. Clusters

Before diving into modern cluster infrastructure, we consider clusters from a wider perspective in this chapter, showing how multiple modern data platforms fit together within the enterprise context.

First, we dispel the myth of the single cluster, describing how and why organizations choose to deploy multiple clusters. We then briefly look at the black art of cluster sizing and cluster growth, and finally at the data replication implications of deploying multiple clusters.

Reasons for Multiple Clusters

The aspiration to have a single large cluster that stores everything and removes data silos is tantalizing to many organizations, but the reality is that multiple clusters are inevitable—particularly within an enterprise setting. As we describe in this section, there are many valid reasons for deploying multiple clusters, and they all have one thing in common: the need for independence.

Multiple Clusters for Resiliency

Architecting a system for resilient operation involves ensuring that its elements are highly available and designing out single points of failure such as power or cooling, as discussed in Chapters 6 and 12. Ultimately, though, every cluster sits within a single point of failure simply because of its geographic location, and this is true even of a cloud deployment built across multiple availability zones (AZs).

Total system resiliency can therefore only be assured by using multiple datacenters in different geographic regions, ensuring that business processes can withstand even ...
