Chapter 1. Introduction to Multi-Tenant Distributed Systems
The Benefits of Distributed Systems
The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe.
The tremendous scale of these services would not be possible without distributed systems. Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.
Performance Problems in Distributed Systems
As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications’ performance.
These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) The challenges that come with multitenancy result from the diversity of applications running together on any node, as well as from the fact that the applications are written by many different developers rather than by one engineering team focused on ensuring that everything in a single distributed application works well together.
One of the primary challenges in a distributed system is scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed-system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both. Examples of such limitations include assuming the worst-case resource usage to avoid overcommitting, failing to plan for different resource types across different applications, and overlooking one or more dependencies, thus causing deadlock or starvation.
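To make the first of these limitations concrete, the following minimal sketch shows how reserving for worst-case usage can leave a node underutilized. The node size, the task memory figures, and the names `TaskRequest` and `admit_by_worst_case` are hypothetical, chosen only for illustration; real schedulers are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """Memory profile of one task in GB (hypothetical numbers)."""
    peak_gb: float     # worst case, which the scheduler reserves
    typical_gb: float  # what the task uses most of the time


def admit_by_worst_case(node_capacity_gb, requests):
    """Admit tasks in order while the sum of peak reservations fits on the node."""
    admitted, reserved = [], 0.0
    for req in requests:
        if reserved + req.peak_gb <= node_capacity_gb:
            admitted.append(req)
            reserved += req.peak_gb
    return admitted


def typical_utilization(node_capacity_gb, admitted):
    """Fraction of node memory in use when every task is at its typical usage."""
    return sum(r.typical_gb for r in admitted) / node_capacity_gb


# Assumed workload: 20 identical tasks competing for one 64 GB node.
requests = [TaskRequest(peak_gb=8, typical_gb=3) for _ in range(20)]
admitted = admit_by_worst_case(64, requests)
utilization = typical_utilization(64, admitted)

print(len(admitted), utilization)  # prints: 8 0.375
```

Under these assumed numbers, the node is fully reserved after eight tasks, yet only 37.5 percent of its memory is typically in use while the remaining twelve tasks wait in the queue.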
The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can increase latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.
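The latency effect of spreading slots too thinly can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (identical jobs, tasks that each take exactly one time step, a hypothetical `simulate` function), not a model of any real scheduler, and it captures only the delay in job completion, not the additional inefficiency from running disparate work side by side.

```python
def simulate(task_counts, slots, fair):
    """Return the time step at which each job finishes.

    task_counts: number of tasks per job; every task takes one time step.
    fair=True hands out slots round-robin across unfinished jobs;
    fair=False gives all free slots to the earliest unfinished job first.
    """
    remaining = list(task_counts)
    finish = [0] * len(remaining)
    t = 0
    while any(r > 0 for r in remaining):
        t += 1
        free = slots
        active = [i for i, r in enumerate(remaining) if r > 0]
        alloc = {i: 0 for i in active}
        if fair:
            # One slot per job per pass until slots or runnable tasks run out.
            while free > 0 and any(alloc[i] < remaining[i] for i in active):
                for i in active:
                    if free > 0 and alloc[i] < remaining[i]:
                        alloc[i] += 1
                        free -= 1
        else:
            for i in active:
                take = min(free, remaining[i])
                alloc[i] = take
                free -= take
        for i in active:
            remaining[i] -= alloc[i]
            if remaining[i] == 0:
                finish[i] = t
    return finish


# Assumed workload: 4 jobs of 100 tasks each on a cluster with 100 slots.
jobs = [100, 100, 100, 100]
in_order = simulate(jobs, slots=100, fair=False)
spread = simulate(jobs, slots=100, fair=True)

print(in_order, sum(in_order) / len(in_order))  # prints: [1, 2, 3, 4] 2.5
print(spread, sum(spread) / len(spread))        # prints: [4, 4, 4, 4] 4.0
```

In this toy model both policies keep the cluster equally busy and finish all work at step 4, but spreading slots evenly delays every job to the last step, whereas running jobs in order lets three of the four finish earlier.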
Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O, slowing down every other job. These potential problems are only exacerbated in a multi-tenant environment—usage of a given hardware resource such as CPU or disk is often less efficient when a node has many different processes running on it. In addition, operators cannot tune the cluster for a particular access pattern, because the access patterns are both diverse and constantly changing. (Again, contrast this situation with a farm of servers, each of which is independently running a single application, or a large cluster running a single coherently designed and tuned application like a search engine.)
Distributed systems are also subject to performance problems due to bottlenecks from centralized services used by every node in the system. One common example is the master node that performs job admission and scheduling; others include the master node of the distributed file system that stores the cluster's data, as well as common services such as Domain Name System (DNS) servers.
These potential performance challenges are exacerbated by the fact that a primary design goal for many modern distributed systems is to enable large numbers of developers, data scientists, and analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such as high-performance computing (HPC) systems, in which the only people who could write programs to run on the cluster had a systems programming background. Today, distributed systems are opening up enormous computing power to people without a systems background, who often don’t understand or even think about system performance. Such a user might easily write a job that accidentally brings a cluster to its knees, affecting every other job and user.
Lack of Visibility Within Multi-Tenant Distributed Systems
Because multi-tenant distributed systems simultaneously run many applications, each with different performance characteristics and written by different developers, it can be difficult to determine what’s going on with the system, whether (and why) there’s a problem, which users and applications are the cause of any problem, and what to do about such problems.
Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they lack visibility into detailed hardware usage by each process. Major blind spots can result—when there’s a performance problem, operators are unable to pinpoint exactly which application caused it, or what to do about it. Similarly, application-level monitoring systems tend to focus on overall application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for actual hardware resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.
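The difference between node-level and process-level visibility, and the data volume the latter implies, can be sketched as follows. The `Sample` record, the application names, and all figures are hypothetical; the sketch assumes that some collector already tags each process with the application it belongs to.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    """One per-process measurement (hypothetical schema)."""
    node: str
    pid: int
    app: str
    cpu_seconds: float
    disk_bytes: int


def total_disk_bytes(samples, key):
    """Sum disk bytes grouped by the given Sample field name."""
    totals = defaultdict(int)
    for s in samples:
        totals[getattr(s, key)] += s.disk_bytes
    return dict(totals)


def samples_per_day(nodes, processes_per_node, metrics_per_process, interval_seconds):
    """Rough count of data points a per-process monitor would collect daily."""
    return nodes * processes_per_node * metrics_per_process * (86_400 // interval_seconds)


samples = [
    Sample("node01", 4101, "etl-nightly", 12.0, 5_000_000),
    Sample("node01", 4177, "adhoc-query", 3.5, 90_000_000),
    Sample("node02", 3920, "etl-nightly", 11.0, 4_000_000),
    Sample("node02", 3988, "adhoc-query", 4.0, 110_000_000),
]

by_node = total_disk_bytes(samples, "node")  # what node-level monitoring sees
by_app = total_disk_bytes(samples, "app")    # what per-process data adds

print(by_node)  # prints: {'node01': 95000000, 'node02': 114000000}
print(by_app)   # prints: {'etl-nightly': 9000000, 'adhoc-query': 200000000}
print(samples_per_day(1000, 50, 10, 5))  # prints: 8640000000
```

The node-level totals show only that both nodes are doing heavy disk I/O; the per-application totals show which application is responsible. The final line suggests why scale is a concern: under the assumed figures of 1,000 nodes, 50 processes per node, 10 metrics per process, and a 5-second sampling interval, the system would collect more than eight billion data points per day.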
The Impact on Business from Performance Problems
The performance challenges described in this book can easily lead to business impacts such as the following:
- Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly, and the ingestion and processing of new incoming data for use by other applications might be delayed.
- Underutilized hardware
Job queues can appear full even when the cluster hardware is not running at full capacity. This inefficiency can result in higher capital and operating expenses; it can also result in significant delays for new projects due to insufficient hardware, or even the need to build out new data-center space to add new machines for additional processing power.
- Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might become overloaded, so applications cannot run or are significantly delayed in accessing data.
Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but ultimately more significant ways. Organizations might informally “learn” that a multi-tenant cluster is unpredictable and build implicit or explicit processes to work around the unpredictability, such as the following:
- Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.
- Build separate clusters for different groups or different workloads so that the most important applications are insulated from others. Doing so increases overall cost due to inefficiency in resource usage, adds operational overhead and cost, and reduces the ability to share data across groups.
- Set up “development” and “production” clusters, with a committee or other cumbersome process to approve jobs before they can be run on a production cluster. Adding these hurdles can dramatically hinder innovation, because they significantly slow the feedback loop of learning from production data, building and testing a new model or new feature, deploying it to production, and learning again.1
These responses to unpredictable performance can limit a business’s ability to fully benefit from the potential of distributed systems. Eliminating performance problems on the cluster can improve performance of the business overall.
Scope of This Book
In this book, we consider the performance challenges that arise from scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions that organizations use today to overcome these challenges and benefit from the tremendous scale and efficiency of distributed systems.
Hadoop: An Example Distributed System
This book uses Hadoop as an example of a multi-tenant distributed system. Hadoop serves as an ideal example of such a system because of its broad adoption across a variety of industries, from healthcare to finance to transportation. Due to its open source availability and a robust ecosystem of supporting applications, Hadoop’s adoption is increasing among small and large organizations alike.
Hadoop is also an ideal example because it is used in highly multi-tenant production deployments (running jobs from many hundreds of developers) and is often used to simultaneously run large batch jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it suffers from all of the performance challenges described herein.
Of course, Hadoop is not the only important distributed system; a few other examples include the following:2
- Classic HPC clusters using MPI, TORQUE, and Moab
- Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB
- Render farms used for animation
- Simulation systems used for physics and manufacturing
Throughout the book, we use the following sets of terms interchangeably:
- Application or job
A program submitted by a particular user to be run on a distributed system. (In some systems, this might be termed a query.)
- Container or task
An atomic unit of work that is part of a job. This work is done on a single node, generally running as a single (sometimes multithreaded) process on the node.
- Host, machine, or node
A single computing node, which can be an actual physical computer or a virtual machine.
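The relationship among these three terms can be summarized as a small data model. This is only an illustrative sketch; the class and field names are hypothetical and do not correspond to any particular system's API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """A host or machine: a physical computer or a virtual machine."""
    hostname: str


@dataclass
class Task:
    """A container or task: an atomic unit of work that runs on a single node."""
    task_id: str
    node: Node


@dataclass
class Job:
    """An application or job: a program submitted by a particular user."""
    job_id: str
    user: str
    tasks: List[Task] = field(default_factory=list)

    def nodes_used(self) -> Set[str]:
        """Hostnames of all nodes on which this job has tasks."""
        return {t.node.hostname for t in self.tasks}


n1, n2 = Node("node01"), Node("node02")
job = Job("job-1", "alice", [Task("t1", n1), Task("t2", n2), Task("t3", n1)])
print(sorted(job.nodes_used()))  # prints: ['node01', 'node02']
```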
1 We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.
2 Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch. “Perspectives on the CAP Theorem.” Institute of Electrical and Electronics Engineers, 2012 (http://hdl.handle.net/1721.1/79112) and https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed.