Hadoop and Spark Performance for the Enterprise: Ensuring Quality of Service in Multi-Tenant Environments
Modern Hadoop and Spark environments are busy places. Many users run many applications with wildly different workloads (Hive queries, for instance, cheek-by-jowl with long MapReduce jobs), and all of them contend for the same resources. Users are noticing the problems that result from this contention: companies spend big bucks on hardware or on virtual machines (VMs) in the cloud, and still don’t get results in the time they need.
Luckily, you can solve this without spending more and more money on overprovisioned hardware. Instead, you can aim for Quality of Service (QoS) in mixed-workload, multitenant Hadoop and Spark environments. Throughout this report, I will use the term distributed processing to refer to modern Big Data analysis tools such as Hadoop, Spark, and Hive. It’s a very general term that covers long-running jobs such as MapReduce, fast-running in-memory Spark jobs that are often called “real-time,” and other tools in the Hadoop universe.
Let’s take a look at the waste left by distributed processing tasks. When developers submit a distributed processing job, they need to specify up front the amount of CPU required (by specifying the size of the system), the amount of memory to use, and other necessary parameters. But the job's actual hardware requirements (CPU, network, memory, and so on) can change once it is running, so a fixed up-front request can end up too large or too small. The performance company Pepperdata, for instance, ...