Chapter 4. Big Data Architecture and Infrastructure

As noted in O’Reilly’s 2015 Data Science Salary Survey, the same four tools—SQL, Excel, R, and Python—continue to be the most widely used in data science for the third year in a row. Spark also continues to be one of the most active projects in big data, seeing a 17% increase in users over the past 12 months. Matei Zaharia, creator of Spark, outlined in his keynote at Strata + Hadoop San Jose two new goals Spark was pursuing in 2015. The first goal was to make distributed processing tools accessible to a wide range of users, beyond big data engineers. An example of this is seen in the new DataFrames API, inspired by R and Python data frames. The second goal was to enhance integration—to allow Spark to interact efficiently in different environments, from NoSQL stores to traditional data warehouses.  

In many ways, the two goals for Spark in 2015—greater accessibility for a wider user base and greater integration of tools/environments—are consistent with the changes we’re seeing in architecture and infrastructure across the entire big data landscape. In this chapter, we present a collection of blog posts that reflect these changes. 

Ben Lorica documents what startups like Tamr and Trifacta have learned about opening up data analysis to non-programmers. Benjamin Hindman laments the fact that we still don’t have an operating system that abstracts and manages hardware resources in the data center. Jim Scott discusses his use of Myriad to enable Mesos and YARN to work better together (but notes improvements are still needed). Yanpei Chen chronicles what happened when his team at Cloudera used SSDs instead of HDDs to assess their impact on big data (sneak preview: SSDs again offer up to 40% shorter job duration, and 70% higher performance). Finally, Shaoshan Liu discusses how to use Baidu and Tachyon with Spark SQL to increase data processing speed 30-fold.

Lessons from Next-Generation Data-Wrangling Tools

One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in “Big Data’s Big Ideas,” our interest stems from both consumer apps as well as tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

At Strata + Hadoop World New York, two presentations from academic spinoff startups focused on data preparation and curation:

While data wrangling is just one component of a data science pipeline, and granted we’re still in the early days of productivity tools in data science, some of the lessons these companies have learned extend beyond data preparation.

Scalability ~ Data Variety and Size

Not only are enterprises faced with many data stores and spreadsheets, data scientists have many more (public and internal) data sources they want to incorporate. The absence of a global data model means integrating data silos, and data sources require tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data-wrangling “scripts” that can be used on full data sets.

Empower Domain Experts

In many instances, you need subject area experts to explain specific data sets that you’re not familiar with. These experts can place data in context and are usually critical in helping you clean and consolidate variables. Trifacta has tools that enable non-programmers to take on data-wrangling tasks that previously required a fair amount of scripting.

Consider DSLs and Visual Interfaces

Programs written in a [domain specific language] (DSL) also have one other important characteristic: they can often be written by non-programmers…a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.

Paul Hudak, 1997

I’ve often used regular expressions for data wrangling, only to come back later unable to read the code I wrote (Joe Hellerstein describes regex as “meant for writing and never reading again”). Programs written in DSLs are concise, easier to maintain, and can often be written by non-programmers.

Trifacta designed a “readable” DSL for data wrangling but goes one step further: Its users “live in visualizations, not code.” Its elegant visual interface is designed to accomplish most data-wrangling tasks, but it also lets users access and modify accompanying scripts written in Trifacta’s DSL (power users can also use regular expressions).

These ideas go beyond data wrangling. Combining DSLs with visual interfaces can open up other aspects of data analysis to non-programmers.

Intelligence and Automation

If you’re dealing with thousands of data sources, then you’ll need tools that can automate routine steps. Tamr’s next-generation extract, transform, load (ETL) platform uses machine learning in a variety of ways, including schema consolidation and expert (crowd) sourcing.

Many data analysis tasks involve a handful of data sources that require painstaking data wrangling along the way. Scripts to automate data preparation are needed for replication and maintenance. Trifacta looks at user behavior and context to produce “utterances” of its DSL, which users can then edit or modify.

Don’t Forget About Replication

If you believe the adage that data wrangling consumes a lot of time and resources, then it goes without saying that tools like Tamr and Trifacta should produce reusable scripts and track lineage. Other aspects of data science—for example, model building, deployment, and maintenance—need tools with similar capabilities.

Why the Data Center Needs an Operating System

Developers today are building a new class of applications. These applications no longer fit on a single server, but instead run across a fleet of servers in a data center. Examples include analytics frameworks like Apache Hadoop and Apache Spark, message brokers like Apache Kafka, key/value stores like Apache Cassandra, as well as customer-facing applications such as those run by Twitter and Netflix.

These new applications are more than applications—they are distributed systems. Just as it became commonplace for developers to build multithreaded applications for single machines, it’s now becoming commonplace for developers to build distributed systems for data centers.

But it’s difficult for developers to build distributed systems, and it’s difficult for operators to run distributed systems. Why? Because we expose the wrong level of abstraction to both developers and operators: machines.

Machines Are the Wrong Abstraction

Machines are the wrong level of abstraction for building and running distributed applications. Exposing machines as the abstraction to developers unnecessarily complicates the engineering, causing developers to build software constrained by machine-specific characteristics, like IP addresses and local storage. This makes moving and resizing applications difficult if not impossible, forcing maintenance in data centers to be a highly involved and painful procedure.

With machines as the abstraction, operators deploy applications in anticipation of machine loss, usually by taking the easiest and most conservative approach of deploying one application per machine. This almost always means machines go underutilized, as we rarely buy our machines (virtual or physical) to exactly fit our applications, or size our applications to exactly fit our machines.

By running only one application per machine, we end up dividing our data center into highly static, highly inflexible partitions of machines, one for each distributed application. We end up with a partition that runs analytics, another that runs the databases, another that runs the web servers, another that runs the message queues, and so on. And the number of partitions is only bound to increase as companies replace monolithic architectures with service-oriented architectures and build more software based on microservices.

What happens when a machine dies in one of these static partitions? Let’s hope we over-provisioned sufficiently (wasting money), or can re-provision another machine quickly (wasting effort). What about when the web traffic dips to its daily low? With static partitions, we allocate for peak capacity, which means when traffic is at its lowest, all of that excess capacity is wasted. This is why a typical data center runs at only 8%–15% efficiency. And don’t be fooled just because you’re running in the cloud: You’re still being charged for the resources your application is not using on each virtual machine (someone is benefiting—it’s just your cloud provider, not you).

And finally, with machines as the abstraction, organizations must employ armies of people to manually configure and maintain each individual application on each individual machine. People become the bottleneck for trying to run new applications, even when there are ample resources already provisioned that are not being utilized.

If My Laptop Were a Data Center

Imagine if we ran applications on our laptops the same way we run applications in our data centers. Each time we launched a web browser or text editor, we’d have to specify which CPU to use, which memory modules are addressable, which caches are available, and so on. Thankfully, our laptops have an operating system that abstracts us away from the complexities of manual resource management.

In fact, we have operating systems for our workstations, servers, mainframes, supercomputers, and mobile devices, each optimized for their unique capabilities and form factors.

We’ve already started treating the data center itself as one massive warehouse-scale computer. Yet, we still don’t have an operating system that abstracts and manages the hardware resources in the data center just like an operating system does on our laptops.

It’s Time for the Data Center OS

What would an operating system for the data center look like?

From an operator’s perspective, it would span all of the machines in a data center (or cloud) and aggregate them into one giant pool of resources on which applications would be run. You would no longer configure specific machines for specific applications; all applications would be capable of running on any available resources from any machine, even if there are other applications already running on those machines.

From a developer’s perspective, the data center operating system would act as an intermediary between applications and machines, providing common primitives to facilitate and simplify building distributed applications.

The data center operating system would not need to replace Linux or any other host operating systems we use in our data centers today. The data center operating system would provide a software stack on top of the host operating system. Continuing to use the host operating system to provide standard execution environments is critical to immediately supporting existing applications.

The data center operating system would provide functionality for the data center that is analogous to what a host operating system provides on a single machine today: namely, resource management and process isolation. Just like with a host operating system, a data center operating system would enable multiple users to execute multiple applications (made up of multiple processes) concurrently, across a shared collection of resources, with explicit isolation between those applications.

An API for the Data Center

Perhaps the defining characteristic of a data center operating system is that it provides a software interface for building distributed applications. Analogous to the system call interface for a host operating system, the data center operating system API would enable distributed applications to allocate and deallocate resources, launch, monitor, and destroy processes, and more. The API would provide primitives that implement common functionality that all distributed systems need. Thus, developers would no longer need to independently re-implement fundamental distributed systems primitives (and inevitably, independently suffer from the same bugs and performance issues).

Centralizing common functionality within the API primitives would enable developers to build new distributed applications more easily, more safely, and more quickly. This is reminiscent of when virtual memory was added to host operating systems. In fact, one of the virtual memory pioneers wrote that “it was pretty obvious to the designers of operating systems in the early 1960s that automatic storage allocation could significantly simplify programming.”

Example Primitives

Two primitives specific to a data center operating system that would immediately simplify building distributed applications are service discovery and coordination. Unlike on a single host where very few applications need to discover other applications running on the same host, discovery is the norm for distributed applications. Likewise, most distributed applications achieve high availability and fault tolerance through some means of coordination and/or consensus, which is notoriously hard to implement correctly and efficiently.

Developers today are forced to choose between existing tools for service discovery and coordination, such as Apache ZooKeeper and CoreOS’ etcd. This forces organizations to deploy multiple tools for different applications, significantly increasing operational complexity and maintainability.

Having the data center operating system provide primitives for discovery and coordination not only simplifies development, it also enables application portability. Organizations can change the underlying implementations without rewriting the applications, much like you can choose between different filesystem implementations on a host operating system today.

A New Way to Deploy Applications

With a data center operating system, a software interface replaces the human interface that developers typically interact with when trying to deploy their applications today; rather than a developer asking a person to provision and configure machines to run their applications, developers launch their applications using the data center operating system (e.g., via a CLI or GUI), and the application executes using the data center operating system’s API.

This supports a clean separation of concerns between operators and users: Operators specify the amount of resources allocatable to each user, and users launch whatever applications they want, using whatever resources are available to them. Because an operator now specifies how much of any type of resource is available, but not which specific resource, a data center operating system, and the distributed applications running on top, can be more intelligent about which resources to use in order to execute more efficiently and better handle failures. Because most distributed applications have complex scheduling requirements (think Apache Hadoop) and specific needs for failure recovery (think of a database), empowering software to make decisions instead of humans is critical for operating efficiently at data-center scale.

The “Cloud” Is Not an Operating System

Why do we need a new operating system? Didn’t Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) already solve these problems?

IaaS doesn’t solve our problems because it’s still focused on machines. It isn’t designed with a software interface intended for applications to use in order to execute. IaaS is designed for humans to consume, in order to provision virtual machines that other humans can use to deploy applications; IaaS turns machines into more (virtual) machines, but does not provide any primitives that make it easier for a developer to build distributed applications on top of those machines.

PaaS, on the other hand, abstracts away the machines, but is still designed first and foremost to be consumed by a human. Many PaaS solutions do include numerous tangential services and integrations that make building a distributed application easier, but not in a way that’s portable across other PaaS solutions.

Apache Mesos: The Distributed Systems Kernel

Distributed computing is now the norm, not the exception, and we need a data center operating system that delivers a layer of abstraction and a portable API for distributed applications. Not having one is hindering our industry. Developers should be able to build distributed applications without having to reimplement common functionality. Distributed applications built in one organization should be capable of being run in another organization easily.

Existing cloud computing solutions and APIs are not sufficient. Moreover, the data center operating system API must be built, like Linux, in an open and collaborative manner. Proprietary APIs force lock-in, deterring a healthy and innovative ecosystem from growing. It’s time we created the POSIX for distributed computing: a portable API for distributed systems running in a data center or on a cloud.

The open source Apache Mesos project, of which I am one of the co-creators and the project chair, is a step in that direction. Apache Mesos aims to be a distributed systems kernel that provides a portable API upon which distributed applications can be built and run.

Many popular distributed systems have already been built directly on top of Mesos, including Apache SparkApache AuroraAirbnb’s Chronos, and Mesosphere’s Marathon. Other popular distributed systems have been ported to run on top of Mesos, including Apache HadoopApache Storm, and Google’s Kubernetes, to list a few.

Chronos is a compelling example of the value of building on top of Mesos. Chronos, a distributed system that provides highly available and fault-tolerant cron, was built on top of Mesos in only a few thousand lines of code and without having to do any explicit socket programming for network communication.

Companies like Twitter and Airbnb are already using Mesos to help run their data centers, while companies like Google have been using in-house solutions they built almost a decade ago. In fact, just like Google’s MapReduce spurred an industry around Apache Hadoop, Google’s in-house data center solutions have had close ties with the evolution of Mesos.

While not a complete data center operating system, Mesos, along with some of the distributed applications running on top, provide some of the essential building blocks from which a full data center operating system can be built: the kernel (Mesos), a distributed init.d (Marathon/Aurora), cron (Chronos), and more.

A Tale of Two Clusters: Mesos and YARN

This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely isolated to Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster. I break them up this way because Hadoop manages its own resources with Apache YARN (Yet Another Resource Negotiator). Although this is nice for Hadoop, all too often those resources are underutilized when there are no big data workloads in the queue. And then when a big data job comes in, those resources are stretched to the limit, and they are likely in need of more resources. That can be tough when you are on an island.

Hadoop was meant to tear down walls—albeit, data silo walls—but walls, nonetheless. What has happened is that while tearing some walls down, other types of walls have gone up in their place.

Another technology, Apache Mesos, is also meant to tear down walls—but Mesos has often been positioned to manage the “second cluster,” which are all of those other, non-Hadoop workloads.

This is where the story really starts, with these two silos of Mesos and YARN (see Figure 4-1). They are often pitted against each other, as if they were incompatible. It turns out they work together, and therein lies my tale.

Figure 4-1. Isolated clusters (image courtesy of Mesosphere and MapR, used with permission)

Brief Explanation of Mesos and YARN

The primary difference between Mesos and YARN is around their design priorities and how they approach scheduling work. Mesos was built to be a scalable global resource manager for the entire data center. It was designed at UC Berkeley in 2007 and hardened in production at companies like Twitter and Airbnb. YARN was created out of the necessity to scale Hadoop. Prior to YARN, resource management was embedded in Hadoop MapReduce V1, and it had to be removed in order to help MapReduce scale. The MapReduce 1 JobTracker wouldn’t practically scale beyond a couple thousand machines. The creation of YARN was essential to the next iteration of Hadoop’s lifecycle, primarily around scaling.

Mesos Scheduling

Mesos determines which resources are available, and it makes offers back to an application scheduler (the application scheduler and its executor is called a “framework”). Those offers can be accepted or rejected by the framework. This model is considered a non-monolithic model because it is a “two-level” scheduler, where scheduling algorithms are pluggable. Mesos allows an infinite number of schedule algorithms to be developed, each with its own strategy for which offers to accept or decline, and can accommodate thousands of these schedulers running multitenant on the same cluster.

The two-level scheduling model of Mesos allows each framework to decide which algorithms it wants to use for scheduling the jobs that it needs to run. Mesos plays the arbiter, allocating resources across multiple schedulers, resolving conflicts, and making sure resources are fairly distributed based on business strategy. Offers come in, and the framework can then execute a task that consumes those offered resources. Or the framework has the option to decline the offer and wait for another offer to come in. This model is very similar to how multiple apps all run simultaneously on a laptop or smartphone, in that they spawn new threads or request more memory as they need it, and the operating system arbitrates among all of the requests. One of the nice things about this model is that it is based on years of operating system and distributed systems research and is very scalable. This is a model that Google and Twitter have proven at scale.

YARN Scheduling

Now, let’s look at what happens over on the YARN side. When a job request comes into the YARN resource manager, YARN evaluates all the resources available, and it places the job. It’s the one making the decision where jobs should go; thus, it is modeled in a monolithic way. It is important to reiterate that YARN was created as a necessity for the evolutionary step of the MapReduce framework. YARN took the resource-management model out of the MapReduce 1 JobTracker, generalized it, and moved it into its own separate ResourceManager component, largely motivated by the need to scale Hadoop jobs.

YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm. Also, YARN was designed for stateless batch jobs that can be restarted easily if they fail. It does not handle running stateful services like distributed filesystems or databases. While YARN’s monolithic scheduler could theoretically evolve to handle different types of workloads (by merging new algorithms upstream into the scheduling code), this is not a lightweight model to support a growing number of current and future scheduling algorithms.

Is It YARN Versus Mesos?

When comparing YARN and Mesos, it is important to understand the general scaling capabilities and why someone might choose one technology over the other. While some might argue that YARN and Mesos are competing for the same space, they really are not. The people who put these models in place had different intentions from the start, and that’s OK. There is nothing explicitly wrong with either model, but each approach will yield different long-term results. I believe this is the key between when to use one, the other, or both. Mesos was built at the same time as Google’s Omega. Ben Hindman and the Berkeley AMPlab team worked closely with the team at Google designing Omega so that they both could learn from the lessons of Google’s Borg and build a better nonmonolithic scheduler.

When you evaluate how to manage your data center as a whole, on one side you’ve got Mesos (which can manage all the resources in your data center), and on the other, you have YARN (which can safely manage Hadoop jobs, but is not capable of managing your entire data center). Data center operators tend to solve for these two use cases by partitioning their clusters into Hadoop and non-Hadoop worlds.

Using Mesos and YARN in the same data center, to benefit from both resource managers, currently requires that you create two static partitions. Using both would mean that certain resources would be dedicated to Hadoop for YARN to manage and Mesos would get the rest. It might be oversimplifying it, but that is effectively what we are talking about here. Fundamentally, this is the issue we want to avoid.

Introducing Project Myriad

This leads us to the question: Can we make YARN and Mesos work together? Can we make them work harmoniously for the benefit of the enterprise and the data center? The answer is “yes.” A few well-known companies—eBay, MapR, and Mesosphere—collaborated on a project called Myriad.

This open source software project is both a Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests. When a job comes into YARN, it will schedule it via the Myriad Scheduler, which will match the request to incoming Mesos resource offers. Mesos, in turn, will pass it on to the Mesos worker nodes. The Mesos nodes will then communicate the request to a Myriad executor which is running the YARN node manager. Myriad launches YARN node managers on Mesos resources, which then communicate to the YARN resource manager what resources are available to them. YARN can then consume the resources as it sees fit. As shown in Figure 4-2, Myriad provides a seamless bridge from the pool of resources available in Mesos to the YARN tasks that want those resources.

Figure 4-2. How Myriad works (image courtesy of Mesosphere and MapR, used with permission)

The beauty of this approach is that not only does it allow you to elastically run YARN workloads on a shared cluster, but it actually makes YARN more dynamic and elastic than it was originally designed to be. This approach also makes it easy for a data center operations team to expand resources given to YARN (or, take them away, as the case might be) without ever having to reconfigure the YARN cluster. It becomes very easy to dynamically control your entire data center. This model also provides an easy way to run and manage multiple YARN implementations, even different versions of YARN on the same cluster.

Myriad blends the best of both the YARN and Mesos worlds. By utilizing Myriad, Mesos and YARN can collaborate, and you can achieve an as-it-happens business. Data analytics can be performed in-place on the same hardware that runs your production services. No longer will you face the resource constraints (and low utilization) caused by static partitions. Resources can be elastically reconfigured to meet the demands of the business as it happens (see Figure 4-3).

Figure 4-3. Resource sharing (image courtesy of Mesosphere and MapR, used with permission)

Final Thoughts

To make sure people understand where I am coming from here, I feel that both Mesos and YARN are very good at what they were built to achieve, yet both have room for improvement. Both resource managers can improve in the area of security; security support is paramount to enterprise adoption.

Mesos needs an end-to-end security architecture, and I personally would not draw the line at Kerberos for security support, as my personal experience with it is not what I would call “fun.” The other area for improvement in Mesos—which can be extremely complicated to get right—is what I will characterize as resource revocation and preemption. Imagine the use case where all resources in a business are allocated and then the need arises to have the single most important “thing” that your business depends on run—even if this task only requires minutes of time to complete, you are out of luck if the resources are not available. Resource preemption and/or revocation could solve that problem. There are currently ways around this in Mesos today, but I look forward to the work the Mesos committers are doing to solve this problem with Dynamic Reservations and Optimistic (Revocable) Resources Offers.

Myriad is an enabling technology that can be used to take advantage of leveraging all of the resources in a data center or cloud as a single pool of resources. Myriad enables businesses to tear down the walls between isolated clusters, just as Hadoop enabled businesses to tear down the walls between data silos. With Myriad, developers will be able to focus on the data and applications on which the business depends, while operations will be able to manage compute resources for maximum agility. This opens the door to being able to focus on data instead of constantly worrying about infrastructure. With Myriad, the constraints on the storage network and coordination between compute and data access are the last-mile concern to achieve full flexibility, agility, and scale.

Project Myriad is hosted on GitHub and is available for download. There’s documentation there that provides more in-depth explanations of how it works. You’ll even see some nice diagrams. Go out, explore, and give it a try.

The Truth About MapReduce Performance on SSDs

It is a well-known fact that solid-state drives (SSDs) are fast and expensive. But exactly how much faster—and more expensive—are they than the hard disk drives (HDDs) they’re supposed to replace? And does anything change for big data?

I work on the performance engineering team at Cloudera, a data management vendor. It is my job to understand performance implications across customers and across evolving technology trends. The convergence of SSDs and big data does have the potential to broadly impact future data center architectures. When one of our hardware partners loaned us a number of SSDs with the mandate to “find something interesting,” we jumped on the opportunity. This post shares our findings.

As a starting point, we decided to focus on MapReduce. We chose MapReduce because it enjoys wide deployment across many industry verticals—even as other big data frameworks such as SQL-on-Hadoop, free text search, machine learning, and NoSQL gain prominence.

We considered two scenarios: first, when setting up a new cluster, we explored whether SSDs or HDDs, of equal aggregate bandwidth, are superior; second, we explored how cluster operators should configure SSDs, when upgrading an HDDs-only cluster.

SSDs Versus HDDs of Equal Aggregate Bandwidth

For our measurements, we used the storage configuration in the following table (the machines were Intel Xeon 2-socket, 8-core, 16-thread systems, with 10 GBps Ethernet and 48 GB RA):

Setup Storage Capacity Sequential R/W bandwidth Price
HDD 11 HDDs 22 TB 1300 MBps $4,400
SSD 1 SSD 1.3 TB 1300 MBps $14,000

By their physical design, SSDs avoid the large seek overhead of small, random I/O for HDDs. SSDs do perform much better for shuffle-heavy MapReduce jobs. In the graph shown in Figure 4-4, “terasort” is a common benchmark with 1:1:1 ratio between input:shuffle:output sizes; “shuffle” is a shuffle-only job that we wrote in-house to purposefully stress only the shuffle part of MapReduce. SSDs offer as much as 40% lower job duration, which translates to 70% higher performance. A common but incomplete mental model assumes that MapReduce contains only large, sequential read and writes. MapReduce does exhibit large, sequential I/O when reading input from and writing output to HDFS. The intermediate shuffle stage, in contrast, involves smaller read and writes. The output of each map task is partitioned across many reducers in the job, and each reduce task fetches only its own data. In our customer workloads, this led to each reduce task accessing as little as a few MBs from each map task.

Figure 4-4. Terasort and shuffle job durations, using HDDs and SSDs (graph courtesy of Yanpei Chen)

To our initial surprise, we learned that SSDs also benefit MapReduce jobs that involve only HDFS reads and writes, despite HDDs having the same aggregate sequential bandwidth according to hardware specs. In the graph shown in Figure 4-5, “teragen” writes data to HDFS with three-fold replication, “teravalidate” reads the output of terasort and checks if they are in sorted order, and “hdfs data write” is a job we wrote in-house and writes data to HDFS with single-fold replication. SSDs again offer up to 40% lower job duration, equating to 70% higher performance.

Figure 4-5. Teragen, teravalidate, and HDFS data write job durations using HDDs and SSDs (graph courtesy of Yanpei Chen)

It turns out that our SSDs have an advantage for sequential workloads because they deliver higher sequential I/O size—2x larger than the HDDs in our test setup. To write the same amount of data, SSDs incur half the number of I/Os. This difference may be a vendor-specific characteristic, as other SSDs or HDDs likely offer different default configurations for sequential I/O sizes.

There is another kind of MapReduce job—one that is dominated by compute rather than I/O. When the resource bottleneck is not the I/O subsystem, the choice of storage media makes no difference. In the graph shown in Figure 4-6, “wordcount” is a job that involves high CPU load parsing text and counting word frequencies; “shuffle compressed” is the shuffle-only job from earlier, except with MapReduce shuffle compression enabled. Enabling this configuration shifts load from I/O to CPU. The advantage from SSDs decreases considerably compared with the uncompressed “shuffle” from earlier.

Figure 4-6. Wordcount and shuffle compressed job durations using HDDs and SSDs (graph courtesy of Yanpei Chen)

Ultimately, we learned that SSDs offer considerable performance benefit for some workloads, and at worst do no harm. The decision on whether to use SSDs would then depend on any premium cost to obtain higher performance. We’ll return to that discussion later.

Configuring a Hybrid HDD-SSD Cluster

Almost all existing MapReduce clusters use HDDs. There are two ways to introduce SSDs: (1) buy a new SSD-only cluster, or (2) add SSDs to existing HDD-only machines (some customers may prefer the latter option for cost and logistical reasons). Therefore, we found it meaningful to figure out a good way to configure a hybrid HDD-SSD cluster.

We set up clusters with the following storage configurations:

Setup Storage Capacity Sequential R/W bandwidth Price
HDD-baseline 6 HDDs 12 TB 720 MBps $2,400
HDD-11 11 HDDs 22 TB 1300 MBps $4,400
Hybrid 6 HDDs + 1 SSD 13.3 TB 2020 MBps $16,400

We started with a low-I/O-bandwidth cluster of six HDDs. With default configurations, adding a single SSD leads to higher performance, about the same improvement we get by adding five HDDs. This is an undesirable result, because the single additional SSD has double the bandwidth than the additional five HDDs (see Figure 4-7).

Figure 4-7. Job durations across six HDD clusters (graph courtesy of Yanpei Chen)

A closer look at HDFS and MapReduce implementations reveals a critical insight: Both the HDFS DataNode and the MapReduce NodeManager write to local directories in a round-robin fashion. A typical setup would mount each piece of storage hardware as a separate directory—for example, /mnt/disk-1, /mnt/disk-2, /mnt/ssd-1. With each of these directories mounted as an HDFS and MapReduce local directory, they each receive the same amount of data. Faster progress on the SSD does not accelerate slower progress on the HDDs.

So, to fully utilize the SSD, we need to split the SSD into multiple directories to maintain equal bandwidth per local directory. In our case, SSDs should be split into 10 directories. The SSDs would then receive 10x the data directed at each HDD, written at 10x the speed, and complete in the same amount of time. When the SSD capacity accommodates the 10x data size written, performance is much better than the default setup (see Figure 4-8).

Figure 4-8. Job durations across six hybrid clusters (graph courtesy of Yanpei Chen)

Price Per Performance Versus Price Per Capacity

We found that for our tests and hardware, SSDs delivered up to 70% higher performance, for 2.5x higher $-per-performance (average performance divided by cost). Each customer can decide whether the higher performance is worth the premium cost. This decision employs the $-per-performance metric, which differs from the $-per-capacity metric that storage vendors more frequently track. The SSDs we used hold a 50x premium for $-per-capacity—a gap far larger than the 2.5x premium for $-per-performance.

The primary benefit of SSD is high performance, rather than high capacity. Storage vendors and customers should also consider $-per-performance, and develop architectures to work around capacity constraints.

The following table compares the $-per-performance and $-per-capacity between HDDs and SSDs (we also include some updated data we received from different hardware partners earlier this year; the $-per-performance gap is approaching parity even as the $-per-capacity gap remains wide).

Setup Unit cost Capacity Unit BW US$ per TB US$ per MBps
HDD circa 2013 $400 2 TB 120 MBps 200 (2013 baseline) 3.3 (2013 baseline)
SSD circa 2013 $14,000 1.3 TB 1300 MBps 10,769 (54x 2013 baseline) 10.8 (2.5x 2013 baseline)
HDD circa 2015 $250 4 TB 120 MBps 62.5 (2015 baseline) 2.1 (2015 baseline)
SSD circa 2015 $6,400 2 TB 2000 MBps 3,200 (51x 2015 baseline) 3.2 (1.5x 2015 baseline)

SSD Economics—Exploring the Trade-Offs

Overall, SSD economics involve the interplay between ever-improving software and hardware as well as ever-evolving customer workloads. The precise trade-off between SSDs, HDDs, and memory deserves regular reexamination over time.

We encourage members of the community to extend our work and explore how SSDs benefit SQL-on-Hadoop, free text search, machine learning, NoSQL, and other big data frameworks.

More extended versions of this work appeared on the Cloudera Engineering Blog and at the Large Installation System Administration Conference (LISA) 2014.

Accelerating Big Data Analytics Workloads with Tachyon

As an early adopter of Tachyon, I can testify that it lives up to its description as “a memory-centric distributed storage system, enabling reliable data sharing at memory-speed, across cluster frameworks.” Besides being reliable and having memory-speed, Tachyon also provides a means to expand beyond memory to provide enough storage capacity.

As a senior architect at Baidu USA, I’ve spent the past nine months incorporating Tachyon into Baidu’s big data infrastructure. Since then, we’ve seen a 30-fold increase in speed in our big data analytics workloads. In this post, I’ll share our experiences and the lessons we’ve learned during our journey of adopting and scaling Tachyon.

Creating an Ad-Hoc Query Engine

Baidu is the biggest search engine in China, and the second biggest search engine in the world. Put simply, we have a lot of data. How to manage the scale of this data, and quickly extract meaningful information, has always been a challenge.

To give you an example, product managers at Baidu need to track top queries that are submitted to Baidu daily. They take the top 10 queries, and drill down to find out which province of China contributes the most information to the top queries. Product managers then analyze the resulting data to extract meaningful business intelligence.

Due to the sheer volume of data, however, each query would take tens of minutes, to hours, just to finish—leaving product managers waiting hours before they could enter the next query. Even more frustrating was that modifying a query would require running the whole process all over again. About a year ago, we realized the need for an ad-hoc query engine. To get started, we came up with a high level of specification: The query engine would need to manage petabytes of data and finish 95% of queries within 30 seconds.

From Hive, to Spark SQL, to Tachyon

With this specification in mind, we took a close look at our original query system, which ran on Hive (a query engine on top of Hadoop). Hive was able to handle a large amount of data and provided very high throughput. The problem was that Hive is a batch system and is not suitable for interactive queries.

So, why not just change the engine?

We switched to Spark SQL as our query engine (many use cases have demonstrated its superiority over Hadoop Map Reduce in terms of latency). We were excited and expected Spark SQL to drop the average query time to within a few minutes. Still, it did not quite get us all the way. While Spark SQL did help us achieve a 4-fold increase in the speed of our average query, each query still took around 10 minutes to complete.

So, we took a second look and dug into more details. It turned out that the issue was not CPU—rather, the queries were stressing the network. Since the data was distributed over multiple data centers, there was a high probability that a query would hit a remote data center in order to pull data over to the compute center—this is what caused the biggest delay when a user ran a query. With different hardware specifications for the storage nodes and the compute nodes, the answer was not as simple as moving the compute nodes to the data center. We decided we could solve the problem with a cache layer that buffered the frequently used data, so that most of the queries would hit the cache layer without leaving the data center.

A High-Performance, Reliable Cache Layer

We needed a cache layer that could provide high performance and reliability, and manage a petabyte-scale of data. We developed a query system that used Spark SQL as its compute engine, and Tachyon as a cache layer, and we stress tested for a month. For our test, we used a standard query within Baidu, which pulled 6 TB of data from a remote data center, and then we ran additional analysis on top of the data.

The performance was amazing. With Spark SQL alone, it took 100–150 seconds to finish a query; using Tachyon, where data may hit local or remote Tachyon nodes, it took 10–15 seconds. And if all of the data was stored in Tachyon local nodes, it took about 5 seconds, flat—a 30-fold increase in speed. Based on these results, and the system’s reliability, we built a full system around Tachyon and Spark SQL.

Tachyon and Spark SQL

The anatomy of the system, as shown in Figure 4-9:

Operation manager
A persistent Spark application that wraps Spark SQL. It accepts queries from query UI, and performs query parsing and optimization.
View manager
Manages cache metadata and handles query requests from the operation manager.
Serves as a cache layer in the system and buffers the frequently used data.
Data warehouse
The remote data center that stores the data in HDFS-based systems.
Figure 4-9. Overview of the Tachyon and Spark SQL system (courtesy of Shaoshan Liu)

Now, let’s discuss the physiology of the system:

  1. A query gets submitted. The operation manager analyzes the query and asks the view manager if the data is already in Tachyon.
  2. If the data is already in Tachyon, the operation manager grabs the data from Tachyon and performs the analysis on it.
  3. If data is not in Tachyon, then it is a cache miss, and the operation manager requests data directly from the data warehouse. Meanwhile, the view manager initiates another job to request the same data from the data warehouse and stores the data in Tachyon. This way, the next time the same query gets submitted, the data is already in Tachyon.

How to Get Data from Tachyon

After the Spark SQL query analyzer (Catalyst) performs analysis on the query, it sends a physical plan, which contains HiveTableScan statements (see Figure 4-10). These statements specify the address of the requested data in the data warehouse. The HiveTableScan statements identify the table name, attribute names, and partition keys. Note that the cache metadata is a key/value store, the <table name, partition keys, attribute names> tuple is the key to the cache metadata, and the value is the name of the Tachyon file. So, if the requested <table name, partition keys, attribute names> combination is already in cache, we can simply replace these values in the HiveTableScan with a Tachyon filename, and then the query statement knows to pull data from Tachyon instead of the data warehouse.

Figure 4-10. Physical plan showing HiveTableScan statements (courtesy of Shaoshan Liu)

Performance and Deployment

With the system deployed, we also measured its performance using a typical Baidu query. Using the original Hive system, it took more than 1,000 seconds to finish a typical query (see Figure 4-11). With the Spark SQL-only system, it took 150, and using our new Tachyon + Spark SQL system, it took about 20 seconds. We achieved a 50-fold increase in speed and met the interactive query requirements we set out for the project.

Figure 4-11. Query performance times using Hive, Spark SQL, and Tachyon + Spark SQL systems (image courtesy of Shaoshan Liu)

Today, the system is deployed in a cluster with more than 100 nodes, providing more than two petabytes of cache space, using an advanced feature (tiered storage) in Tachyon. This feature allows us to use memory as the top-level cache, SSD as the second-level cache, and HDD as the last-level cache; with all of these storage mediums combined, we are able to provide two petabytes of storage space.

Problems Encountered in Practice


The first time we used Tachyon, we were shocked—it was not able to cache anything! What we discovered was that Tachyon would only cache a block if the whole block was read into Tachyon. For example, if the block size was 512 MB, and you read 511 MB, the block would not cache. Once we understood the mechanism, we developed a workaround. The Tachyon community is also developing a page-based solution so that the cache granularity is 4 KB instead of the block size.


The second problem we encountered was when we launched a Spark job, the Spark UI told us that the data was node-local, meaning that we should not have to pull data from remote Tachyon nodes. However, when we ran the query, it fetched a lot of data from remote nodes. We expected the local cache hit rate to be 100%, but when the actual hit rate was about 33%, we were puzzled.

Digging into the raw data, we found it was because we used an outdated HDFS InputFormat, meaning that if we requested block 2 for the computation, it would pull a line from block 1 and a line from block 3, even though you didn’t need any data from block 1 or 3. So, if block 2 was in the local Tachyon node, then blocks 1 and 3 may be in remote Tachyon nodes—leading to a local cache hit rate of 33% instead of 100%. Once we updated our InputFormat, this problem was resolved.


Sometimes we would get a SIGBUS error, and the Tachyon process would crash. Not only Tachyon, but Spark had the same problem, too. The Spark community actually has a workaround for this problem that uses fewer memory-mapped files, but that was not the real solution. The root cause of the problem was that we were using Java 6 with the CompressedOOP feature, which compressed 64-bit pointers to 32-bit pointers, to reduce memory usage. However, there was a bug in Java 6 that allowed the compressed pointer to point to an invalid location, leading to a SIGBUS error. We solved this problem by either not using CompressedOOP in Java 6, or simply updating to Java 7.

Time-to-Live Feature and What’s Next for Tachyon

In our next stage of development, we plan to expand Tachyon’s functionalities and performance. For instance, we recently developed a Time-to-Live (TTL) feature in Tachyon. This feature helps us reduce Tachyon cache space usage automatically.

For the cache layer, we only want to keep data for two weeks (since people rarely query data beyond this point). With the TTL feature, Tachyon is able to keep track of the data and delete it once it expires—leaving enough space for fresh data. Also, there are efforts to use Tachyon as a parameter server for deep learning (Adatao is leading the effort in this direction). As memory becomes cheaper, I expect to see the universal usage of Tachyon in the near future.

Get Big Data Now: 2015 Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.