Hadoop in the cloud

Making a case for Big-Data-as-a-Service.

By Thomas Phelan and Joel Baxter
March 17, 2016
Sun and clouds. (source: Teodoro S Gruhl on PublicDomainPictures)

Hadoop and other big data application frameworks (such as Spark) were designed and built around the assumption that distributed parallel processing and the minimization of storage and network latencies are key to maximizing data query performance over large data sets.

This assumption constrains the architecture and deployment of big data infrastructure. For example, since its inception, Hadoop has treated the co-location of storage and compute as essential for good performance.

To the cloud?

These requirements would seem to indicate that cloud-based infrastructure is ill-suited for running big data workloads. Indeed, the majority of early Hadoop implementations were run on bare-metal servers with directly attached storage, in an on-premises environment. This is the traditional deployment model and remains the “conventional wisdom” for many in the Hadoop industry.

Cloud computing implies shared infrastructure resources, and cloud-based infrastructure provides value through resource abstraction. The physical distance between compute and storage resources in a cloud environment may be large and/or unknown. It is possible to force co-location and make a cloud-based deployment look more like a physical one, but this would come at additional cost and with a loss of flexibility.

So, is it possible to run a high-performance big data framework (like Hadoop or Spark) in a cloud-based environment? The short answer is yes.

The case for Big-Data-as-a-Service

Cloud-based options for deploying Hadoop have been available for several years. Multiple providers have introduced Hadoop-as-a-Service, Spark-as-a-Service, and other similar offerings. A new category of Big-Data-as-a-Service (BDaaS) solutions has emerged. Most of these solutions today are being delivered in a public cloud service off-premises. And now, there are also BDaaS solutions available for on-premises deployment.

Over the past few years, we’ve seen an increase in organizations that are developing, testing, and deploying big data applications in cloud environments. The value proposition includes the traditional cloud benefits of cost reduction (especially upfront CapEx), elasticity, flexibility, and agility. In addition, big data applications and tools change on a seemingly daily basis; a cloud-based deployment can help reduce the challenge and complexity of keeping skills up-to-date with these technologies.

However, these advantages are offset by the often-cited cloud challenges of:

  • System availability
  • Performance
  • Data security

Our upcoming session at Strata + Hadoop World in San Jose will focus on the deployment of big data applications (including Hadoop and Spark, as well as other big data frameworks and tools) in cloud-based BDaaS environments—whether in a public cloud or on-premises, leveraging virtualization and container technology. In that context, let’s examine the three cloud-related challenges identified above.

Evaluating Big-Data-as-a-Service

The potential challenges around system availability for big data applications in a cloud-based deployment are the least concerning. Availability was once a major concern for cloud environments, but it is no longer as significant an issue. For a given cost, cloud-based deployments are at least as reliable as on-site, bare-metal deployments.

The challenges of performance for big data in a cloud-based deployment can be broken down into two factors: compute performance and storage access performance.

Historically, virtualization exacted a tax on CPU performance in exchange for its benefits of flexibility and scalability. But with the advent of new virtualization techniques, such as containers, this compute tax has been largely eliminated. CPU performance need not be a barrier to the adoption of BDaaS, whether in a public cloud or on-premises.

As for storage access, the introduction of fast networks, new storage technologies (like SSDs), and new techniques (like object stores) requires a re-evaluation of the old Hadoop mandate for co-location of compute and storage. Numerous studies (such as Microsoft’s research into Flat Datacenter Storage or the AMPLab analysis of logs from Facebook’s data center) now challenge this conventional wisdom and show that such co-location is often not necessary for high performance with distributed applications.
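The reasoning behind the storage argument can be illustrated with a back-of-the-envelope calculation. The Python sketch below is our own illustration, not taken from the studies cited above: the function names and all throughput figures (a 150 MB/s disk, 1 and 10 Gbit/s network links) are hypothetical assumptions. It treats a remote read as limited by the slower of the disk and the network link, and it ignores latency, contention, and parallelism.

```python
def effective_read_mb_s(disk_mb_s, network_gbit_s=None):
    """Return the assumed read throughput in MB/s.

    A local read is limited by the disk alone. A remote read is
    limited by whichever is slower: the disk or the network link.
    """
    if network_gbit_s is None:
        return disk_mb_s
    network_mb_s = network_gbit_s * 1000 / 8  # Gbit/s -> MB/s
    return min(disk_mb_s, network_mb_s)


def scan_seconds(data_gb, disk_mb_s, network_gbit_s=None):
    """Estimate the time in seconds to scan data_gb gigabytes."""
    return data_gb * 1000 / effective_read_mb_s(disk_mb_s, network_gbit_s)


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    data_gb = 100
    disk_mb_s = 150
    print("local disk:      ", round(scan_seconds(data_gb, disk_mb_s)), "s")
    print("remote, 1 Gbit/s:", round(scan_seconds(data_gb, disk_mb_s, 1)), "s")
    print("remote, 10 Gbit/s:", round(scan_seconds(data_gb, disk_mb_s, 10)), "s")
```

Under these assumptions, the 1 Gbit/s link is the bottleneck for remote reads, while at 10 Gbit/s the disk itself becomes the limit and the remote scan takes as long as the local one. Real clusters are more complicated, but the sketch shows why faster networks weaken the case for strict co-location.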

Challenges in data security with BDaaS are also being addressed. The levels of data security offered in public cloud environments vary by service provider, but in general, they continue to improve. However, some organizations have regulatory, privacy, or operational requirements (e.g., the complexity and cost of moving petabytes of data) that prevent them from deploying their big data applications in a public cloud service. For these organizations, the emerging option to implement BDaaS in an on-premises environment provides additional control and security, while minimizing data movement.

Determining which big data deployment model is best suited to a given use case or organization is less a matter of technology than of trade-offs (e.g., security, cost, performance). Whether the deployment is on bare metal or using containers, in a public cloud or on-premises, there are multiple requirements and multiple options to consider. We’ll explore these requirements, options, and trade-offs in our session at Strata + Hadoop World in San Jose.
