Chapter 6. Big Data in the Cloud

By Edd Dumbill

Big data and cloud technology go hand in hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both “cloud” and “big data” have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what’s practical and what’s to come.

IaaS and Private Clouds

What is often called “cloud” amounts to virtualized servers: computing resources that present themselves as regular servers and are rented according to consumption. This is generally called infrastructure as a service (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, then install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my Big Data Market Survey can be deployed on IaaS.

Using an IaaS cloud doesn’t mean you must handle all deployment manually, which is good news given the clusters of machines that big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. RightScale offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud.
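The division of labor between the two kinds of tool can be illustrated with a small sketch. It is a toy model only, not the API of any real orchestration or configuration product: the `Server` class, the `provision` and `configure` functions, and the package names in `ROLE_PACKAGES` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    """A toy stand-in for one rented virtual server."""
    name: str
    role: str
    packages: List[str] = field(default_factory=list)


# Hypothetical mapping from a server's role to the software it should run.
# The package names are illustrative, not real package identifiers.
ROLE_PACKAGES = {
    "master": ["java", "hadoop-namenode"],
    "worker": ["java", "hadoop-datanode"],
}


def provision(cluster_name: str, workers: int) -> List[Server]:
    """Orchestration step: decide which servers exist and what role each has."""
    servers = [Server(f"{cluster_name}-master", "master")]
    servers += [
        Server(f"{cluster_name}-worker-{i}", "worker")
        for i in range(1, workers + 1)
    ]
    return servers


def configure(server: Server) -> Server:
    """Configuration step: assign the software that matches the server's role."""
    server.packages = list(ROLE_PACKAGES[server.role])
    return server


def build_cluster(cluster_name: str, workers: int) -> List[Server]:
    """Run both steps so that no server is set up by hand."""
    return [configure(s) for s in provision(cluster_name, workers)]


cluster = build_cluster("demo", 3)
for s in cluster:
    print(s.name, s.role, s.packages)
```

Because the cluster is described by a name and a worker count rather than by a sequence of manual steps, growing it means changing one number and running the same two steps again.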

Frameworks such as OpenStack and Eucalyptus aim to present a uniform interface to both private data centers and ...
