Apache Hadoop, as software, is a simple framework that allows for distributed processing of data across many machines. As a technology, Hadoop and the surrounding ecosystem have changed the way we think about data processing at scale. No longer does our data need to fit in the memory of a single machine, nor are we limited by the I/O of a single machine’s disks. These are powerful ideas.
So too has cloud computing changed our way of thinking. While the notion of colocating machines in a faraway data center isn’t new, allowing users to provision machines on demand is, and it’s changed everything. No longer are developers or architects limited by the processing power installed in on-premises data centers, nor do we need to host small web farms under our desks or in that old storage closet. The pay-as-you-go model has been a boon for ad hoc testing and proof-of-concept efforts, eliminating time spent on purchasing, installation, and setup.
Both Hadoop and cloud computing represent major paradigm shifts, not just in enterprise computing but in many other industries as well. Much has been written about how these technologies have been used to make advances in retail, the public sector, manufacturing, energy, and healthcare, just to name a few. Entire businesses have sprung up as a result, dedicated to the care, feeding, integration, and optimization of these new systems.
It was inevitable that Hadoop workloads would be run on cloud providers’ infrastructure. The cloud offers incredible flexibility to users, often complementing on-premises solutions and enabling them to use Hadoop in ways that simply weren’t possible before.
Ever the conscientious software engineer, author Bill Havanki has a strong penchant for documenting. He’s able to break down complex concepts and explain them in simple terms, without making you feel foolish. Bill writes the kind of documentation that you actually enjoy, the kind you find yourself reading long after you’ve discovered the solution to your original problem.
Hadoop and cloud computing are powerful and valuable tools, but they aren’t simple technologies by any means. This stuff is hard. Both have a multitude of configuration options, and it’s very easy to become overwhelmed. All major cloud providers offer similar services—virtual machines, network-attached storage, relational databases, and object storage, all of which Hadoop can use—but each provider has its own naming conventions, capabilities, and limitations. For example, some providers require that resources be provisioned in a specific order. Some providers automatically create isolated virtual networks for your machines, while others require manual creation and assignment. It can be confusing. Whether you’re working with Hadoop for the first time or you’re a veteran installing it on a cloud provider you’ve never used before, knowing the specifics of each environment will save you a lot of time and pain.
Cloud computing appeals to a dizzying array of users running a wide variety of workloads, and most cloud providers’ official documentation isn’t specific to any particular application (such as Hadoop). Running Hadoop on cloud infrastructure introduces additional architectural issues that need to be considered and addressed. It helps to have a guide that demystifies the options specific to Hadoop deployments and eases you through the setup process on a variety of cloud providers, step by step, offering tips and best practices along the way. This book does precisely that, in a way I wish had been available when I started working in the cloud computing world.
Whether it’s code or expository prose, Bill’s creations are approachable, sensible, and easy to consume. With this book and its author, you’re in capable hands for your first foray into moving Hadoop to the cloud.