Chapter 2. Getting Started

Let’s install Hadoop and Hive on our personal workstation. This is a convenient way to learn and experiment with Hadoop. Then we’ll discuss how to configure Hive for use on Hadoop clusters.

If you already use Amazon Web Services, the fastest path to setting up Hive for learning is to run a Hive-configured job flow on Amazon Elastic MapReduce (EMR). We discuss this option in Chapter 21.

If you have access to a Hadoop cluster with Hive already installed, we encourage you to skim the first part of this chapter and pick up again at the section “What Is Inside Hive?”

Installing a Preconfigured Virtual Machine

There are several ways to install Hadoop and Hive. An easy way to install a complete Hadoop system, including Hive, is to download a preconfigured virtual machine (VM) that runs in VMware[11] or VirtualBox[12]. For VMware, you can use either VMware Player for Windows and Linux (free) or VMware Fusion for Mac OS X (inexpensive). VirtualBox is free for all of these platforms, as well as Solaris.

The virtual machines use Linux as the operating system, which is currently the only recommended operating system for running Hadoop in production.[13]


Using a virtual machine is currently the only way to run Hadoop on Windows systems, even when Cygwin or similar Unix-like software is installed.

Most of the preconfigured VMs available are designed only for VMware, but if you prefer VirtualBox you may find instructions on the Web that explain how to import a particular ...
