Chapter 21. Hive and Amazon Web Services (AWS)

Mark Grover

One of the services that Amazon provides as a part of Amazon Web Services (AWS) is Elastic MapReduce (EMR). With EMR comes the ability to spin up a cluster of nodes on demand. These clusters come with Hadoop and Hive installed and configured. (You can also configure the clusters with Pig and other tools.) You can then run your Hive queries and terminate the cluster when you are done, only paying for the time you used the cluster. This section describes how to use Elastic MapReduce, some best practices, and wraps up with pros and cons of using EMR versus other options.

You may wish to refer to the online AWS documentation available at http://aws.amazon.com/elasticmapreduce/ while reading this chapter. This chapter won’t cover all the details of using Amazon EMR with Hive. It is designed to provide an overview and discuss some practical details.

Why Elastic MapReduce?

Small teams and start-ups often don’t have the resources to set up their own cluster. An in-house cluster is a fixed cost of initial investment. It requires effort to set up and servers and switches as well as maintaining a Hadoop and Hive installation.

On the other hand, Elastic MapReduce comes with a variable cost, plus the installation and maintenance is Amazon’s responsibility. This is a huge benefit for teams that can’t or don’t want to invest in their own clusters, and even for larger teams that need a test bed to try out new tools and ideas without affecting ...

Get Programming Hive now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.