62 | Big Data Simplied
3.7 HADOOP IN THE CLOUD
There are several ways to implement a Hadoop cluster in the cloud. This approach is worthwhile
when you need a large Hadoop cluster immediately but only temporarily: provisioning the Hadoop
infrastructure in the cloud, and turning it off when it is no longer needed, is much faster than
setting up a cluster on premises.
Amazon has an offering specifically for this purpose, called Elastic MapReduce (EMR). The word
‘elastic’ indicates that larger or smaller clusters can be built as needed. Amazon’s EMR service
provides access not only to Amazon’s own distribution of Hadoop, but also to the distributions
available from MapR. Amazon’s EMR is probably the most common approach to implementing Hadoop
in the cloud. Its browser-based interface lets you stay away from the command line, but a
command line interface is available as well.
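The provision-then-turn-off cycle described above can also be scripted. The following is a minimal sketch, assuming the third-party boto3 AWS SDK is installed and AWS credentials are configured. The instance type, release label, region and default role names are illustrative assumptions and not values from this book; the helper function names are hypothetical.

```python
def build_emr_request(name, core_nodes, instance_type="m5.xlarge",
                      release_label="emr-6.15.0"):
    """Build the keyword arguments for boto3's EMR run_job_flow call.

    instance_type and release_label are assumed example values.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                # 'Elastic': choose a larger or smaller core group as needed
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running until it is terminated explicitly
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed: the default EMR roles already exist in the account
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Start the cluster and return its ID (requires boto3 and credentials)."""
    import boto3  # imported here so the request builder works without it
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


def terminate_cluster(cluster_id, region="us-east-1"):
    """Turn the cluster off once it is no longer needed."""
    import boto3
    emr = boto3.client("emr", region_name=region)
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

A typical session would call build_emr_request("temp-cluster", core_nodes=4), pass the result to launch_cluster, run the workload, and then call terminate_cluster with the returned ID. Only the last two functions contact AWS and incur cost.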
Interesting Facts
Maintaining a Hadoop cluster poses numerous challenges, including cluster monitoring, resource
management, efficiently bringing a cluster and all its services up or down, tuning those services,
and monitoring jobs. These processes can be automated with a third-party vendor tool layered on
top of the core Apache Hadoop distribution. Such a bundle is called a Hadoop distribution package.
A Hadoop distribution package also makes it easier to implement security, with Kerberos
authentication and access control at the file and schema levels.
Recently, two major Hadoop distribution vendors, Cloudera and Hortonworks, merged. All
commercial Hadoop distribution packages have their own certification programs in both Hadoop
development and Hadoop administration.
Summary
• Configure different types of Hadoop clusters, such as Sandbox, Development, UAT and Prod,
as per business needs and the volume of data. An enterprise Hadoop cluster is designed
around a few key points: the volume and characteristics of the data, the number of
developers, the requirements of the different Hadoop tools, the required throughput of
the output (i.e., the speed of the data), etc. These parameters help determine the best
possible hardware requirements for a Hadoop cluster, such as hard drive size, memory (RAM)
and number of cores per machine.
• In a Hadoop cluster, the NameNode and Resource Manager (RM) processes support HA
(High Availability), so they remain available even if one instance fails. This frees
Hadoop of the ‘single point of failure’ tag attached to earlier versions (Hadoop 0.x
and 1.x), and the availability of the main Hadoop services is greatly increased.
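The first summary point, on deriving hardware requirements from the volume of data, can be illustrated with a rough storage-based estimate. This is only a sketch under stated assumptions: the replication factor of 3 is the HDFS default, while the disk capacity per node, the usable fraction and the growth factor are hypothetical example values. A real sizing exercise would also weigh memory, cores and throughput.

```python
import math


def estimate_data_nodes(raw_data_tb, replication=3, disk_per_node_tb=48.0,
                        usable_fraction=0.75, growth_factor=1.0):
    """Estimate how many DataNodes are needed to hold the data.

    raw_data_tb      -- volume of data before replication, in TB
    replication      -- HDFS replication factor (3 is the HDFS default)
    disk_per_node_tb -- assumed raw disk capacity per machine, in TB
    usable_fraction  -- assumed share of disk left for HDFS blocks after
                        reserving space for the OS, logs and temp data
    growth_factor    -- assumed multiplier for expected data growth
    """
    if raw_data_tb < 0 or disk_per_node_tb <= 0:
        raise ValueError("data volume and disk size must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")

    required_tb = raw_data_tb * replication * growth_factor
    usable_per_node_tb = disk_per_node_tb * usable_fraction
    # Round up, and never report fewer than one node
    return max(1, math.ceil(required_tb / usable_per_node_tb))


if __name__ == "__main__":
    # 100 TB of raw data with the default assumptions above
    print(estimate_data_nodes(100))
```

With these assumptions, 100 TB of raw data needs 300 TB of replicated storage, and each node offers 36 TB of usable space, so the estimate rounds up to 9 nodes.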