Chapter 9. Standing Up a Cluster

Now that you have instances up and running in the cloud provider of your choice, they can be set up to run a Hadoop cluster. If you don’t have instances at the ready and want to follow along, then go back to Chapter 6 for AWS, Chapter 7 for Google Cloud Platform, or Chapter 8 for Azure first, and then return here.

The JDK

Hadoop requires a Java runtime to work, and so Java must be installed on each of your new instances. A good strategy is to use the operating system package management capability already on the instances, e.g., yum on Red Hat Linux, apt on Ubuntu. Cloud providers ensure that these capabilities work within their infrastructures, sometimes even providing local mirrors or gateways to help.

Table 9-1 suggests packages to install for some operating systems. As new versions of Java are released, the package names will change.

Table 9-1. Suggested Java packages

OS                   Package names
Debian or Ubuntu     openjdk-8-jdk or openjdk-7-jdk
Red Hat or CentOS    java-1.8.0-openjdk or java-1.7.0-openjdk
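
If you are scripting the setup of many instances, the choice of package and package manager can be made programmatically. What follows is a minimal sketch, not a definitive installer: it assumes the instance provides /etc/os-release, that its ID field is one of debian, ubuntu, rhel, or centos, and that the Java 8 package names from Table 9-1 are still current. The function names parse_os_release and java_install_command are hypothetical helpers made up for this illustration.

```python
# Java 8 package names from Table 9-1, keyed by the ID field of /etc/os-release.
# Assumption: these names are still valid; they change as new Java versions ship.
JAVA_PACKAGES = {
    "debian": "openjdk-8-jdk",
    "ubuntu": "openjdk-8-jdk",
    "rhel": "java-1.8.0-openjdk",
    "centos": "java-1.8.0-openjdk",
}


def parse_os_release(text):
    """Parse the KEY=value contents of /etc/os-release into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def java_install_command(os_id):
    """Return the package manager command (as an argument list) for os_id."""
    if os_id not in JAVA_PACKAGES:
        raise ValueError("no suggested Java package for OS id: " + os_id)
    package = JAVA_PACKAGES[os_id]
    if os_id in ("debian", "ubuntu"):
        return ["apt-get", "install", "-y", package]
    return ["yum", "install", "-y", package]


if __name__ == "__main__":
    with open("/etc/os-release") as f:
        os_id = parse_os_release(f.read()).get("ID", "")
    # Only print the command; running it requires root privileges.
    print(" ".join(java_install_command(os_id)))
```

The sketch only prints the command rather than running it, so you can review it first or pass it to whatever remote execution tool you use to reach your instances.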

Instead of using a package available natively for your operating system, you can install an Oracle JDK by downloading an installation package directly from Oracle. Since you have root access to your instances, you are free to use whatever means you prefer to install Java.

After you have installed Java, make note of where the Java home directory is (i.e., what the JAVA_HOME environment variable should be set to). You will need to know this ...
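
One way to discover the Java home directory on an instance is to follow the java launcher found on the PATH back to its real location. The sketch below does this under a few assumptions: that java is on the PATH, that it may be a chain of symbolic links (as package managers typically arrange), and that the directory two levels above the resolved binary is an acceptable Java home. The function names java_home_from_binary and find_java_home are hypothetical, chosen for this illustration.

```python
import os
import shutil


def java_home_from_binary(java_path):
    """Return the directory two levels above the java binary.

    For <home>/bin/java this yields <home>, so <home>/bin/java always exists.
    """
    bin_dir = os.path.dirname(java_path)
    return os.path.dirname(bin_dir)


def find_java_home():
    """Locate java on the PATH, resolve symbolic links, and derive its home."""
    java = shutil.which("java")
    if java is None:
        return None  # Java is not installed, or not on the PATH.
    return java_home_from_binary(os.path.realpath(java))


if __name__ == "__main__":
    home = find_java_home()
    if home is None:
        print("java not found on PATH")
    else:
        print("JAVA_HOME=" + home)
```

With some Java 7 and Java 8 packages the launcher resolves to a path under a jre subdirectory, in which case this sketch reports that jre directory. Whether you prefer that directory or its parent depends on how Java was installed, so check the result before relying on it.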
