Chapter 6. Hadoop MapReduce
You write MapReduce jobs in Java when you need low-level control over your big data pipeline and want to optimize or streamline it. Using MapReduce directly is not required, but it is rewarding: the system and its API are beautifully designed, and learning the basics will get you very far, very quickly. Before you embark on writing a customized MapReduce job, however, don’t overlook the fact that tools such as Apache Drill let you run standard SQL queries against data stored in Hadoop.
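Before turning to the Hadoop API itself, the paradigm is worth sketching in plain Java: a map phase emits (word, 1) pairs for each input record, a shuffle groups the pairs by key, and a reduce phase sums each group. The toy, single-JVM version below (no Hadoop dependencies; the class name WordCountSketch is just for illustration) mirrors what a real job does across a cluster:

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: emit (token, 1) for each word in each input line.
    // Shuffle + reduce phase: group by token and sum the 1s.
    // Here all three phases collapse into a single in-memory pass.
    public static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                       // each "mapper" input record
            for (String token : line.toLowerCase().split("\\s+")) {
                if (token.isEmpty()) continue;
                counts.merge(token, 1, Integer::sum);     // "reducer": sum values per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "to be or not to be" };
        System.out.println(wordCount(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```

In a real Hadoop job the map and reduce steps run in separate JVMs on separate nodes, and the shuffle moves data over the network, but the logical contract is exactly this: map emits key–value pairs, reduce folds all values that share a key.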
This chapter assumes you have a running Hadoop Distributed File System (HDFS) on your local machine or access to a Hadoop cluster. To simulate how a real MapReduce job runs, you can run Hadoop in pseudodistributed mode on a single node, either your localhost or a remote machine. Considering how much CPU, RAM, and storage we can fit in one box (even a laptop) these days, you can, in essence, create a mini supercomputer capable of running fairly massive distributed jobs. You can get quite far on your localhost with a subset of the data and then scale up to a full cluster when your application is ready.
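As a sketch, pseudodistributed mode amounts to pointing Hadoop's default filesystem at a single HDFS daemon running on localhost. A minimal core-site.xml looks like the following (port 9000 is the conventional default; adjust the path and port to match your installation):

```xml
<!-- etc/hadoop/core-site.xml: run against a single local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Because there is only one node, you will also typically set dfs.replication to 1 in hdfs-site.xml so HDFS does not try to replicate blocks to nonexistent peers.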
If the Hadoop client is properly installed, you can get a complete listing of all available Hadoop operations by typing the hadoop command with no arguments:
bash$ hadoop
Hadoop Distributed File System
Apache Hadoop comes with a command-line tool useful for accessing the Hadoop
filesystem and launching MapReduce jobs. The filesystem access command
fs is invoked as follows:
bash$ hadoop fs <command> <args>
The command can be any of a number of UNIX-like file operations, such as -ls, -mkdir, -put, and -cat.
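For example, a typical session copies a local file into HDFS and inspects it (the paths and filename here are hypothetical, and these commands require a running HDFS):

```shell
bash$ hadoop fs -mkdir -p /user/data        # create a directory in HDFS
bash$ hadoop fs -put mydata.txt /user/data  # copy a local file into HDFS
bash$ hadoop fs -ls /user/data              # list the directory contents
bash$ hadoop fs -cat /user/data/mydata.txt  # print the file to stdout
```

Note that these commands operate on the distributed filesystem configured in core-site.xml, not on your local disk; a plain ls will not show files you have -put into HDFS.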