Tuning shuffle, merge, and sort parameters

In a MapReduce job, map task outputs are collected in in-memory buffers inside the task JVM. The size of this buffer determines how much data can be sorted and merged at once. A buffer that is too small causes a large number of spills to disk, which incurs significant overhead. In this section, we will show best practices for configuring the shuffle, merge, and sort parameters.
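As a rough illustration of the kind of settings involved, the following fragment is a minimal sketch of sort-buffer properties as they might appear in mapred-site.xml on a Hadoop 1.x cluster. The property names io.sort.mb, io.sort.factor, and io.sort.spill.percent are standard Hadoop 1.x names, but the values shown are assumptions chosen only for illustration; they are not the recipe's recommended values and should be sized against the heap available to each task JVM.

```xml
<!-- Illustrative values only; tune them for your own cluster. -->
<configuration>
  <!-- Size in MB of the in-memory buffer used to sort map output (default 100). -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Number of streams merged at once while sorting files (default 10). -->
  <property>
    <name>io.sort.factor</name>
    <value>20</value>
  </property>
  <!-- Buffer fill fraction at which a background spill to disk starts (default 0.80). -->
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.80</value>
  </property>
</configuration>
```

Note that io.sort.mb is allocated from the task heap, so raising it generally requires enough headroom in the child JVM's maximum heap size.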

Getting ready

We assume that the Hadoop cluster has been properly configured and all the daemons are running without any issues.

Log in from the Hadoop cluster administrator machine to the cluster master node using the following command:

ssh hduser@master

Note

In this recipe, we assume that all configuration changes are made to the $HADOOP_HOME/conf/mapred-site.xml ...
