A single Map task may output many key-value pairs with the same key, causing Hadoop to shuffle (move) all of those values over the network to the Reduce tasks and incurring significant overhead. For example, in the previous WordCount MapReduce program, when a Mapper encounters multiple occurrences of the same word within a single Map task, the map function outputs many <word,1> intermediate key-value pairs, all of which must be transmitted over the network. We can optimize this scenario by summing all the <word,1> pairs for each word into a single <word,count> pair before sending the data across the network to the Reducers.
To support such optimizations, Hadoop provides a special function called a combiner ...
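To make the savings concrete, here is a minimal, self-contained Java sketch (not actual Hadoop code; the class and method names are illustrative) that simulates a combiner's local aggregation: it generates the raw <word,1> pairs a map function would emit, then sums them per key before the "shuffle", so fewer pairs need to cross the network.

```java
import java.util.*;

public class CombinerSketch {
    // Simulate one Map task's raw output: one <word,1> pair per occurrence.
    static List<Map.Entry<String, Integer>> mapOutput(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Combiner-style local aggregation: sum values per key on the Map side,
    // producing one <word,count> pair per distinct word.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "sat", "on", "the", "mat", "the"};
        List<Map.Entry<String, Integer>> raw = mapOutput(words);
        Map<String, Integer> combined = combine(raw);
        // 7 raw pairs shrink to 5 combined pairs; "the" is summed to 3.
        System.out.println("pairs shuffled: " + raw.size() + " -> " + combined.size());
        System.out.println("the = " + combined.get("the"));
    }
}
```

In a real Hadoop job, the combiner is registered with `Job.setCombinerClass(...)`; when the reduce function is commutative and associative (as summing counts is), the Reducer class itself is often reused as the combiner.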