Enhancing reduce tasks
Reduce task processing consists of a sequence of three phases. Only the execution of the user-defined reduce function is custom; its duration depends on the amount of data flowing through each phase and on the performance of the underlying Hadoop cluster. Profiling each of these phases helps you identify potential bottlenecks and slow data processing. The following figure shows the three major phases of Reduce tasks:
Let's look at each phase in some detail:
- Profiling the Shuffle phase implies that you need to measure the time taken to transfer the intermediate data from map tasks to the reduce tasks and then merge ...
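As a rough illustration of per-phase profiling, the following Python sketch turns a task's start time and the end time of each phase into per-phase durations and reports the slowest one. It is only a minimal sketch: the functions `phase_durations` and `slowest_phase` are hypothetical helpers, the timestamps are made up, and the labels `phase_2` and `phase_3` are placeholders for the phases that follow the shuffle. In practice, the timestamps would come from whatever your cluster records for each reduce task attempt.

```python
def phase_durations(start_ms, phase_ends):
    """Compute how long each phase took, in milliseconds.

    start_ms   -- time at which the reduce task attempt started
    phase_ends -- ordered list of (phase_name, end_time_ms) pairs,
                  one per phase, in the order the phases ran
    """
    durations = {}
    previous = start_ms
    for name, end_ms in phase_ends:
        if end_ms < previous:
            raise ValueError(f"phase {name!r} ends before it starts")
        durations[name] = end_ms - previous
        previous = end_ms  # the next phase starts where this one ended
    return durations


def slowest_phase(durations):
    """Return the name of the phase with the largest duration."""
    return max(durations, key=durations.get)


# Made-up timestamps for one reduce task attempt (illustrative only).
example = phase_durations(
    1_000,
    [("shuffle", 9_000), ("phase_2", 12_000), ("phase_3", 13_500)],
)
print(example)
print("slowest:", slowest_phase(example))
```

Comparing these durations across many reduce tasks, rather than for a single task, gives a more reliable picture of where the time goes.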