Chapter 6. Diagnosing and tuning performance problems
- Measuring and visualizing MapReduce execution times
- Optimizing the shuffle and sort phases
- Improving performance with user-space MapReduce best practices
Imagine you’ve written a new piece of MapReduce code and you’re executing it on your shiny new cluster. You’re surprised to find that, despite having a good-sized cluster, your job is running significantly longer than you expected. You’ve clearly hit a performance problem with your job, but how do you figure out where the problem lies?
One of Hadoop’s selling points when it comes to performance is that it scales horizontally. This means that adding nodes tends to yield a linear increase in throughput, and often in job execution ...