Chapter 6. Diagnosing and tuning performance problems


In this chapter
  • Measuring and visualizing MapReduce execution times
  • Optimizing the shuffle and sort phases
  • Improving performance with user-space MapReduce best practices


Imagine you wrote a new piece of MapReduce code and you’re executing it on your shiny new cluster. You’re surprised to learn that despite having a good-size cluster, your job is running significantly longer than you expected. You’ve obviously hit a performance issue with your job, but how do you figure out where the problem lies?

One of Hadoop’s selling points when it comes to performance is that it scales horizontally. This means that adding nodes tends to yield a linear increase in throughput, and often in job execution ...
