Apache Mahout Clustering Designs by Ashish Gupta

Performance tuning for the job

A close look at a Mahout job shows that it can create CPU and network bottlenecks. Distance computation and vectorization are CPU-bound activities, while transmitting centroids to the reducer is a network-bound activity. By monitoring the job's CPU, network, disk, and other resource usage, these pitfalls can be identified and avoided.

Mahout offers different vector representations of data, such as dense vectors and sparse vectors. By definition, a dense vector stores an explicit zero for every non-existing element. So, if the data is very sparse, a dense vector will serialize many zeros unnecessarily and slow down the job. So, in this case, it is better to ...
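
To make the serialization cost concrete, here is a minimal sketch in plain Java that compares how many bytes a mostly-zero row occupies in a dense layout and in a sparse layout. It deliberately uses no Mahout classes: the class name `VectorSizeSketch`, both methods, and the two byte layouts are assumptions made for illustration, and they are not Mahout's actual on-disk or wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: hypothetical layouts, not Mahout's real vector serialization.
public class VectorSizeSketch {

    // Dense layout: the cardinality, then every element, zeros included.
    public static int denseSerializedBytes(double[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Sparse layout: the cardinality, the non-zero count,
    // then one (index, value) pair per non-zero element.
    public static int sparseSerializedBytes(double[] values) {
        int nonZeros = 0;
        for (double v : values) {
            if (v != 0.0) {
                nonZeros++;
            }
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(values.length);
            out.writeInt(nonZeros);
            for (int i = 0; i < values.length; i++) {
                if (values[i] != 0.0) {
                    out.writeInt(i);
                    out.writeDouble(values[i]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        // A 10,000-element row with only three non-zero entries.
        double[] row = new double[10_000];
        row[3] = 1.5;
        row[4_200] = 2.0;
        row[9_999] = 0.25;

        System.out.println("dense bytes:  " + denseSerializedBytes(row));
        System.out.println("sparse bytes: " + sparseSerializedBytes(row));
    }
}
```

In this sketch the sparse layout pays an extra 4-byte index per stored element, so it only wins when most elements are zero; for a row with no zeros, the dense layout is the smaller one.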
