Running k-means clustering with Mahout

Now that we have a sequence file of vectors suitable for consumption by Mahout, we can run k-means clustering on the whole dataset. Unlike our local Incanter version, Mahout will have no trouble handling the full Reuters corpus.

As with the SequenceFilesFromDirectory class, we've created a wrapper around another of Mahout's command-line programs, KMeansDriver. The Clojure variable names make it easier to see what each command-line argument is for.

(defn run-kmeans [in-path clusters-path out-path k]
  (let [distance-measure  "org.apache.mahout.common.distance.CosineDistanceMeasure"
        max-iterations    100
        convergence-delta 0.001]
    (KMeansDriver/main
     (->> (vector "-i" in-path
                  "-c" clusters-path
                  "-o" ...
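The run-kmeans wrapper passes KMeansDriver/main a flat sequence of flags and values. A Java main method takes a String array, so numeric settings such as max-iterations and convergence-delta have to be converted to strings first. The following is a minimal sketch of that general pattern, not the book's elided code: the helper name ->cli-args, the example paths, and the "--hypothetical-flag" names are assumptions made for illustration, and only "-i", "-c", and "-o" appear in the excerpt above.

```clojure
;; Hypothetical helper (not from the book): turn a mixed sequence of
;; flags and values into the String[] that a Java main method expects.
(defn ->cli-args
  [& flags-and-values]
  (->> flags-and-values
       (map str)              ; 100 becomes "100", 0.001 becomes "0.001"
       (into-array String)))  ; typed String[] rather than Object[]

;; Example paths and the "--hypothetical-flag-*" names are placeholders.
(def example-args
  (->cli-args "-i" "data/vectors"
              "-c" "data/clusters"
              "-o" "data/out"
              "--hypothetical-flag-a" 100
              "--hypothetical-flag-b" 0.001))

;; Convert back to a vector to inspect the contents at the REPL.
(vec example-args)
```

Building the array with into-array String matters because Clojure would otherwise create an Object array, which cannot be passed where a String[] parameter is required.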
