Running k-means clustering with Mahout

Now that we have a sequence file of vectors suitable for consumption by Mahout, it's time to actually run k-means clustering on the whole dataset. Unlike our local Incanter version, Mahout won't have any trouble dealing with the full Reuters corpus.

As with the SequenceFilesFromDirectory class, we've created a wrapper around another of Mahout's command-line programs, KMeansDriver. The Clojure variable names make it easier to see what each command-line argument is for.
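The general pattern of calling a Java main method from Clojure can be sketched in isolation. A main method expects a String array, so any numeric settings have to be converted to strings first. The helper name to-string-args and the flag --hypothetical-count below are assumptions made purely for illustration; they are not part of Mahout or of the book's code.

```clojure
;; A minimal sketch, not the book's implementation.
;; to-string-args is a hypothetical helper: it turns a mixed
;; sequence of flags and values into a Java String array,
;; which is the argument type a Java main method expects.
(defn to-string-args [args]
  (into-array String (map str args)))

;; "-i" appears in the listing; "--hypothetical-count" is a
;; made-up flag used only to show a number being converted.
(def example-args
  (to-string-args ["-i" "data/vectors" "--hypothetical-count" 100]))

;; Convert back to a Clojure vector to inspect the contents.
(vec example-args)
```

Keeping the settings as ordinary Clojure values until the last moment means they can be named and bound in a let form, and only the final array needs to be built from strings.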

(defn run-kmeans [in-path clusters-path out-path k]
  (let [distance-measure  "org.apache.mahout.common.distance.CosineDistanceMeasure"
        max-iterations    100
        convergence-delta 0.001]
    (KMeansDriver/main
     (->> (vector "-i" in-path
                  "-c" clusters-path
                  "-o" ...
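To make the choice of distance measure more concrete, cosine distance can be sketched in plain Clojure as one minus the cosine of the angle between two vectors. This is a simplified illustration on dense Clojure vectors, not Mahout's own code, and the function names dot, norm and cosine-distance are our own. It assumes both vectors have the same length and that neither is all zeros.

```clojure
;; A minimal sketch of cosine distance on dense vectors.
;; Assumes equal-length inputs with at least one nonzero element each;
;; an all-zero vector would produce NaN here.
(defn dot
  "Sum of the element-wise products of a and b."
  [a b]
  (reduce + (map * a b)))

(defn norm
  "Euclidean length of a."
  [a]
  (Math/sqrt (dot a a)))

(defn cosine-distance
  "One minus the cosine similarity of a and b."
  [a b]
  (- 1.0 (/ (dot a b) (* (norm a) (norm b)))))

;; Vectors at right angles are maximally distant for non-negative data.
(cosine-distance [1 0] [0 1]) ; => 1.0
```

Because the measure depends only on the angle between vectors and not on their lengths, two vectors pointing in the same direction, such as [1 2] and [2 4], have a distance of approximately zero whatever their magnitudes.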
