
Clojure for Data Science by Henry Garner

Running k-means clustering with Mahout

Now that we have a sequence file of vectors suitable for consumption by Mahout, it's time to actually run k-means clustering on the whole dataset. Unlike our local Incanter version, Mahout won't have any trouble dealing with the full Reuters corpus.

As with the SequenceFilesFromDirectory class, we've created a wrapper around another of Mahout's command-line programs, KMeansDriver. The Clojure variable names make it easier to see what each command-line argument is for.

(defn run-kmeans [in-path clusters-path out-path k]
  (let [distance-measure  "org.apache.mahout.common.distance.CosineDistanceMeasure"
        max-iterations    100
        convergence-delta 0.001]
    (KMeansDriver/main
     (->> (vector "-i" in-path
                  "-c" clusters-path
                  "-o" ...
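The listing above is cut off before the argument vector is complete, so the rest of the call is not reproduced here. What the visible part implies is that a Java main method expects a String array, so the mixed Clojure values (paths, integers, doubles) have to be stringified and packed into an array before being handed over. The following is a minimal sketch of that conversion step only. The helper name ->main-args and the example paths are hypothetical and do not come from the book, and only the "-i", "-c" and "-o" flags shown in the excerpt are used.

```clojure
;; Hypothetical helper (not from the book): convert a collection of mixed
;; Clojure values into the String[] that a Java `main` method expects.
(defn ->main-args
  [args]
  (->> args
       (map str)              ; numbers and other values become strings
       (into-array String)))  ; pack them into a Java String[]

;; Placeholder paths, for illustration only.
(def example-args
  (->main-args ["-i" "data/vectors"
                "-c" "data/clusters"
                "-o" "data/output"]))

;; Numeric settings are stringified in the same way.
(vec (->main-args [100 0.001]))
;; => ["100" "0.001"]
```

An array built this way could then be passed to a static main method such as KMeansDriver/main, assuming Mahout is on the classpath. Keeping the conversion in one small function means the let-bound settings can stay as ordinary Clojure numbers until the last moment.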
