Implementing k-means using H2O over Spark

In this recipe, we'll run the k-means clustering algorithm on a prostate cancer dataset. Download the dataset from https://github.com/ChitturiPadma/datasets/blob/master/prostate.csv. The data comes from a study that examined the correlation between the level of prostate-specific antigen (PSA) and a number of other clinical measures in men.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the following modes: local, standalone, YARN, or Mesos. Include the Spark MLlib package in the build.sbt file so that the related libraries are downloaded and the API can be used. Install Scala and Java, and optionally Hadoop. ...
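As a rough illustration of the build.sbt requirement above, the following is a minimal sketch of how the Spark MLlib dependency might be declared. The project name, the Scala version, and the Spark version are assumptions chosen for illustration, not values given in this recipe; match them to the Spark cluster you are actually running. The H2O dependency is not listed in this excerpt, so it is left out here.

```scala
// build.sbt (sketch; all names and versions below are assumptions)
name := "h2o-kmeans-recipe"          // hypothetical project name

scalaVersion := "2.11.12"            // assumed; must match your Spark build

// Assumed Spark version; replace with the version of your cluster.
val sparkVersion = "2.4.8"

libraryDependencies ++= Seq(
  // Core Spark APIs (SparkContext, RDDs)
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  // Spark MLlib, as required in the Getting ready section
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```

If the application is submitted to a cluster that already ships the Spark JARs, marking these dependencies as "provided" is a common choice that keeps the assembled JAR small.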

