Before we can run the clustering algorithm on the data, we use the ALS algorithm to obtain numerical features for the users and items (movies, in this case):
- First, we load the u.data file into a DataFrame:
val ratings = spark.sparkContext
  .textFile(DATA_PATH + "/u.data")
  .map(_.split("\t"))
  .map(lineSplit => Rating(lineSplit(0).toInt, lineSplit(1).toInt, lineSplit(2).toFloat, lineSplit(3).toLong))
  .toDF()
- Then we split it in an 80:20 ratio to get the training and test data:
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
- We instantiate the ALS class, setting the maximum number of iterations to 5 and the regularization parameter to 0.01:
val als = new ALS()
  .setMaxIter(5)
  ...
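The loading step above turns each tab-separated line of u.data into a Rating by position. The following is a minimal sketch of that parsing in plain Scala, without Spark, so it can be tried on its own. The field names of Rating (userId, movieId, rating, timestamp), the helper parseRating, and the sample line are assumptions made for illustration; the text above shows only the positional constructor call.

```scala
// Assumed shape of Rating: the field names are hypothetical, chosen to match
// the four tab-separated columns read in the loading step.
final case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

object ParseRatingSketch {
  // Hypothetical helper: applies the same conversions as the map(...) call
  // in the loading step to a single line.
  def parseRating(line: String): Rating = {
    val lineSplit = line.split("\t")
    Rating(lineSplit(0).toInt, lineSplit(1).toInt, lineSplit(2).toFloat, lineSplit(3).toLong)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative sample line in the user/item/rating/timestamp layout.
    val sample = "196\t242\t3\t881250949"
    println(parseRating(sample)) // prints Rating(196,242,3.0,881250949)
  }
}
```

With Spark, the same function could be passed to map over the lines of the file. Note that a malformed line would throw an exception here, so real input may need filtering first.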