Extracting features from the MovieLens dataset

Before we can run a clustering algorithm on the data, we use the ALS algorithm to obtain numerical features for the users and the items (movies, in this case):

  1. First, we load the u.data file into a DataFrame:
      val ratings = spark.sparkContext
        .textFile(DATA_PATH + "/u.data")
        .map(_.split("\t"))
        .map(lineSplit => Rating(lineSplit(0).toInt,
          lineSplit(1).toInt, lineSplit(2).toFloat,
          lineSplit(3).toLong))
        .toDF()
  2. Then we split it in an 80:20 ratio to get the training and test data:
      val Array(training, test) =
        ratings.randomSplit(Array(0.8, 0.2))
  3. We instantiate the ALS class, setting the maximum number of iterations to 5 and the regularization parameter to 0.01:
      val als = new ALS()
        .setMaxIter(5)
        ...
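To make the goal of these steps concrete, here is a minimal, self-contained sketch of how a fitted ALS model might expose the numerical features that a clustering step would later consume. It is an illustration under stated assumptions, not the book's listing: the Rating case class and its field names (userId, movieId, rating, timestamp), the parseRating helper, the column names passed to the setters, and the handful of made-up rows are all hypothetical. Only the values 5 and 0.01 come from the text above. The sketch fits on the whole tiny dataset and skips the 80:20 split, because a random split of six rows could leave almost nothing to train on.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names are assumptions, not taken from the text.
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

object AlsFeatureSketch {
  // Parse one tab-separated line in the u.data layout:
  // user id, item id, rating, timestamp.
  def parseRating(line: String): Rating = {
    val fields = line.split("\t")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-feature-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the real u.data file.
    val lines = Seq(
      "1\t10\t4\t881250949",
      "1\t20\t2\t881250950",
      "2\t10\t5\t881250951",
      "2\t30\t3\t881250952",
      "3\t20\t1\t881250953",
      "3\t30\t4\t881250954"
    )
    val ratings = lines.map(parseRating).toDF()

    // maxIter = 5 and regParam = 0.01 follow the text;
    // the column names are assumptions tied to the case class above.
    val als = new ALS()
      .setMaxIter(5)
      .setRegParam(0.01)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

    val model = als.fit(ratings)

    // Each row holds an id and a dense "features" vector: one per user, one per movie.
    model.userFactors.show(false)
    model.itemFactors.show(false)

    spark.stop()
  }
}
```

The userFactors and itemFactors DataFrames each pair an id with a features array. Those arrays are the numerical representation that a clustering algorithm can then work on.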
