Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

The MovieLens 100k dataset

The MovieLens 100k dataset is a set of 100,000 ratings given by a set of users to a set of movies. It also contains movie metadata and user profiles. Because the dataset is small, you can download it quickly and run Spark code on it without a cluster, which makes it ideal for illustrative purposes.
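To make the shape of a rating record concrete, here is a minimal sketch in plain Python. It assumes the tab-separated layout described in the dataset's README for the ratings file (user id, item id, rating, timestamp). The `Rating` type, the `parse_rating` helper, and the sample records are hypothetical and are used only for illustration; they are not taken from the dataset or from the book.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One rating event: a user scores a movie at a point in time."""
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Parse one tab-separated record (assumed layout) into a Rating."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Made-up sample records in the assumed layout (not real dataset rows).
sample_lines = [
    "1\t10\t4\t880000000",
    "1\t20\t2\t880000100",
    "2\t10\t5\t880000200",
]

ratings = [parse_rating(line) for line in sample_lines]
mean_rating = sum(r.rating for r in ratings) / len(ratings)
print(f"{len(ratings)} ratings, mean {mean_rating:.2f}")
```

In a Spark job, the same parsing function could plausibly be mapped over the lines of the ratings file; it is kept as plain Python here so that it runs without a Spark installation.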

You can download the dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip.

Once you have downloaded the data, unzip it using your terminal:

>unzip ml-100k.zip
  inflating: ml-100k/allbut.pl
  inflating: ml-100k/mku.sh
  inflating: ml-100k/README
  ...
  inflating: ml-100k/ub.base
  inflating: ml-100k/ub.test
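The download and unzip steps above can also be scripted. The following is a rough sketch using only the Python standard library; the helper names `download`, `extract`, and `main` are assumptions made for this example, and the URL is the one given in the text. Nothing runs on import, so no network access happens until `main()` is called.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL quoted in the text above.
DATASET_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"


def download(url: str, dest: Path) -> Path:
    """Fetch url into dest, skipping the download if dest already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, target_dir: Path) -> List[str]:
    """Unzip archive into target_dir and return the sorted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return sorted(zf.namelist())


def main() -> None:
    """Download the archive into the current directory and list its files."""
    archive = download(DATASET_URL, Path("ml-100k.zip"))
    for name in extract(archive, Path(".")):
        print(name)
```

Calling `main()` should leave an `ml-100k` directory next to the archive, matching the result of the `unzip` command.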

This will create a directory called ml-100k. Change into this directory and examine the contents. The ...
