Chapter 10. Content-Based Recommendation: Movies

Have you ever wondered how Netflix creates movie recommendations for its users, or how Amazon recommends books? There must be some kind of magic algorithm behind these recommendations, right? Netflix even offered a $1 million prize for improving its movie recommendations[20]. Content-based recommendation systems, such as those used by Netflix and Amazon, examine the properties of items (such as movies) in order to make recommendations to users. For example, if a user has watched a lot of action movies, then the recommendation system will suggest other movies in that category.
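The action-movie example can be sketched in a few lines of Python. This is a minimal illustration only, not how Netflix or Amazon actually work: the catalog, the movie titles, and the function name recommend_by_genre are all hypothetical, and a single genre label stands in for the richer item properties a real system would examine.

```python
from collections import Counter

# Hypothetical catalog mapping movie title -> genre (illustrative data only).
CATALOG = {
    "Movie A": "action",
    "Movie B": "action",
    "Movie C": "comedy",
    "Movie D": "action",
    "Movie E": "drama",
}


def recommend_by_genre(watched, catalog, limit=3):
    """Suggest unwatched movies from the genre the user has watched most."""
    # Count how many watched movies fall into each genre.
    genre_counts = Counter(catalog[title] for title in watched if title in catalog)
    if not genre_counts:
        return []  # no viewing history, so nothing to base a suggestion on
    top_genre, _ = genre_counts.most_common(1)[0]
    seen = set(watched)
    # Candidates share the top genre and have not been watched yet.
    candidates = sorted(
        title
        for title, genre in catalog.items()
        if genre == top_genre and title not in seen
    )
    return candidates[:limit]


if __name__ == "__main__":
    print(recommend_by_genre(["Movie A", "Movie B", "Movie C"], CATALOG))
```

A user who has watched two action movies and one comedy is offered the remaining action title. The only "content" used here is the genre; adding more item properties would refine the suggestions at the cost of a more involved matching step.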

This chapter presents a basic MapReduce content-based recommendation solution, based on Edwin Chen’s blog[6]. Suppose you run an online movie business and want to generate movie recommendations. You have a rating system (people can rate movies from 1 to 5 stars), and we’ll assume for simplicity’s sake that all of the ratings are stored in TSV (tab-separated values) files in HDFS. After presenting a generic MapReduce solution, I’ll provide a concrete Spark implementation for movie recommendations.
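To make the input concrete, here is a small sketch of reading such rating records and aggregating them in a map-then-reduce style, written in plain Python rather than Hadoop or Spark. The column order (user, movie, rating) and the function names map_rating and reduce_ratings are assumptions made for illustration; the text above says only that the ratings are tab-separated. The per-movie count and average computed here are a simple stand-in, not necessarily the aggregation the chapter's solution uses.

```python
from collections import defaultdict


def map_rating(line):
    """Map step: parse one TSV line into a (movie, rating) pair.

    Assumes the hypothetical column order: user, movie, rating.
    """
    user, movie, rating = line.rstrip("\n").split("\t")
    rating = int(rating)
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range 1..5: {rating}")
    return movie, rating


def reduce_ratings(pairs):
    """Reduce step: group ratings by movie and return (count, average)."""
    grouped = defaultdict(list)
    for movie, rating in pairs:
        grouped[movie].append(rating)
    return {
        movie: (len(ratings), sum(ratings) / len(ratings))
        for movie, ratings in grouped.items()
    }


if __name__ == "__main__":
    # Illustrative records only; real input would be read from files in HDFS.
    lines = ["u1\tm1\t4", "u2\tm1\t2", "u1\tm2\t5"]
    stats = reduce_ratings(map_rating(line) for line in lines)
    for movie, (count, average) in sorted(stats.items()):
        print(movie, count, average)
```

Keeping the parsing in the map step and the grouping in the reduce step mirrors the shape of a MapReduce job, so the same two functions could later be moved into a distributed framework with little change.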

Note that in content-based recommendation systems, the more information (such as domain knowledge and metadata) we have about the content, the more complex the algorithms become (as more variables are involved), but the recommendations become more accurate and reasonable. For example, for movie recommendations ...
