Chapter 3. Top 10 List

Given a set of (key-as-string, value-as-integer) pairs, suppose we want to create a top N list, where N > 0. Top N is a design pattern (recall from Chapter 1 that a design pattern is a language-independent, reusable solution to a common problem). For example, if key-as-string is a URL and value-as-integer is the number of times that URL was visited, then you might ask: what were the top 10 URLs last week? Questions like this are common for key-value pairs of this kind. Finding a top 10 list is categorized as a filtering pattern (i.e., you filter out most of the data and keep only the top 10 entries). For details on the Top N design pattern, refer to the book MapReduce Design Patterns by Donald Miner and Adam Shook[18].
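To make the idea concrete, here is a minimal single-machine sketch in Python (not one of the chapter's Hadoop or Spark solutions). It keeps a min-heap of at most N pairs while scanning the input once. The function name top_n and the sample URLs and counts are hypothetical, chosen only for illustration.

```python
import heapq

def top_n(pairs, n=10):
    """Return the n (key, value) pairs with the largest values, highest first."""
    if n <= 0:
        raise ValueError("n must be positive")
    heap = []  # min-heap of (value, key); the smallest kept entry is heap[0]
    for key, value in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (value, key))
        elif (value, key) > heap[0]:
            # The new pair beats the weakest kept pair, so swap it in.
            heapq.heapreplace(heap, (value, key))
    # Ties on value are broken by key because tuples compare element by element.
    return [(key, value) for value, key in sorted(heap, reverse=True)]

# Hypothetical URL visit counts (illustrative data only).
visits = [("/home", 120), ("/about", 15), ("/docs", 98), ("/blog", 42)]
top2 = top_n(visits, n=2)
```

Because the heap never grows beyond N entries, memory use stays proportional to N rather than to the size of the input.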

This chapter provides five complete MapReduce solutions for the Top N design pattern, implemented with Apache Hadoop (using classic MapReduce’s map() and reduce() functions) and Apache Spark (using resilient distributed datasets, or RDDs):

  • Top 10 solution in MapReduce/Hadoop. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique.

  • Top 10 solution in Spark. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique. For this solution, we do not use Spark’s sorting functions, such as top() or takeOrdered().

  • Top 10 solution in Spark. We assume that input keys are not necessarily unique. That is, for a given input set {(K
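The difference between the unique-key and non-unique-key assumptions can be sketched in plain Python, with lists standing in for data partitions. This is an illustration under stated assumptions, not the chapter's Hadoop or Spark code: the function names are hypothetical, and for repeated keys it assumes that values belonging to the same key should be summed before ranking.

```python
import heapq
from collections import Counter

def local_top_n(partition, n):
    """Top n pairs of one partition, largest value first."""
    return heapq.nlargest(n, partition, key=lambda kv: kv[1])

def top_n_unique(partitions, n=10):
    """Valid only when every key appears once across all partitions:
    take a local top n per partition, then merge the candidates."""
    candidates = [kv for part in partitions for kv in local_top_n(part, n)]
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])

def top_n_non_unique(partitions, n=10):
    """When keys may repeat, combine values per key first (assumed: sum),
    then rank the aggregated totals."""
    totals = Counter()
    for part in partitions:
        for key, value in part:
            totals[key] += value
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

# Illustrative data: key "A" appears in both partitions.
repeated_parts = [[("A", 5), ("B", 9)], [("A", 6), ("C", 7)]]
merged_only = top_n_unique(repeated_parts, n=1)      # ignores that "A" repeats
aggregated = top_n_non_unique(repeated_parts, n=1)   # sums "A" to 11 first
```

With this data, merging local winners alone picks "B", while aggregating first picks "A", which is why the non-unique case needs a per-key combining step before the top N selection.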
