Chapter 3. Top 10 List
Given a set of (key-as-string, value-as-integer) pairs, say we want to create a top N (where N > 0) list. Top N is a design pattern (recall from Chapter 1 that a design pattern is a language-independent reusable solution to a common problem that enables us to produce reusable code). For example, if key-as-string is a URL and value-as-integer is the number of times that URL is visited, then you might ask: what are the top 10 URLs for last week? This kind of question is common for these types of key-value pairs. Finding a top 10 list is categorized as a filtering pattern (i.e., you filter out data and find the top 10 list). For details on the Top N design pattern, refer to the book MapReduce Design Patterns by Donald Miner and Adam Shook[18].
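As a rough single-machine illustration of the filtering idea (not one of this chapter's MapReduce or Spark solutions), the following Python sketch keeps only the N largest pairs while scanning the input once. The function name top_n, the sample URLs, and the visit counts are hypothetical, and the sketch assumes that all keys are unique.

```python
import heapq


def top_n(pairs, n=10):
    """Return the n (key, value) pairs with the largest values, highest first.

    Assumes every key appears at most once in `pairs`.
    """
    if n <= 0:
        raise ValueError("n must be greater than 0")

    # Min-heap of (value, key) tuples; it never holds more than n entries,
    # so heap[0] is always the smallest of the current top-n candidates.
    heap = []
    for key, value in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (value, key))
        elif (value, key) > heap[0]:
            # The new pair beats the weakest candidate, so swap it in.
            heapq.heapreplace(heap, (value, key))

    # Sort the survivors from largest to smallest value.
    return [(key, value) for value, key in sorted(heap, reverse=True)]


# Hypothetical input: (URL, number of visits) pairs.
visits = [("url-a", 120), ("url-b", 45), ("url-c", 300), ("url-d", 7)]
top_two = top_n(visits, n=2)
```

Because the heap holds at most N entries, memory use stays proportional to N rather than to the size of the input. That property is what makes it plausible to compute a local top N on each partition of a large data set and then merge the partial lists.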
This chapter provides five complete MapReduce solutions for the Top N design pattern and its associated implementations with Apache Hadoop (using classic MapReduce’s map() and reduce() functions) and Apache Spark (using resilient distributed data sets):
- Top 10 solution in MapReduce/Hadoop. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique.
- Top 10 solution in Spark. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique. For this solution, we do not use Spark’s sorting functions, such as top() or takeOrdered().
- Top 10 solution in Spark. We assume that all input keys are not unique. That is, for a given input set {(K