Chapter 3. Top 10 List

Given a set of (key-as-string, value-as-integer) pairs, suppose we want to create a top N list (where N > 0). Top N is a design pattern (recall from Chapter 1 that a design pattern is a language-independent, reusable solution to a common problem). For example, if the key is a URL and the value is the number of times that URL was visited, you might ask: what were the top 10 URLs last week? This kind of question is common for key-value pairs of this type. Finding a top 10 list is categorized as a filtering pattern (i.e., you filter out most of the data and keep only the top 10 entries). For details on the Top N design pattern, refer to the book MapReduce Design Patterns by Donald Miner and Adam Shook [18].
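To make the pattern concrete before turning to distributed implementations, here is a minimal single-machine sketch in Python. It is only an illustration of what "top N" means for (key, value) pairs, not one of the chapter's solutions; the function name top_n and the sample URLs and counts are hypothetical.

```python
import heapq

def top_n(pairs, n=10):
    """Return the n (key, value) pairs with the largest values, largest first.

    Hypothetical helper for illustration; assumes keys are unique.
    """
    if n <= 0:
        raise ValueError("n must be greater than 0")
    # heapq.nlargest keeps at most n candidates in memory at a time,
    # so it avoids sorting the whole input.
    return heapq.nlargest(n, pairs, key=lambda kv: kv[1])

# Made-up sample data: (URL, number of visits).
visits = [("url-a", 120), ("url-b", 45), ("url-c", 300), ("url-d", 7)]
top2 = top_n(visits, n=2)
print(top2)
```

Keeping only N candidates at a time is what makes this a filtering pattern: the output size is bounded by N regardless of how large the input is.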

This chapter provides five complete MapReduce solutions for the Top N design pattern and its associated implementations with Apache Hadoop (using classic MapReduce’s map() and reduce() functions) and Apache Spark (using resilient distributed datasets, or RDDs):

Top 10 solution in MapReduce/Hadoop. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique.
Top 10 solution in Spark. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique. For this solution, we do not use Spark’s sorting functions, such as top() or takeOrdered().
Top 10 solution in Spark. We assume that all input keys are not unique. That is, for a given input set {(K
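The difference between the unique-key and non-unique-key assumptions can be sketched in plain Python, with lists standing in for data partitions. This is an illustrative assumption about how such solutions are commonly structured, not the chapter's Hadoop or Spark code: with unique keys, each partition can compute a local top N and the local lists can be merged; with repeated keys, values for the same key must be combined first (summed here, which is an assumption that suits visit counts). All function names are hypothetical.

```python
import heapq
from collections import Counter

def local_top_n(partition, n):
    """Top n (key, value) pairs within a single partition."""
    return heapq.nlargest(n, partition, key=lambda kv: kv[1])

def top_n_unique_keys(partitions, n=10):
    """Unique keys: take a local top n per partition, then merge.

    Any pair in the global top n must also be in its own partition's
    top n, so merging the local lists loses nothing.
    """
    candidates = []
    for partition in partitions:
        candidates.extend(local_top_n(partition, n))
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])

def top_n_non_unique_keys(partitions, n=10):
    """Non-unique keys: aggregate per key first, then take the top n.

    Summing is an assumption made for this sketch; a local top n taken
    before aggregation could discard a key whose total is large.
    """
    totals = Counter()
    for partition in partitions:
        for key, value in partition:
            totals[key] += value
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

# Made-up sample partitions.
unique = [[("a", 5), ("b", 9)], [("c", 7), ("d", 1)]]
repeated = [[("a", 5), ("b", 9)], [("a", 6), ("c", 7)]]
print(top_n_unique_keys(unique, n=2))
print(top_n_non_unique_keys(repeated, n=2))
```

In the second data set, key "a" is not the largest entry in either partition, yet its combined value is the largest overall, which is why aggregation has to come before filtering when keys repeat.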

Data Algorithms by Mahmoud Parsian
