Data Algorithms by Mahmoud Parsian

Chapter 3. Top 10 List

Given a set of (key-as-string, value-as-integer) pairs, say we want to create a top N (where N > 0) list. Top N is a design pattern (recall from Chapter 1 that a design pattern is a language-independent reusable solution to a common problem that enables us to produce reusable code). For example, if key-as-string is a URL and value-as-integer is the number of times that URL is visited, then you might ask: what are the top 10 URLs for last week? This kind of question is common for these types of key-value pairs. Finding a top 10 list is categorized as a filtering pattern (i.e., you filter out data and find the top 10 list). For details on the Top N design pattern, refer to the book MapReduce Design Patterns by Donald Miner and Adam Shook[18].
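As a rough single-machine illustration of the idea (not one of the chapter's MapReduce or Spark solutions), the sketch below picks the N highest-valued pairs from a small in-memory list. The helper name `top_n` and the sample URLs and visit counts are hypothetical, and the sketch assumes that each key appears only once.

```python
import heapq


def top_n(pairs, n=10):
    """Return up to n (key, value) pairs with the largest values, highest first."""
    if n <= 0:
        raise ValueError("n must be greater than 0")
    # heapq.nlargest selects the n largest items without fully sorting the input.
    return heapq.nlargest(n, pairs, key=lambda kv: kv[1])


# Hypothetical sample data: (URL, number of visits) pairs with unique keys.
visits = [
    ("http://example.com/a", 120),
    ("http://example.com/b", 75),
    ("http://example.com/c", 310),
    ("http://example.com/d", 42),
]

top_2 = top_n(visits, n=2)
```

Here `top_2` holds the two most-visited URLs in descending order of visits. When the input has fewer than N pairs, all of them are returned.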

This chapter provides five complete MapReduce solutions for the Top N design pattern and its associated implementations with Apache Hadoop (using classic MapReduce’s map() and reduce() functions) and Apache Spark (using resilient distributed datasets, or RDDs):

  • Top 10 solution in MapReduce/Hadoop. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique.

  • Top 10 solution in Spark. We assume that all input keys are unique. That is, for a given input set {(K, V)}, all Ks are unique. For this solution, we do not use Spark’s sorting functions, such as top() or takeOrdered().

  • Top 10 solution in Spark. We assume that input keys are not unique. That is, for a given input set {(K
