Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.
Content preview from Data Algorithms

Chapter 2. Secondary Sort: A Detailed Example

The MapReduce framework sorts the input to reducers by key, but the values for each key arrive at the reducer in arbitrary order. This means that if all mappers generate the following key-value pairs for key = K:

  • (K, V1), (K, V2), ..., (K, Vn)

then all these values {V1, V2, ..., Vn} will be processed by a single reducer (for key = K), but there will be no order (ascending or descending) between instances of Vi. As you learned in Chapter 1, Secondary Sort is a design pattern we can use to apply an order (such as “ascending sort” or “descending sort”) to the values. How do we accomplish this? Say we want to apply some order to the reducer values:

  • S1 ≤ S2 ≤ ... ≤ Sn

or:

  • S1 ≥ S2 ≥ ... ≥ Sn

where Si ∈ {V1, V2, ..., Vn} for i = 1, 2, ..., n. Note that each Vi might be a simple data type, such as String or Integer, or a tuple (that is, a composite object holding more than a single value).

There are two ways to sort reducer values:

Solution #1

Buffer reducer values in memory, then sort. If the number of values per reducer is small enough to fit in memory, this solution will work. If the number of values per reducer is large, they might not fit in memory, and this is not the preferred solution. The implementation is simple; it is presented in Chapter 1 and will not be discussed in this chapter.
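
As a rough illustration of Solution #1, the following Python sketch simulates the shuffle and a reducer outside of any MapReduce framework. The function names (`group_by_key`, `reduce_with_in_memory_sort`) and the sample data are hypothetical and chosen only for this example; it is not the Chapter 1 implementation.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Simulate the shuffle: collect the values emitted for each key.

    Values are kept in arrival order, which stands in for the
    arbitrary order a real framework would deliver.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_with_in_memory_sort(key, values, descending=False):
    """Solution #1: buffer every value for the key, then sort the buffer."""
    buffered = list(values)            # the whole value list must fit in memory
    buffered.sort(reverse=descending)  # ascending by default
    return key, buffered


# Hypothetical mapper output for two keys.
mapper_output = [("K", 7), ("K", 2), ("J", 5), ("K", 9), ("J", 1)]

grouped = group_by_key(mapper_output)
results = dict(reduce_with_in_memory_sort(k, vs) for k, vs in grouped.items())
```

The `list(values)` call is where the memory cost shows up: the reducer cannot emit anything until it has materialized every value for the key.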

Solution #2

Use the Secondary Sort design pattern of the MapReduce framework, and reducer values will arrive sorted (i.e., there’s no need to sort values ...
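
The excerpt is cut off before it explains the mechanics, so the sketch below rests on an assumption: it uses the commonly described composite-key form of Secondary Sort, in which the value is folded into the key, records are partitioned by the natural key only, the framework sorts by the full composite key, and the reducer groups by the natural key. It is a plain Python simulation, not Hadoop or Spark code, and `partition_for` and `secondary_sort` are hypothetical names.

```python
import zlib
from itertools import groupby


def partition_for(natural_key, num_reducers):
    """Route by natural key only, so all (K, *) records reach one reducer."""
    return zlib.crc32(natural_key.encode("utf-8")) % num_reducers


def secondary_sort(pairs, num_reducers=2):
    """Simulate Secondary Sort with a composite key (natural_key, value)."""
    # Map side: build the composite key and pick a partition per record.
    partitions = [[] for _ in range(num_reducers)]
    for natural_key, value in pairs:
        composite_key = (natural_key, value)
        target = partition_for(natural_key, num_reducers)
        partitions[target].append((composite_key, value))

    output = {}
    for records in partitions:
        # Framework side: sort each partition by the full composite key.
        records.sort(key=lambda record: record[0])
        # Reduce side: group by natural key; values stream in already sorted,
        # so the reducer itself performs no sorting.
        for natural_key, group in groupby(records, key=lambda record: record[0][0]):
            output[natural_key] = [value for _, value in group]
    return output


# Hypothetical mapper output for two keys.
mapper_output = [("K", 7), ("K", 2), ("J", 5), ("K", 9), ("J", 1)]
sorted_values = secondary_sort(mapper_output)
```

In this simulation the ordering work moves from the reducer to the sort step that the framework already performs, which is the trade the pattern is meant to make.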


ISBN: 9781491906170