Chapter 4. Reductions in Spark
This chapter focuses on reduction transformations on RDDs in Spark. In particular, we’ll work with RDDs of (key, value) pairs, a common data abstraction required by many Spark operations. Some initial ETL may be needed to get your data into (key, value) form, but once you have a pair RDD you can perform any desired aggregation over each key’s set of values.
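As a minimal sketch of that ETL step, the following PySpark snippet parses raw text lines into a pair RDD. The input file name, the "key,value" record format, and the to_pair() helper are hypothetical, chosen only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pair-rdd-etl").getOrCreate()
sc = spark.sparkContext

def to_pair(line):
    # Hypothetical record format: "key,value" (e.g., "A,10").
    key, value = line.split(",")
    return (key, int(value))

# ETL step: raw text lines -> RDD[(K, V)]
pairs = sc.textFile("sample_input.txt").map(to_pair)
```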
Spark supports several powerful reduction transformations and actions. The most important reduction transformations are:
- reduceByKey()
- combineByKey()
- groupByKey()
- aggregateByKey()
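To make the four transformations concrete, here is a sketch that computes a per-key sum four different ways over a small in-memory pair RDD. The sample data is illustrative, not from the text:

```python
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("byKey-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1), ("A", 2), ("B", 3), ("B", 4)])

# reduceByKey(): merge values per key with a commutative, associative function
sums1 = pairs.reduceByKey(add)

# groupByKey(): collect all values per key, then reduce them yourself
sums2 = pairs.groupByKey().mapValues(sum)

# aggregateByKey(): a zero value plus within- and cross-partition functions
sums3 = pairs.aggregateByKey(0, add, add)

# combineByKey(): the most general form, taking createCombiner,
# mergeValue, and mergeCombiners functions
sums4 = pairs.combineByKey(lambda v: v, add, add)

# All four yield [('A', 3), ('B', 7)]
print(sorted(sums1.collect()))
```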
All of the *ByKey() transformations accept a source RDD[(K, V)] and create a target RDD[(K, C)] (for some transformations, such as reduceByKey(), V and C are the same). The function of these transformations is to reduce all the values of a given key (for all unique keys) by finding, for example:
- The average of all values
- The sum and count of all values
- The mode and median of all values
- The standard deviation of all values
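For instance, computing the per-key average shows the RDD[(K, V)] to RDD[(K, C)] pattern in the case where C differs from V: here V is a number and C is a (sum, count) pair. A minimal sketch, with illustrative sample data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-by-key").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1.0), ("A", 3.0), ("B", 10.0)])

# V is a float; C is a (sum, count) tuple.
sum_count = pairs.aggregateByKey(
    (0.0, 0),                                      # zero value for C
    lambda c, v: (c[0] + v, c[1] + 1),             # fold a V into a C (within a partition)
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1])  # merge two Cs (across partitions)
)

averages = sum_count.mapValues(lambda c: c[0] / c[1])
print(sorted(averages.collect()))  # [('A', 2.0), ('B', 10.0)]
```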
Reduction Transformation Selection
As with mapper transformations, it’s important to select the right tool for the job. For some reduction operations (such as finding the median), the reducer needs access to all the values at the same time. For others, such as finding the sum or count of all values, it doesn’t. If you want to find the median of values per key, then groupByKey() is a good choice, but this transformation does not perform well when a key has a large number of values ...
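Since the median requires all of a key’s values at once, a minimal sketch of the groupByKey() approach might look like the following (sample data illustrative). Note the caveat above: every value for a key is materialized, which can exhaust memory for high-frequency keys:

```python
import statistics
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("median-by-key").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1), ("A", 5), ("A", 9), ("B", 2), ("B", 4)])

# groupByKey() gathers every value for a key, which the median needs,
# but which is costly when a key has very many values.
medians = pairs.groupByKey().mapValues(lambda vs: statistics.median(vs))
print(sorted(medians.collect()))  # [('A', 5), ('B', 3.0)]
```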