Chapter 8. MapReduce Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. This chapter looks at the MapReduce model in detail, and in particular at how data in various formats, from simple text to structured binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2) reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1
and V1
)
are different from the map output types (K2
and V2
).
However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3
and V3
).
The Java API mirrors this general form:
public
class
Mapper
<
KEYIN
,
VALUEIN
,
KEYOUT
,
VALUEOUT
>
{
public
class
Context
extends
MapContext
<
KEYIN
,
VALUEIN
,
KEYOUT
,
VALUEOUT
>
{
// ...
}
protected
void
map
(
KEYIN
key
,
VALUEIN
value
,
Context
context
)
throws
IOException
,
InterruptedException
{
// ...
}
}
public
class
Reducer
<
KEYIN
,
VALUEIN
,
KEYOUT
,
VALUEOUT
>
{
public
class
Context
extends
ReducerContext
<
KEYIN
,
VALUEIN
,
KEYOUT
,
VALUEOUT
>
{
// ...
}
protected
void
reduce
(
KEYIN
key
,
Iterable
<
VALUEIN
>
values
,
Context
context
)
throws
IOException
,
InterruptedException
{
// ...
}
}
The context objects are used for emitting key-value pairs, and they
are parameterized by the output types so that the signature of the
write()
method is:
public
void
write
(
KEYOUT
key ...
Get Hadoop: The Definitive Guide, 4th Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.