
44 Large Scale and Big Data
Hadoop* is an open-source Java library [131] that supports data-intensive distributed
applications by implementing the MapReduce framework.† It has been widely used in
production by a large number of companies.‡
At the implementation level, the map invocations of a MapReduce job are
distributed across multiple machines by automatically partitioning the input data into
a set of M splits. The input splits can be processed in parallel by different machines.
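
The splitting step described above can be illustrated with a minimal sketch. It is not Hadoop's implementation: the names make_splits, map_split, and run_map_phase are hypothetical, a thread pool stands in for the separate machines, and the word-count map function is assumed purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, m):
    """Partition the input records into m contiguous splits of near-equal size."""
    size, extra = divmod(len(records), m)
    splits, start = [], 0
    for i in range(m):
        # The first `extra` splits take one additional record each.
        end = start + size + (1 if i < extra else 0)
        splits.append(records[start:end])
        start = end
    return splits


def map_split(split):
    """Hypothetical map task: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]


def run_map_phase(records, m):
    """Run one map task per split; worker threads stand in for machines."""
    splits = make_splits(records, m)
    with ThreadPoolExecutor(max_workers=m) as pool:
        # pool.map preserves split order, so result i belongs to split i.
        return list(pool.map(map_split, splits))


if __name__ == "__main__":
    lines = ["a b", "c", "a", "b b", "c a"]
    for i, pairs in enumerate(run_map_phase(lines, 2)):
        print(i, pairs)
```

Because each map task reads only its own split, the tasks share no state and can run in any order, which is what allows the M splits to be processed in parallel.
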
Reduce invocations are distributed by partitioning the intermediate key space into
R pieces using a partitioning function ( ...