Processing Data with Map Reduce

Hadoop Map Reduce is a system for the parallel processing of very large data sets, using distributed, fault-tolerant storage across very large clusters. The input data set is split into chunks (whose size is configurable), which serve as the inputs to the Map functions. The Map functions filter and transform these chunks on the Hadoop cluster's data nodes. The framework then shuffles and sorts the Map output and delivers it to the Reduce functions, which summarize the grouped data to produce the final output.

This chapter explores Map Reduce programming through multiple implementations of a simple but flexible word-count ...
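Before turning to Hadoop itself, the map/shuffle/reduce flow described above can be sketched in plain Python, with word count as the example. This is a framework-free illustration of the three phases, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen here for clarity and do not correspond to any Hadoop API.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in an input chunk.
    for word in chunk.split():
        yield (word.lower(), 1)

def shuffle(mapped):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: summarize the grouped values for a single key.
    return (key, sum(values))

# Each string stands in for one input chunk handed to a mapper.
chunks = ["the quick brown fox", "the lazy dog the end"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # "the" appears three times across both chunks
```

In a real Hadoop job the same division of labor applies, but the chunks are HDFS blocks, the mappers and reducers run on different data nodes, and the shuffle moves data across the network.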
