Chapter 3. Introduction to Big Data and Data Science

The popular use of the term big data can be traced to a single research paper published in 2004: “MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat. In this 13-page paper (including source code), two engineers at Google explained how the company had found a way to bring its gigantic indexing needs down to reasonable processing requirements through a radically new type of algorithm running on massively parallel clusters. The basic idea of MapReduce is to break work into mappers that can run in parallel and reducers that take the output of the mappers and process it. The first operation is called “mapping” because it takes each element of the input data and “maps” a function onto it, leaving the output for the reducer to handle.
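The names come from functional programming, where map applies a function to every element of a collection and reduce folds the mapped results into a single value. As a minimal, single-machine sketch of the idea in Python (the squaring function and the data here are illustrative only; a real MapReduce system distributes each phase across cluster nodes):

```python
from functools import reduce

data = [1, 2, 3, 4]

# "Map" phase: apply a function to every input element independently,
# which is why the elements can be processed in parallel.
mapped = map(lambda x: x * x, data)  # -> 1, 4, 9, 16

# "Reduce" phase: combine all the mapper outputs into a single result.
total = reduce(lambda acc, x: acc + x, mapped, 0)

print(total)  # 30
```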

For example, to count the words in all the documents on all the nodes in a cluster, assuming each document is stored on a single node, we can have thousands of mappers, running in parallel, each produce a list of its node's documents with their word counts and send that list to the reducer. The reducer then creates a master list of all the documents with their word counts and calculates the total word count by adding the counts for all the documents together (Figure 3-1). Assuming that the disk is much slower than the network, and that a mapper reading documents is much slower than sending a total to the reducer, this program would scale very nicely to a large cluster without any perceptible loss of performance: adding nodes adds disk bandwidth for the mappers, while the reducer only ever receives small per-document totals.
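As a concrete illustration, here is a minimal, single-process sketch of that word count in Python. The document contents and the mapper and reducer names are hypothetical stand-ins; in a real MapReduce system each mapper would run on the node holding its documents, and the framework would ship the mapper outputs to the reducer.

```python
# A single-process sketch of the word-count flow from Figure 3-1.
# The "cluster" is just an in-memory dict, purely for illustration.

# Hypothetical input: document name -> document text, one document per node.
documents = {
    "doc1.txt": "big data is big",
    "doc2.txt": "data lakes hold raw data",
}

def mapper(name, text):
    """Runs independently for each document: emit (document, word count)."""
    return name, len(text.split())

def reducer(mapped_pairs):
    """Builds the master list of per-document counts and the grand total."""
    per_document = dict(mapped_pairs)
    return per_document, sum(per_document.values())

# The mappers could all run in parallel; the reducer sees only their
# small (name, count) outputs, never the documents themselves.
counts, total = reducer(mapper(name, text) for name, text in documents.items())
print(counts)  # {'doc1.txt': 4, 'doc2.txt': 5}
print(total)   # 9
```

Because each mapper sends only a single small pair to the reducer, the network traffic stays negligible no matter how many documents the cluster holds, which is precisely the property that lets the pattern scale.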
