Plugging into a MapReduce framework

For background on the Apache Hadoop server, see https://hadoop.apache.org. Here's the summary:

The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

One part of the Hadoop distributed processing is the MapReduce module. This module allows us to decompose analysis of data into two complementary operations: map and reduce. These operations are distributed around the Hadoop cluster to be run concurrently. A map operation processes all of the rows of datasets that are scattered around ...

Get Python Essentials now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.