Python MapReduce

Hadoop can be downloaded and installed from https://hadoop.apache.org/. We'll be using the Hadoop streaming API to execute our Python MapReduce program in Hadoop. The Hadoop Streaming API helps in using any program that has a standard input and output as a MapReduce program.

We'll be writing three MapReduce programs using Python, they are as follows:

  • A basic word count
  • Getting the sentiment Score of each review
  • Getting the overall sentiment score from all the reviews

The basic word count

We'll start with the word count MapReduce. Save the following code in a word_mapper.py file:

import sys
for l in sys.stdin:
    # Trailing and Leading white space is removed
    l = l.strip()

    # words in the line is split
    word_tokens = l.split()

 # Key Value ...

Get Mastering Python for Data Science now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.