O'Reilly logo

Hadoop MapReduce v2 Cookbook - Second Edition by Thilina Gunarathne

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

De-duplicating data using Hadoop streaming

Often, the datasets contain duplicate items that need to be eliminated to ensure the accuracy of the results. In this recipe, we use Hadoop to remove the duplicate mail records in the 20news dataset. These duplicate records are due to the users cross-posting the same message to multiple newsboards.

Getting ready

  • Make sure Python is installed on your Hadoop compute nodes.

How to do it...

The following steps show how to remove duplicate mails due to cross-posting across the lists, from the 20news dataset:

  1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:
    $ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
    $ tar –xzf 20news-19997.tar.gz
    
  2. Upload the ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required