Datasets often contain duplicate items that must be eliminated to ensure accurate results. In this recipe, we use Hadoop to remove the duplicate mail records in the 20news dataset. These duplicates exist because users cross-posted the same message to multiple newsgroups.
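To make the idea concrete before running anything on a cluster, here is a minimal in-memory sketch of key-based deduplication in the MapReduce style. It is not the recipe's Hadoop job: it assumes that cross-posted copies share the same Message-ID header, and the function names (extract_message_id, map_phase, reduce_phase, deduplicate) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def extract_message_id(message):
    """Return the Message-ID header value, or None if the header is absent."""
    for line in message.splitlines():
        if not line.strip():
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "message-id":
            return value.strip()
    return None


def map_phase(messages):
    """Emit (message_id, message) pairs, skipping messages without an ID."""
    for message in messages:
        key = extract_message_id(message)
        if key is not None:
            yield key, message


def reduce_phase(pairs):
    """Group pairs by key (a stand-in for the shuffle) and keep one message per key."""
    groups = defaultdict(list)
    for key, message in pairs:
        groups[key].append(message)
    # Assumption: any copy is as good as another, so keep the first one seen.
    return {key: values[0] for key, values in groups.items()}


def deduplicate(messages):
    """Run the map and reduce phases over an in-memory list of messages."""
    return reduce_phase(map_phase(messages))
```

Because every copy of a cross-posted message maps to the same key, all copies reach the same reduce call, which is where the duplicates can be collapsed into a single record.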
The following steps show how to remove the duplicate mails caused by cross-posting from the 20news dataset:
$ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
$ tar -xzf 20news-19997.tar.gz