Map Reduce program to find distinct values

In this recipe, we will learn how to write a MapReduce program that finds the distinct values in a given set of data.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it

Sometimes the data you have contains duplicate values. In SQL, the DISTINCT keyword removes duplicates so that each value is returned only once. In this recipe, we are going to take a look at how we can get distinct values using a MapReduce program.

Let's consider a use case where we have some user data that contains two columns: userId and username. Let's assume that this data contains duplicate records, ...
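The idea behind a distinct-values job can be sketched without a cluster. The following is a minimal plain-Java simulation, not the recipe's own code: it assumes each record is a comma-separated line such as `1,alice`, and the class name `DistinctSketch`, its method names, and the sample data are all hypothetical. The map step emits the whole record as the key, the shuffle step groups identical keys (sorted, as Hadoop sorts keys before the reduce phase), and the reduce step writes each key exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a distinct-values MapReduce job (no Hadoop dependency).
// All names here are illustrative assumptions, not part of any Hadoop API.
class DistinctSketch {

    // Map phase: emit the whole record as the key; the value carries no information.
    static List<String> map(List<String> records) {
        List<String> keys = new ArrayList<>();
        for (String record : records) {
            String trimmed = record.trim();
            if (!trimmed.isEmpty()) {   // skip blank lines
                keys.add(trimmed);
            }
        }
        return keys;
    }

    // Shuffle phase: group identical keys together, sorted by key.
    // The count per key is kept only to mimic the list of values a reducer receives.
    static Map<String, Integer> shuffle(List<String> keys) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.merge(key, 1, Integer::sum);
        }
        return grouped;
    }

    // Reduce phase: each group is seen once, so emit its key once and ignore the values.
    static List<String> reduce(Map<String, Integer> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    // Runs the three phases in order.
    static List<String> distinct(List<String> records) {
        return reduce(shuffle(map(records)));
    }

    public static void main(String[] args) {
        // Hypothetical sample data in userId,username form, with duplicates.
        List<String> input = Arrays.asList(
                "1,alice", "2,bob", "1,alice", "3,carol", "2,bob");
        for (String line : distinct(input)) {
            System.out.println(line);
        }
    }
}
```

In an actual Hadoop job, a common way to express the same pattern is a mapper that writes each input line as a `Text` key with a `NullWritable` value, and a reducer that writes each key it receives once. Deduplication then falls out of the framework's grouping by key rather than from any explicit comparison in user code.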
