Time for action – handling dirty data by using skip mode

Let's see skip mode in action by writing a MapReduce job and feeding it data that causes it to fail:

Save the following Ruby script as gendata.rb:

File.open("skipdata.txt", "w") do |file|
  3.times do
    500000.times{file.write("A valid record\n")}
    5.times{file.write("skiptext\n")}
  end
  500000.times{file.write("A valid record\n")}
end
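
To see where the bad records end up in the generated file, the following is a minimal sketch that computes their 1-based line ranges. The helper name `bad_record_ranges` and its parameters are hypothetical and not part of the original script; the defaults simply mirror the block sizes used in gendata.rb.

```ruby
# Illustrative only: computes the 1-based line ranges occupied by the
# "skiptext" records, assuming the same block sizes as gendata.rb.
def bad_record_ranges(blocks: 3, valid_per_block: 500_000, bad_per_block: 5)
  ranges = []
  line = 0
  blocks.times do
    line += valid_per_block                      # skip over the valid records
    ranges << ((line + 1)..(line + bad_per_block))
    line += bad_per_block                        # move past the bad records
  end
  ranges
end

bad_record_ranges.each { |r| puts "#{r.first}-#{r.last}" }
```

Each range marks a small cluster of bad records surrounded by a large run of valid ones.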

Run the script:
```
$ ruby gendata.rb 
```

Check the size of the generated file and its number of lines:

$ ls -lh skipdata.txt
-rw-rw-r-- 1 hadoop hadoop 29M 2011-12-17 01:53 skipdata.txt
$ cat skipdata.txt | wc -l
2000015
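
The figures above can be cross-checked with a little arithmetic. This is a sketch only: the helper `expected_totals` is a hypothetical name, and it assumes single-byte characters with Unix newlines, as the script writes them.

```ruby
# Illustrative check of the line count and file size reported above.
# Assumes the block sizes from gendata.rb and "\n" line endings.
def expected_totals(blocks: 3, valid_per_block: 500_000, bad_per_block: 5)
  valid = (blocks + 1) * valid_per_block  # three blocks plus the trailing run
  bad   = blocks * bad_per_block
  bytes = valid * "A valid record\n".bytesize + bad * "skiptext\n".bytesize
  { valid: valid, bad: bad, lines: valid + bad, bytes: bytes }
end

totals = expected_totals
puts "lines: #{totals[:lines]}"   # should match the wc -l output
puts "bytes: #{totals[:bytes]}"   # roughly 28.6 MiB, which ls -lh shows as 29M
```

Only 15 of the roughly two million records are bad.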

Copy the file onto HDFS:

$ hadoop fs -put skipdata.txt skipdata.txt

Add the following property definition to mapred-site.xml:

<property> <name>mapred.skip.map.max.skip.records</name> ...

Hadoop: Data Processing and Modelling by Garry Turkington, Tanmay Deshpande, Sandeep Karanth
