Time for action – handling dirty data by using skip mode
Let's see skip mode in action by writing a MapReduce job and feeding it data that causes it to fail:
- Save the following Ruby script as gendata.rb:
    File.open("skipdata.txt", "w") do |file|
      3.times do
        500000.times { file.write("A valid record\n") }
        5.times { file.write("skiptext\n") }
      end
      500000.times { file.write("A valid record\n") }
    end
- Run the script:
$ ruby gendata.rb
- Check the size of the generated file and its number of lines:
$ ls -lh skipdata.txt
-rw-rw-r-- 1 hadoop hadoop 29M 2011-12-17 01:53 skipdata.txt
$ cat skipdata.txt | wc -l
2000015
- Copy the file onto HDFS:
$ hadoop fs -put skipdata.txt skipdata.txt
- Add the following property definition to mapred-site.xml:
    <property>
    <name>mapred.skip.map.max.skip.records</name>
    ...
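
The figures reported when checking the generated file can be cross-checked against the layout that gendata.rb writes. The following is a minimal sketch, not part of the original steps: the constant and method names (`VALID_LINE`, `expected_counts`, `actual_counts`) are illustrative assumptions, and the script assumes skipdata.txt sits in the current directory when it is run directly.

```ruby
# Layout taken from gendata.rb: three blocks of valid records, each followed
# by a small group of bad records, then one final block of valid records.
VALID_LINE = "A valid record\n"
BAD_LINE = "skiptext\n"
BLOCKS = 3
VALID_PER_BLOCK = 500_000
BAD_PER_BLOCK = 5

# Compute the record counts and file size implied by the generator's loops.
def expected_counts
  valid = (BLOCKS + 1) * VALID_PER_BLOCK
  bad = BLOCKS * BAD_PER_BLOCK
  bytes = valid * VALID_LINE.bytesize + bad * BAD_LINE.bytesize
  { valid: valid, bad: bad, lines: valid + bad, bytes: bytes }
end

# Count the valid and bad records actually present in a generated file.
def actual_counts(path)
  counts = Hash.new(0)
  File.foreach(path) { |line| counts[line] += 1 }
  { valid: counts[VALID_LINE], bad: counts[BAD_LINE], lines: counts.values.sum }
end

if __FILE__ == $PROGRAM_NAME
  puts "expected: #{expected_counts}"
  path = "skipdata.txt"
  puts "actual:   #{actual_counts(path)}" if File.exist?(path)
end
```

The expected totals are 2,000,015 lines and 30,000,135 bytes (roughly 28.6 MiB), which agrees with the `wc -l` output and the 29M that `ls -lh` reports. Only 15 of those lines are bad records, so the input is overwhelmingly valid.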