Time for action – handling dirty data by using skip mode

Let's see skip mode in action by writing a MapReduce job whose input contains records that cause it to fail:

  1. Save the following Ruby script as gendata.rb:
    File.open("skipdata.txt", "w") do |file|
      3.times do
        500000.times{file.write("A valid record\n")}
        5.times{file.write("skiptext\n")}
      end
      500000.times{file.write("A valid record\n")}
    end
  2. Run the script:
    $ ruby gendata.rb 
    
  3. Check the size of the generated file and its number of lines:
    $ ls -lh skipdata.txt
    -rw-rw-r-- 1 hadoop hadoop 29M 2011-12-17 01:53 skipdata.txt
    $ cat skipdata.txt | wc -l
    2000015
    
  4. Copy the file onto HDFS:
    $ hadoop fs -put skipdata.txt skipdata.txt
    
  5. Add the following property definition to mapred-site.xml:
    <property> <name>mapred.skip.map.max.skip.records</name> ...
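
As a sanity check on the figures reported in step 3, the expected line count and file size can be derived from the loop structure of gendata.rb without touching the file. The following is only a minimal sketch: the constant names and the expected_counts helper are illustrative and not part of the original script, and it assumes single-byte characters with a one-byte newline.

```ruby
# Mirror the loop structure of gendata.rb without writing any file.
# All names below are hypothetical helpers introduced for illustration.
VALID_LINE = "A valid record\n"
BAD_LINE = "skiptext\n"
VALID_PER_BLOCK = 500_000 # valid records written per block
BAD_PER_BLOCK = 5         # bad records written after each of the first blocks
BLOCKS_WITH_BAD = 3       # iterations of the outer 3.times loop

def expected_counts
  # Three blocks of valid records, plus the trailing block after the loop.
  valid = (BLOCKS_WITH_BAD + 1) * VALID_PER_BLOCK
  bad = BLOCKS_WITH_BAD * BAD_PER_BLOCK
  bytes = valid * VALID_LINE.bytesize + bad * BAD_LINE.bytesize
  { valid: valid, bad: bad, total: valid + bad, bytes: bytes }
end

counts = expected_counts
puts "valid=#{counts[:valid]} bad=#{counts[:bad]} total=#{counts[:total]}"
puts "bytes=#{counts[:bytes]}"
```

The total of 2,000,015 lines matches the wc -l output above, and 30,000,135 bytes is roughly 28.6 MiB, which is consistent with the 29M shown by ls -lh. Only 15 of the two million records are bad, grouped in three runs of five.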
