Time for action – summarizing the UFO data

Now that we have the data, let's get an initial summary of its size and of how many records may be incomplete:

  1. With the UFO tab-separated value (TSV) file saved on HDFS as ufo.tsv, save the following script as summarymapper.rb:
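    #!/usr/bin/env ruby
    
    # Emit one "key\t1" pair per observation so the reducer can sum them.
    while line = gets
        puts "total\t1"
        # chomp drops the trailing newline; the -1 limit keeps trailing empty fields
        parts = line.chomp.split("\t", -1)
        puts "badline\t1" if parts.size != 6
        # to_s turns a missing field (nil on short lines) into an empty string
        puts "sighted\t1" if !parts[0].to_s.empty?
        puts "recorded\t1" if !parts[1].to_s.empty?
        puts "location\t1" if !parts[2].to_s.empty?
        puts "shape\t1" if !parts[3].to_s.empty?
        puts "duration\t1" if !parts[4].to_s.empty?
        puts "description\t1" if !parts[5].to_s.empty?
    end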
  2. Make the file executable with the following command:
    $ chmod +x summarymapper.rb
    
  3. Execute the job as follows by using Streaming: ...
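
Separately from the Streaming run, the mapper's counting logic can be checked locally. The following is only a minimal sketch: `summary_keys`, `tally`, and the `SAMPLE` rows are hypothetical names and made-up data, not part of the book's scripts. `tally` is a stand-in for whatever reducer sums the emitted 1s. The sketch also assumes that the trailing newline should be trimmed and that fields missing from a short line count as empty.

```ruby
# Field names in column order, matching the keys the mapper emits.
FIELDS = %w[sighted recorded location shape duration description].freeze

# Hypothetical helper: the keys the mapper would emit for one input line.
def summary_keys(line)
  parts = line.chomp.split("\t", -1)   # -1 keeps trailing empty fields
  keys = ["total"]
  keys << "badline" if parts.size != FIELDS.size
  FIELDS.each_with_index do |name, i|
    # to_s turns a missing field (nil) into "" so short lines do not raise
    keys << name unless parts[i].to_s.empty?
  end
  keys
end

# Stand-in for the reduce step: sum the 1s emitted for each key.
def tally(lines)
  counts = Hash.new(0)
  lines.each { |line| summary_keys(line).each { |key| counts[key] += 1 } }
  counts
end

# Made-up sample rows for illustration only (not from the real data set).
SAMPLE = [
  "20000101\t20000102\tSpringfield\tdisk\t5 min\tBright object\n",
  "20000103\t20000104\tSpringfield\t\t\tNo shape given\n",
  "20000105\t20000106\tSpringfield\tlight\t2 min\n"   # only five fields
].freeze

tally(SAMPLE).each { |key, count| puts "#{key}\t#{count}" }
```

Keeping the per-line logic in a function makes it easy to test edge cases, such as empty or short lines, before submitting the job to the cluster.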
