Time for action – summarizing the UFO data
Now that we have the data, let's get an initial summary of its size and of how many records may be incomplete:
- With the UFO tab-separated value (TSV) file on HDFS saved as ufo.tsv, save the following file to summarymapper.rb:
    #!/usr/bin/env ruby

    while line = gets
      puts "total\t1"
      parts = line.split("\t")
      puts "badline\t1" if parts.size != 6
      puts "sighted\t1" if !parts[0].empty?
      puts "recorded\t1" if !parts[1].empty?
      puts "location\t1" if !parts[2].empty?
      puts "shape\t1" if !parts[3].empty?
      puts "duration\t1" if !parts[4].empty?
      puts "description\t1" if !parts[5].empty?
    end
- Make the file executable with the following command:
$ chmod +x summarymapper.rb
- Execute the job as follows by using Streaming: ...
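Separately from the Streaming job itself, the mapper's logic can be checked locally before it is submitted. The following is only a rough sketch, not the job command: it mirrors the per-line tests of summarymapper.rb and sums the emitted keys in memory, which is roughly what a summing reducer would report. The helper names `keys_for` and `summarize`, the `FIELDS` constant, and the two sample lines are made up for illustration. The `to_s` guard for missing fields is also an addition that is not in the mapper above.

```ruby
#!/usr/bin/env ruby
# Local dry run: apply the same per-line checks as summarymapper.rb and
# tally the keys in memory instead of emitting them to Hadoop Streaming.

# Field names in column order, matching the keys the mapper emits.
FIELDS = %w[sighted recorded location shape duration description].freeze

# Return the keys the mapper would emit for a single input line.
def keys_for(line)
  keys = ["total"]
  parts = line.split("\t")
  keys << "badline" if parts.size != 6
  FIELDS.each_with_index do |name, i|
    # to_s turns a missing field (nil) into "" so short lines do not raise;
    # this guard is an assumption added for the sketch.
    keys << name unless parts[i].to_s.empty?
  end
  keys
end

# Sum the number of times each key appears across all lines.
def summarize(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    keys_for(line).each { |key| counts[key] += 1 }
  end
end

# Two invented lines: one with six fields (shape and duration empty)
# and one malformed line with only two fields.
sample = [
  "d1\td2\tplace\t\t\tdesc\n",
  "only\ttwo\n"
]

summarize(sample).each { |key, count| puts "#{key}\t#{count}" }
```

Running a check like this on a handful of lines copied from ufo.tsv gives a quick sense of whether the key names and field positions are right before the full job is run.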