November 2017
Suppose we have an Apache web server access log file with entries that look like this one:
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /example.com/music/js/main.js HTTP/1.1" 200 614 "http://www.example.com/music/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
What if we are interested in knowing the top 5 most accessed JSON files?
We could perform a MapReduce-style job directly from the terminal using standard Unix text-processing commands:
$ cat access10k.log | while read line; do echo "$line" | awk '{print $7}' | grep "\.json";done | sort | uniq -c | sort -nr
234 /example.com/music/data/artist.json
232 /example.com/music/data/songs.json
227 /example.com/music/data/influencers.json
...
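The pipeline above starts a new awk and grep process for every log line and prints every JSON path rather than only the top 5. As a minimal sketch of the same idea, assuming the log format shown earlier (request path in whitespace-separated field 7), the map, shuffle, and reduce stages can be written as a single pass. The function name top_json_paths and its two arguments are hypothetical, introduced only for illustration:

```shell
# top_json_paths LOGFILE [N]
# Hypothetical helper: print the N most requested paths containing ".json"
# (N defaults to 5). Assumes the request path is field 7 of each log line.
top_json_paths() {
    logfile=$1
    limit=${2:-5}
    # Map: emit the request path of every line whose path contains ".json".
    awk '$7 ~ /\.json/ { print $7 }' "$logfile" |
        # Shuffle: sorting places identical paths next to each other.
        sort |
        # Reduce: collapse each run of identical paths into "count path".
        uniq -c |
        # Rank by count, highest first, and keep only the first N lines.
        sort -nr |
        head -n "$limit"
}
```

Calling top_json_paths access10k.log would print the five most requested JSON paths with their counts. The filter matches ".json" anywhere in the path, as the original grep does; anchoring the pattern with $ would restrict it to paths that end in .json.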