Hadoop is probably the best known and most widely used of the large-scale cluster computing frameworks. It is a collection of software that provides cluster computing capabilities across networked computers, along with a distributed storage mechanism that can be thought of as a network-accessible filesystem.
Among the utilities it provides is Hadoop Streaming (https://hadoop.apache.org/docs/r1.2.1/streaming.html), which allows Map/Reduce jobs to be created and run using any executable or script as the mapper and/or the reducer. Hadoop's operational model, at least for processes that can use Streaming, is file-centric, so processes written in Python and executed under Hadoop will tend to fall into ...
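To make the Streaming contract concrete, here is a minimal word-count sketch. The file names mapper.py and reducer.py are illustrative; the only interface Hadoop Streaming actually imposes is that each program read records from standard input and write tab-separated key/value pairs to standard output.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper for word counting.
# Streaming feeds input records to the mapper on stdin, one line at a
# time, and expects tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the matching reducer. Hadoop sorts the mapper output by
# key before handing it to the reducer, so identical words arrive on
# consecutive lines and can be summed with a running total.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word == current_word:
        count += int(value)
    else:
        # A new key begins: emit the finished total for the previous key.
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 1
# Flush the final key after the input is exhausted.
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job built from such scripts would be submitted through the hadoop-streaming JAR, passing the scripts via the -mapper, -reducer, and -file options along with HDFS -input and -output paths; the exact location of the JAR varies by installation.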