Before we jump to loading data, a quick overview of MapReduce is warranted. Although Druid comes prepackaged with a convenient MapReduce job to accommodate historical data, generally speaking, large distributed systems will need custom jobs to perform analyses over the entire data set.
MapReduce is a framework that breaks processing into two phases: a map phase and a reduce phase. In the map phase, a function is applied to the entire set of input data, one element at a time. Each application of the map function results in a set of tuples, each containing a key and a value. Tuples with the same key are then combined via the reduce function. The reduce function emits another set of tuples, typically by combining the values associated with each key into a single result.
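The two phases can be sketched in a few lines of plain Python. This is a minimal in-memory illustration of the map/shuffle/reduce flow using word counting as the example; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative and not part of any real framework's API.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record, collecting (key, value) tuples."""
    tuples = []
    for record in records:
        tuples.extend(map_fn(record))
    return tuples

def shuffle(tuples):
    """Group values by key, mimicking the shuffle between the two phases."""
    groups = defaultdict(list)
    for key, value in tuples:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key's grouped values, emitting final results."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Map function: emit a (word, 1) tuple for each word in a line.
def tokenize(line):
    return [(word, 1) for word in line.split()]

# Reduce function: combine a key's values by summing them.
def sum_counts(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, tokenize)), sum_counts)
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real distributed run, the map applications and the per-key reduce calls execute in parallel on different machines, and the shuffle moves tuples across the network so that all values for a given key land on the same reducer.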