December 2018
Beginner to intermediate
684 pages
21h 9m
English
Data management at scale, that is, hundreds of GB and beyond, require the use of multiple machines that form a cluster to conduct read, write and compute operations in parallel, that is, it is distributed over various machines.
The Hadoop ecosystem has emerged as an open-source software framework for distributed storage and processing of big data using the MapReduce programming model developed by Google. The ecosystem has diversified under the roof of the Apache Foundation and today includes numerous projects that cover different aspects of data management at scale. Key tools within Hadoop include: