Chapter 4, “Advanced Analytical Theory and Methods: Clustering,” through Chapter 9, “Advanced Analytical Theory and Methods: Text Analysis,” covered several useful analytical methods to classify, predict, and examine relationships within the data. This chapter and Chapter 11, “Advanced Analytics—Technology and Tools: In-Database Analytics,” address several aspects of collecting, storing, and processing unstructured and structured data, respectively. This chapter presents some key technologies and tools related to the Apache Hadoop software library, “a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.”
This chapter focuses on how Hadoop stores data in a distributed system and how Hadoop implements a simple programming paradigm known as MapReduce. Although this chapter makes some Java-specific references, the only intended prerequisite knowledge is a basic understanding of programming. Furthermore, the Java-specific details of writing a MapReduce program for Apache Hadoop are beyond the scope of this text. This omission may appear troublesome, but tools in the Hadoop ecosystem, such as Apache Pig and Apache Hive, can often eliminate the need to explicitly code a MapReduce program. Along with other Hadoop-related tools, Pig and Hive are covered in a portion of this chapter dealing ...