Chapter 8. Analytics with Higher-Level APIs

In Chapter 6, we touched on some of the motivations for working in a higher-level language such as Hive rather than in native MapReduce, which can be difficult, unwieldy, and verbose even for relatively simple operations. Even experienced Java and MapReduce programmers find that most non-trivial Hadoop applications entail a long development cycle in which several mappers and reducers must be written and chained together into a complex job chain or data processing workflow.

Furthermore, because MapReduce is designed to run in a batch-oriented fashion, it imposes a number of limitations on data analysis that requires iterative processing (as many machine learning algorithms do) or interactive data mining with responsive feedback. These criticisms of native MapReduce, concerning development efficiency, maintenance, and runtime performance, provide much of the motivation both for higher-level abstractions of Hadoop and for a new processing engine that extends the MapReduce paradigm.

In this chapter, we introduce Pig, a programming abstraction over MapReduce that facilitates building MapReduce-based data flows. We also introduce newer Spark APIs that extend the core RDD API, making it easier for developers to compute over structured data using familiar SQL-based concepts and syntax. These projects seek to boost developer productivity in programming MapReduce and Spark applications by providing expressive APIs that allow analysts ...
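
To make that concrete before diving in, the short sketch below shows the flavor of the Spark structured APIs: a grouped aggregation expressed first through the DataFrame API and then as SQL over a temporary view. It is a minimal, hypothetical example; the application name, the visits data, and the column names are illustrative assumptions rather than anything drawn from the chapter's datasets.

    # Minimal PySpark sketch (app name, data, and columns are illustrative assumptions).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("structured-preview").getOrCreate()

    # Build a tiny DataFrame in place so the example is self-contained.
    visits = spark.createDataFrame(
        [("alice", "/index", 3), ("bob", "/index", 1), ("alice", "/about", 2)],
        ["user", "page", "hits"],
    )

    # SQL-style operations through the DataFrame API: group, aggregate, order.
    visits.groupBy("page") \
          .agg(F.sum("hits").alias("total_hits")) \
          .orderBy(F.desc("total_hits")) \
          .show()

    # The same computation expressed as SQL against a temporary view.
    visits.createOrReplaceTempView("visits")
    spark.sql("""
        SELECT page, SUM(hits) AS total_hits
        FROM visits
        GROUP BY page
        ORDER BY total_hits DESC
    """).show()

    spark.stop()

Both forms describe the same logical computation and are planned by the same engine; the equivalent hand-written MapReduce job would require at least a mapper, a reducer, and a driver class, which is the productivity argument this chapter develops.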
