
Big Data for Chimps by Russell Jurney, Philip Kromer

Chapter 6. Grouping Operations

Some content contributed by Q. Ethan McCallum (@qethanm)

In this chapter, we will introduce grouping operations in Pig and MapReduce. We’ll teach you the schemas behind grouped data, how to inspect and sample grouped data relations, how to count records in groups, and how to use aggregate functions to calculate arbitrary statistics about groups. We’ll teach you to describe and summarize individual records, fields, or entire data tables. In so doing, we’ll explore questions such as, “Does God hate Cleveland?” and “Who are the best players for each phase of their career?”

The GROUP BY operation is fundamental to data processing, both in MapReduce and in the world of SQL. In this chapter, we will cover grouping operations in Pig, each of which takes a single line of Pig code. This is part of Pig’s power. We’ll learn how grouping operations relate to the reduce phase of MapReduce and how to combine map-only operations with GROUP BY operations to perform arbitrary operations on data relations.

Grouping operations are at the heart of MapReduce—they make use of and define the reduce operation of MapReduce, in which records with the same reduce key are grouped on a single reducer in sorted order. Thus it is possible to define a single MapReduce job that performs any number of map-only operations, followed by a grouping operation, followed by more map-only operations after the reduce. This simple pattern enables MapReduce to perform a wide array ...
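To make the map, group, map pattern concrete, here is a minimal single-process sketch in plain Python. It is not Pig and not Hadoop: the sort-then-group step only imitates what the shuffle does when it delivers records with the same reduce key to one reducer in sorted order. The sample records, the home-run threshold, and every function name (`map_phase`, `shuffle`, `reduce_phase`, `post_map`, `run_job`) are hypothetical, invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (player, year, home_runs) records, made up for this sketch.
RECORDS = [
    ("aaron", 1957, 44),
    ("ruth", 1927, 60),
    ("aaron", 1971, 47),
    ("ruth", 1921, 59),
    ("mays", 1965, 52),
]


def map_phase(rows):
    """Map-only step: filter records and emit (reduce_key, value) pairs."""
    for player, _year, home_runs in rows:
        if home_runs >= 45:          # a map-only filter
            yield player, home_runs  # the player is the reduce key


def shuffle(pairs):
    """Imitate the shuffle: sort by key so equal keys are adjacent, then group."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]


def reduce_phase(grouped):
    """Reduce step: compute aggregates over each group of values."""
    for key, values in grouped:
        yield key, len(values), max(values)


def post_map(rows):
    """Map-only step after the reduce: format each aggregate as a TSV line."""
    for key, count, best in rows:
        yield f"{key}\t{count}\t{best}"


def run_job(rows):
    """Chain the stages: map-only, group and reduce, then map-only again."""
    return list(post_map(reduce_phase(shuffle(map_phase(rows)))))


if __name__ == "__main__":
    for line in run_job(RECORDS):
        print(line)
```

Each stage is a generator, so records stream from one step to the next. On a real cluster the sort and grouping are distributed across many reducers rather than done in a single `sorted` call.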
