9. Building Data Transformation Workflows with Pig and Cascading

Collecting and processing large amounts of data can be a complicated task. Fortunately, many common data processing challenges can be broken down into smaller problems. Open-source software tools allow us to shard and distribute data transformation jobs across many machines, using strategies such as MapReduce.

Although frameworks such as Hadoop manage much of the complexity of splitting large MapReduce jobs into tasks and farming them out to individual machines in a cluster, we still need to define exactly how the data will be processed. Do we want to alter the data in some way? Should we split it up, or combine it with another source?
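
As a taste of what such a definition looks like, here is a minimal Pig Latin sketch of a workflow that alters one dataset, combines it with another, and aggregates the result. The file paths, field names, and schemas are illustrative assumptions for this sketch, not examples from a real dataset:

    -- Load two tab-delimited datasets (paths and schemas are assumptions)
    logs  = LOAD 'input/logs.tsv'  USING PigStorage('\t')
            AS (user_id:chararray, url:chararray, bytes:long);
    users = LOAD 'input/users.tsv' USING PigStorage('\t')
            AS (user_id:chararray, country:chararray);

    -- Alter the data: keep only records that transferred at least one byte
    big_logs = FILTER logs BY bytes > 0;

    -- Combine with another source: join on the shared user_id field
    joined = JOIN big_logs BY user_id, users BY user_id;

    -- Aggregate: total bytes transferred per country
    by_country = GROUP joined BY users::country;
    totals = FOREACH by_country GENERATE
             group AS country,
             SUM(joined.big_logs::bytes) AS total_bytes;

    STORE totals INTO 'output/bytes_by_country';

Each statement describes a transformation step; Pig compiles the whole script into one or more MapReduce jobs and runs them on the cluster, so we never write mapper or reducer code by hand.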

With large amounts of data coming from many ...
