Chapter 4. Data Organization Patterns
In contrast to the previous chapter on filtering, this chapter is all about reorganizing data. The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted. This is especially true in distributed systems, where partitioning, sharding, and sorting can be exploited for performance.
In many organizations, Hadoop and other MapReduce solutions are only one piece of a larger data analysis platform. Data will typically have to be transformed to interface cleanly with the other systems. Likewise, data might have to be transformed from its original state into a new one to make analysis in MapReduce easier.
This chapter covers several pattern subcategories, which are identified in each pattern description:
The structured to hierarchical pattern
The partitioning and binning patterns
The total order sorting and shuffling patterns
The patterns in this chapter are often used together to solve data organization problems. For example, you may want to restructure your data to be hierarchical, bin the data, and then sort each bin. See Job Chaining in Chapter 6 for more details on how to combine patterns to solve more complex problems.
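To make the idea of combining patterns concrete, here is a minimal in-memory sketch of the three steps named above: restructure into a hierarchy, bin, and sort each bin. It is plain Python rather than a MapReduce job, and the record layout (posts with an `id` and `year`, comments with a `post_id`) and the helper names `to_hierarchy`, `bin_by`, `sort_bins`, and `organize` are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical flat input: posts and comments stored as separate record sets.
posts = [
    {"id": 2, "year": 2011, "title": "Sorting"},
    {"id": 1, "year": 2010, "title": "Binning"},
    {"id": 3, "year": 2010, "title": "Sharding"},
]
comments = [
    {"post_id": 1, "text": "Nice"},
    {"post_id": 3, "text": "Thanks"},
    {"post_id": 1, "text": "Agreed"},
]


def to_hierarchy(posts, comments):
    """Step 1: nest each comment under its parent post."""
    by_post = defaultdict(list)
    for comment in comments:
        by_post[comment["post_id"]].append(comment["text"])
    # Copy each post and attach its (possibly empty) list of comments.
    return [dict(post, comments=by_post.get(post["id"], [])) for post in posts]


def bin_by(records, key):
    """Step 2: place each record into the bin named by its `key` value."""
    bins = defaultdict(list)
    for record in records:
        bins[record[key]].append(record)
    return dict(bins)


def sort_bins(bins, key):
    """Step 3: sort the records inside every bin by `key`."""
    return {name: sorted(records, key=lambda r: r[key])
            for name, records in bins.items()}


def organize(posts, comments):
    """Chain the three steps, each consuming the previous step's output."""
    hierarchical = to_hierarchy(posts, comments)
    return sort_bins(bin_by(hierarchical, "year"), "id")


result = organize(posts, comments)
for year in sorted(result):
    print(year, [(p["id"], p["comments"]) for p in result[year]])
```

Each step consumes the output of the previous one, which is roughly the shape a chain of separate jobs would take. In a real cluster each step would run as its own distributed job over far more data than fits in memory.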
Structured to Hierarchical
Pattern Description
The structured to hierarchical pattern creates new records from data that started in a very different structure. Because of its importance, this pattern in many ways stands alone in the chapter.
Intent
Transform ...