Hadoop Operations and Cluster Management Cookbook by Shumin Guo

Using compression for input and output

A typical MapReduce job uses parallel map tasks to load data from external storage devices, such as hard drives, into main memory. When the job finishes, the reduce tasks write the result data back to the hard drive. Thus, over the life cycle of a MapReduce job, many data copies are made as data is relayed between the hard drive and main memory, and sometimes the data is also copied over the network from a remote node.

Copying data to and from hard drives and transferring it over the network are expensive operations. To reduce the cost of these operations, Hadoop supports compression of the data.
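For concreteness, here is a minimal sketch (not taken from this recipe) of how a job might enable compression for both the intermediate map output and the final reducer output. The property names follow Hadoop 2.x (older releases use the mapred.* equivalents), and the codec choices, class name, and paths are illustrative assumptions.

    // Sketch: enable compression for map output and job output.
    // Mapper/reducer classes are omitted for brevity; the identity
    // defaults are used. SnappyCodec assumes the Snappy native
    // library is available on the cluster nodes.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedJobExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress intermediate map output to cut shuffle traffic
            // (Snappy trades compression ratio for speed).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-output-example");
            job.setJarByClass(CompressedJobExample.class);

            // Compress the final output written by the reduce tasks.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compressing the map output mainly reduces shuffle traffic between nodes, while compressing the job output reduces the amount of data written back to the hard drives.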

Data compression in Hadoop is done by a compression codec, which is a program that encodes and decodes ...
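As an illustration of what a codec does, the following hedged sketch uses CompressionCodecFactory to pick a codec from a file's suffix (for example .gz) and wraps the raw HDFS stream with the codec's decompressing stream; the class name and input path are hypothetical.

    // Sketch: decode a compressed file through the codec matching its suffix.
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path input = new Path(args[0]); // e.g. a .gz file on HDFS
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(input);

            try (InputStream raw = fs.open(input)) {
                // Decode through the codec if one matches, else read raw bytes.
                InputStream in = (codec != null) ? codec.createInputStream(raw) : raw;
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }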
