Chapter 11. Other File Formats and Compression

One of Hive’s unique features is that it does not force data to be converted to a specific format. Hive leverages Hadoop’s InputFormat APIs to read data from a variety of sources, such as text files, sequence files, or even custom formats. Likewise, the OutputFormat APIs are used to write data to various formats.
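As a minimal sketch of this flexibility, a table can be declared with a non-default storage format at creation time. The table and column names below are illustrative; `STORED AS SEQUENCEFILE` is shorthand that selects the corresponding sequence-file input and output format classes:

```sql
-- Hypothetical table stored as Hadoop sequence files rather than plain text.
CREATE TABLE logs (line STRING)
STORED AS SEQUENCEFILE;
```

Hive also accepts explicit `STORED AS INPUTFORMAT ... OUTPUTFORMAT ...` clauses naming the Java classes directly, which is how custom formats are plugged in.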

While Hadoop offers linear scalability in file storage for uncompressed data, storing data in compressed form has many benefits. Compression typically saves significant disk storage; for example, text-based files may compress 40% or more. Compression can also increase throughput and performance. This may seem counterintuitive, because compressing and decompressing data incurs extra CPU overhead. However, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
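A quick way to see why text compresses so well is to compress a sample of repetitive, log-like data and compare sizes. This sketch uses standard command-line tools; the file names and generated content are arbitrary:

```shell
# Generate ~300 KB of log-like text; repeated prefixes compress very well.
seq 1 10000 | sed 's/^/2012-01-01 INFO request id=/' > sample.log
# Compress with gzip, keeping the original for comparison.
gzip -c sample.log > sample.log.gz
# Print both sizes in bytes.
wc -c sample.log sample.log.gz
```

Your exact ratio will vary with the data, but highly repetitive text routinely shrinks far more than the 40% figure cited above.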

Hadoop jobs tend to be I/O bound rather than CPU bound, in which case compression will improve performance. However, if your jobs are CPU bound, compression will probably lower your performance. The only way to know for certain is to experiment with different options and measure the results.

Determining Installed Codecs

Depending on your Hadoop version, different codecs will be available to you. The SET command in Hive can be used to display the value of a hiveconf or Hadoop configuration variable. The available codecs appear in a comma-separated list in the io.compression.codecs property:

# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache ...
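Once you know which codecs are installed, you can enable compressed output for a query session. A minimal sketch, assuming the GzipCodec appears in your codec list (the property names below are standard Hive and Hadoop settings; the choice of Gzip is illustrative):

```sql
-- Compress final query output using gzip (assumes GzipCodec is available).
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

These settings apply only to the current session; they can also be placed in hive-site.xml to take effect globally.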
