O'Reilly logo

Programming Hive by Jason Rutherglen, Dean Wampler, Edward Capriolo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 11. Other File Formats and Compression

One of Hive’s unique features is that Hive does not force data to be converted to a specific format. Hive leverages Hadoop’s InputFormat APIs to read data from a variety of sources, such as text files, sequence files, or even custom formats. Likewise, the OutputFormat API is used to write data to various formats.

While Hadoop offers linear scalability in file storage for uncompressed data, storing data in compressed form has many benefits. Compression typically saves significant disk storage; for example, text-based files may compress 40% or more. Compression also can increase throughput and performance. This may seem counterintuitive because compressing and decompressing data incurs extra CPU overhead, however, the I/O savings resulting from moving fewer bytes into memory can result in a net performance gain.

Hadoop jobs tend to be I/O bound, rather than CPU bound. If so, compression will improve performance. However, if your jobs are CPU bound, then compression will probably lower your performance. The only way to really know is to experiment with different options and measure the results.

Determining Installed Codecs

Based on your Hadoop version, different codecs will be available to you. The set feature in Hive can be used to display the value of hiveconf or Hadoop configuration values. The codecs available are in a comma-separated list named io.compression.codec:

# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required