Chapter 11. Other File Formats and Compression
One of Hive’s unique features is that it does not force data to be converted to a specific format. Hive leverages Hadoop’s InputFormat APIs to read data from a variety of sources, such as text files, sequence files, or even custom formats. Likewise, the OutputFormat API is used to write data to various formats.
While Hadoop offers linear scalability in file storage for uncompressed data, storing data in compressed form has many benefits. Compression typically saves significant disk storage; for example, text-based files may compress 40% or more. Compression can also increase throughput and performance. This may seem counterintuitive, because compressing and decompressing data incurs extra CPU overhead. However, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound rather than CPU bound. When that is the case, compression will usually improve performance. However, if your jobs are CPU bound, compression will probably lower your performance. The only way to really know is to experiment with different options and measure the results.
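As a rough illustration of the "experiment and measure" advice, the following Python sketch compares compression ratio and CPU time for a few codecs on synthetic tab-delimited text. It is only a local stand-in, not a Hadoop or Hive benchmark: the helper names (sample_text, measure, compare_codecs) and the generated rows are hypothetical, and the codecs are the ones in Python's standard library rather than Hadoop's codec classes. Real results depend on your own data and cluster.

```python
import bz2
import gzip
import time
import zlib


def sample_text(rows=20000):
    """Build synthetic tab-delimited rows (hypothetical data, not a real table)."""
    lines = [
        f"{i}\tuser{i % 100}\t2012-01-{i % 28 + 1:02d}\tclick"
        for i in range(rows)
    ]
    return "\n".join(lines).encode("utf-8")


def measure(name, compress, data):
    """Compress data once and report size ratio and elapsed wall-clock time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "codec": name,
        "ratio": len(packed) / len(data),  # smaller means better compression
        "seconds": elapsed,                # rough proxy for CPU cost
    }


def compare_codecs(data):
    """Run each standard-library codec over the same input."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "deflate": zlib.compress,
    }
    return [measure(name, fn, data) for name, fn in codecs.items()]


if __name__ == "__main__":
    for result in compare_codecs(sample_text()):
        print(f"{result['codec']:8s} ratio={result['ratio']:.3f} "
              f"time={result['seconds']:.4f}s")
```

A codec that yields a smaller ratio but takes longer trades CPU for I/O, which is the same trade-off an I/O-bound or CPU-bound job would expose at larger scale.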
Determining Installed Codecs
Depending on your Hadoop version, different codecs will be available to you. The set feature in Hive can be used to display the values of hiveconf variables or Hadoop configuration properties. The
codecs available are in a comma-separated list named