Chapter 14. Data Compression

Data compression is the process of reducing the number of bits used to represent data. It is one of the most significant results of information theory, an area of mathematics that addresses various ways to manage and manipulate information. Data compression entails two processes: in one process the data is compressed, or encoded, to reduce its size; in a second process it is uncompressed, or decoded, to return it to its original state.

To understand why data compression is possible, we must first understand that all data can be characterized by some informational content, called its entropy (a term borrowed from thermodynamics). Compression is possible because most data is represented with more bits than its entropy suggests is optimal. To gauge the effectiveness of compression, we look at the ratio of the size of the compressed data divided by its original size, and subtract this from 1. This value is known as the data’s compression ratio .

In the broadest sense, data compression methods are divided into two classes: lossy and lossless. In lossy compression we accept a certain loss of accuracy in exchange for greater compression ratios. This is acceptable in some applications, such as graphics and sound processing, provided the degradation is managed carefully. However, frequently we use lossless compression, which ensures that an exact copy of the original data is reproduced when uncompressed.

This chapter focuses on lossless compression, for which ...

Get Mastering Algorithms with C now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.