One of the oldest and most elegant forms of data compression is Huffman coding, an algorithm based on minimum redundancy coding. Minimum redundancy coding suggests that if we know how often different symbols occur in a set of data, we can represent the symbols in a way that makes the data require less space. The idea is to encode symbols that occur more frequently with fewer bits than those that occur less frequently. It is important to realize that a symbol is not necessarily a character of text: a symbol can be any amount of data we choose, but it is often one byte’s worth.
To begin, let’s revisit the concept of entropy introduced at the beginning of the chapter. Recall that every set of data has some informational content, which is called its entropy. The entropy of a set of data is the sum of the entropies of each of its symbols. The entropy S of a symbol z is defined as:
Sz = -lgPz
where Pz is the probability of z being found in the data. If it is known exactly how many times z occurs, Pz is referred to as the frequency of z. As an example, if z occurs 8 times in 32 symbols, or one-fourth of the time, the entropy of z is:
-lg(1/4) = 2 bits
This means that using any more than two bits to represent z is more than we need. If we consider that normally we represent a symbol using eight bits (one byte), we see that compression here has the potential to improve the representation a great deal.
Table 14.1 presents an example ...