Chapter 5. Statistical Encoding

Statistically Compressing to Entropy

A variable-length code (VLC) takes a given symbol from your input stream and assigns it a codeword that has a variable number of bits. When you apply this to a long-enough data set, you end up with an average bit-stream length that is close to the entropy of the data set—as long as the probability of your symbol distribution matches the probability table of the encoding scheme you used to build your VLC.

But let’s clear something up: apart from a few situations, the VLCs discussed in Chapter 4 aren’t  used much in the mainstream compression world. As we mentioned in that chapter, each code is built making different assumptions about the statistical probabilities of each symbol.

Consider the chaos of using these VLCs in the real world: you’re given an arbitrary data set, and you need to pick the best VLC to encode it, such that the final stream is close to the entropy size. Picking the wrong code for your data set can be disastrous, as you might end up with a compressed data stream that is bigger than the original!

So, you’d need to calculate your stream’s symbol probabilities and then test against all the known VLC codes to determine which one best matches the symbol distribution for your data set. And even then, there’s a chance that your data set might have a probability distribution that doesn’t match exactly with an existing VLC.1

The answer to this is a class of algorithms called statistical encoders, which, ...

Get Understanding Compression now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.