Chapter 6. Adaptive Statistical Encoding

Locality Matters for Entropy

All the statistical encoders discussed in Chapter 5 require an initial pass through the data to compute probabilities before encoding can start. This leaves us with two shortcomings: you always need an extra pass through the data to calculate the probabilities, and once you have calculated the probabilities for the entire data set, you are stuck with those numbers. Neither of these is a problem for relatively small data sets.
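
To make that two-pass structure concrete, here is a minimal Python sketch (not code from this book); build_probability_table and encode_symbol are hypothetical names, and encode_symbol stands in for whichever statistical coder (Huffman, arithmetic, and so on) you pair with the table:

    from collections import Counter

    def build_probability_table(data):
        # Pass 1: scan the entire data set to get one global probability table.
        counts = Counter(data)
        total = sum(counts.values())
        return {symbol: count / total for symbol, count in counts.items()}

    def encode_two_pass(data, encode_symbol):
        # Pass 2: encode every symbol against that single, fixed table.
        # The whole data set must be available up front, and the table
        # never changes, no matter how the local statistics drift.
        table = build_probability_table(data)
        return [encode_symbol(symbol, table) for symbol in data]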

However, the larger the size of the data set that you’re compressing, the more bloated your statistical encoding will be with respect to entropy. This is because different subsections of the data set will have different probability characteristics. And if you’re dealing with streaming data—a video or audio stream, for example—there’s no “end” to the data set, so you just can’t “take two passes.”

These concepts apply to streaming data as well, but let's look at them in the context of a relatively simple example data set. The first third of your stream might contain an excessive number of Q's, whereas the last two thirds might have exactly none. The probability tables for your statistical encoder would not account for this locality. If the symbol Q has a probability of 0.01, it is expected to appear more or less evenly along the entirety of the stream; that is, about one in every 100 values would be a Q.
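
To put a rough number on that Q example, the following sketch (illustrative, with made-up data) compares the cost of one global probability table against per-section tables, using Shannon entropy as the lower bound on bits per symbol; for the moment, ignore the overhead of storing the extra tables:

    import math
    from collections import Counter

    def entropy(data):
        # Shannon entropy in bits per symbol: H = -sum(p * log2(p)).
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Made-up stream: Q is common in the first third and absent afterward.
    first_third = "Q" * 10 + "AB" * 5   # 20 symbols, half of them Q
    last_two_thirds = "AB" * 20          # 40 symbols, no Q at all
    stream = first_third + last_two_thirds

    global_bits = entropy(stream)        # one table for the whole stream
    local_bits = (len(first_third) * entropy(first_third) +
                  len(last_two_thirds) * entropy(last_two_thirds)) / len(stream)

    print(f"one global table:   {global_bits:.2f} bits/symbol")  # ~1.48
    print(f"per-section tables: {local_bits:.2f} bits/symbol")   # ~1.17

The single global table charges every section for symbols that only some sections contain, which is exactly the bloat relative to entropy described above.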

This is not how real data works. There’s always some sort of locality-dependent ...
