Chapter 8. Contextual Data Transforms

Before we begin with this chapter, let’s take a moment to recap.

Statistical encoders work by assigning a variable-length codeword to each symbol. Compression comes from giving the smaller codewords to the more frequently occurring symbols. The tokenization process of dictionary transforms works by identifying the longest, most probable symbols for a data set; in effect, it finds the best set of symbols so the data can be encoded more efficiently. Technically speaking, we could just use that process to identify the best symbols and then feed them back into a statistical encoder to get some compression. The real power of the LZ method, however, is that we don’t do that; instead, we represent the matching information as a series of output pairs with lower entropy, which we then compress.
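
To make that concrete, here is a minimal sketch of LZ77-style tokenization (an illustration, not the book’s or any particular library’s implementation): runs that repeat earlier data are replaced with (distance, length) pairs, and everything else is emitted as literals. The window size, minimum match length, and token layout are assumptions chosen for readability.

    # A minimal LZ77-style tokenizer (illustrative sketch).
    # Repeated runs become (distance, length) pairs; everything else is a literal.
    def lz_tokenize(data, window=4096, min_match=3):
        tokens = []
        i = 0
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while i + length < len(data) and data[j + length] == data[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_match:
                tokens.append(("match", best_dist, best_len))  # points back into the window
                i += best_len
            else:
                tokens.append(("lit", data[i]))                 # raw byte, no usable match
                i += 1
        return tokens

    # b"abcabcabcabd" becomes three literals, one ("match", 3, 8) pair, and one literal.
    print(lz_tokenize(b"abcabcabcabd"))

The pair stream is far more repetitive and skewed than the raw bytes, which is exactly what a statistical encoder wants to see.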

In addition to dictionary transforms, there’s an entire suite of other great transforms that work on the same principle: given some set of adjacent symbols, transform them in a way that makes them more compressible. We like to call these kinds of transforms “contextual,” because they all take into account preceding or adjacent symbols when considering ideal ways to encode the data.

The goal is always to transform the information so that statistical encoders can come through and compress the result more efficiently.
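
As a simple illustration of the principle, here is a sketch of delta coding, one common contextual transform (shown here only as an example of the idea): each value is replaced by its difference from the preceding value, so slowly changing data collapses into a stream of small, highly skewed numbers that a statistical encoder handles much better than the raw values.

    # Delta coding: a simple contextual transform (illustrative sketch).
    def delta_encode(values):
        prev, out = 0, []
        for v in values:
            out.append(v - prev)   # small when adjacent values are close together
            prev = v
        return out

    def delta_decode(deltas):
        prev, out = 0, []
        for d in deltas:
            prev += d              # accumulate differences to undo the transform
            out.append(prev)
        return out

    readings = [1000, 1002, 1001, 1005, 1007, 1006]
    print(delta_encode(readings))  # [1000, 2, -1, 4, 2, -1]
    assert delta_decode(delta_encode(readings)) == readings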

You could transform your data in lots of different ways, but there are three big ones that matter the most to modern data compression: ...
