For all of these codes, an integer N occurring with probability PN is encoded into or represented
by a codeword of L binary digits. A fundamental result of information theory is that for optimal
coding all codewords have a length L = log2(1 ~IN) bits, leading to the important general rule
that frequent values should have short codewords and less frequent values should have longer
Efficiency: An efficient code, or compact code, is assumed to be one which, for each value,
approximates the minimum codeword length in comparison with other codes. In many codes
the codeword length is nearly a constant multiple of the binary representation. For these
codes the expansion relative to binary is a useful indication of efficiency.
Length indication: Every universal representation must somehow indicate its length or
signal the end of the bit stream for the codeword. Some representations are terminated by
a special bit pattern or comma. Others have an explicit length as part of the representation,
the length itself often using one of the terminated representations.
It is pleasant for human understanding, but certainly not essential, if the
value is easily extracted once a codeword is recognized. Some codes, for example, have a
portion of the codeword as the natural binary representation of the value.
The increasingly subtle coding techniques used to increase code efficiency
often reduce the redundancy of the code. But the sensitivity to transmission errors usually
increases with the decreasing redundancy. In extreme cases a single bit error can lead to
catastrophic error propagation and loss of codeword demarcation. Codes with a terminating
pattem or comma tend to be more robust and those with an encoded length less robust.
This describes a code that is self-contained, in that a codeword can be
decoded without reference to adjoining codewords. A non-instantaneous code is fine in a
long sequence of codewords from the same code, but may be inconvenient if the compact
codes are interspersed with other, instantaneous, codes.
As a final introductory note, within this chapter the term bit will always mean binary digit
rather than an information measure. A codeword (the encoded collection of bits to represent the
value) is usually longer than the binary representation of the value (also in bits).
Many representations for an integer N combine a visible digit vector d with an implicit weight
vector w, such that N = d- w, the scalar product of the two vectors. A polynomial in some base
b results if successive terms of the weight vector are given by Wi+l ~ bwi. While an obvious
example of a polynomial representation is the familiar decimal representation, it is more instructive
to consider some other forms.
(b = 2) The weight vector is powers of 2, least to the fight, or w = { .... 128, 64,
32, 16, 8, 4, 2, 1 } and the visible digits di ~ {0, 1 }. The (decimal) value 23 is represented
by 10111
23=lx16+0x8+lx4+lx2+l x 1.
(b = 3) The weight vector is now w = { .... 729, 243, 81, 27, 9, 3, 1} and
di ~ {0, 1, 2}. 20910 is represented by 21202
209=2 81+ 1 •215 32+2 1.

Get Lossless Compression Handbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.