CHAPTER 6 / DICTIONARY-BASED DATA COMPRESSION 165
trie-reverse trie pair as follows. Given string Q it is possible to compute the hash value of string
Qa,
the concatenation of Q with an arbitrary character a, from the hash value
h(Q)
in O(1) time by
using the fact that
h(Qa)
= [h(Q).
I~1 + a]
(mod p). Similarly one can compute the hash value
ofstring Q[2 : IQI]in
O(1)timeviathefactthath(Q[2:lQI]) = [h(Q)-lQl'lEl'a]
(mod p).
6.5 BENCHMARK PROGRAMS AND STANDARDS
Many variants of the LZ77 and LZ78 methods have become benchmarks for dictionary-based
compression. Almost all the archiving programs such as zip, PKZip, LHArc, ARJ, and
gzip make use of the LZ77 algorithm (or its LZSS variant) after the files are merged together
in a reversible fashion. Below we describe some of the better known benchmark compression
programs and some compression standards based on the dictionary compression paradigm.
6.5.1 The gzip Program
The
gzip
compression program was distributed by the Free Software Foundation as a utility for
the GNU operating system. It provides a modification of the LZSS compression method, often
giving good compression results in a reasonable amount of time. The main innovation in gzip
lies in searching for the "best" match in the uncompressed portion of the text. For this purpose
gzip builds a hash table to store three-character phrases observed in the compressed part of the
input. The hash table stores (the location of) every occurrence of each three-character phrase in
a linked list where the order of the occurrences is determined in a move-to-front fashion (i.e., the
more recently an occurrence has been accessed, the closer to the top of the link list it is located).
This approach improves the search time for cases where more recently accessed three-character
phrases (and characters following them in the input string) have a higher chance of being accessed
again. It is possible to limit the length of the linked lists corresponding to each hash value so as
to put an upper bound on the time to perform a search with a given three-character phrase. This,
however, may result in a decrease in compression performance.
The gzip program is also capable of performing greedy parsing with one-step lookahead and
return not the longest phrase but one which is shorter if the next phrase obtained is composed of
a single character. Note that for the general implementation of LZ77, such a lookahead cannot
provide any improvement in compression; this becomes an issue in gzip only because it does
not maintain all the dictionary entries (i.e., all substrings of the compressed portion of the input)
implicitly suggested by LZ77. The gzip program further tries to improve the compression
performance of LZ77 by Huffman coding the
offset
and
length
entries of each codeword separately.
6.5.2 The
compress
Program
Developed as a UNIX operating system utility, the
compress
program is essentially based on
the LZW compression method. The dictionary is initialized with 256 entries corresponding to
the possible combinations of 8 bits (that is, that this program processes the file to be compressed
in a byte-wise fashion). The maximum size of the dictionary is variable, so it could be set to a
value in the range 29, 21~ ..... 216. At any step of the execution of the algorithm each codeword is
encoded by b = [log 2 (IDI + 1)] bits, where IDI is the size of the dictionary. Once the dictionary
size reaches the upper limit, a statistical check is performed at regular intervals so as to determine
whether reasonable compression is being achieved without any alterations in the existing state of
the dictionary. If the compression ratio falls below a preset threshold, the dictionary is reset to the
initial state.

Get Lossless Compression Handbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.