234 PART III/APPLICATIONS
slow as is DMC; bzip2 strikes a middle groundmit gives better results than Gzip but is not an
on-line algorithm because it needs the entire text or blocks of text in memory to perform the
BWT transform. LZ77 methods (Gzip) are fastest for decompression, then the LZ78 technique,
and then Huffman coders, and the methods using arithmetic coding are the slowest. Huffman
coding is better for static applications, whereas arithmetic coding is preferable in adaptive and
on-line coding. Bzip2 decodes faster than most other methods and it achieves good compression
as well. A lot of new research on Bzip2 (see Section 10.4) has been carded out recently to push
the performance envelope of Bzip2 in terms of both compression ratio and speed, and as a result
bzip2 has become a strong contender to replace the popularity of Gzip and compress.
New research is under way to improve the compression performance of many of the algo-
rithms. However, these efforts seem to have come to a point of saturation regarding lowering the
compression ratio. To get a significant further improvement in compression, other means such as
transforming the text before actual compression and the use of grammatical and semantic informa-
tion to improve prediction models should be looked into. Shannon made some experiments with
native speakers of the English language and estimated that the English language has an entropy
of around 1.3 BPC. Thus, it seems that lossless text compression research is now confronted with
the challenge of bridging a gap of about 0.8 BPC in terms of compression ratio. Of course, com-
bining compression performance with other performance metrics like speed, memory overhead,
and on-line capabilities seems to pose even a bigger challenge.
10.4 TRANSFORM-BASED METHODS: STAR (*) TRANSFORM
AND LENGTH-INDEX PRESERVING TRANSFORM
In this section we present our research on new transformation techniques that can be used as
preprocessing steps for the compression algorithms described in the previous section. The basic
idea is to transform the text into some intermediate form, which can be compressed with better
efficiency. The transformation is designed to exploit the natural redundancy of the language. We
have developed a class of such transformations, each giving better compression performance over
the previous ones and most of them giving better compression over current and classical compres-
sion algorithms discussed in the previous section. We first present a brief description of the first
transform called the Star Transform (also denoted *-encoding). We then present four new trans-
forms called length-index preserving transform (LIPT), initial letter preserving transform (ILPT),
number index transform (NIT), and letter index transform (LIT), which produce better results.
The algorithms use a fixed amount of storage overhead in the form of a word dictionary for the
particular corpus of interest and must be shared by the sender and receiver of the compressed files.
The typical size of a dictionary for the English language is about 0.5 MB and can be downloaded
along with application programs. If the compression algorithms are going to be used over and
over again, which is true in all practical applications, the amortized storage overhead for the
dictionary is negligibly small. We will present experimental results measuring the performance
(compression ratio, compression times, and decompression times) of our proposed preprocessing
techniques using three corpora: the Calgary, Canterbury, and Gutenberg corpora.
10.4.1 Star (*) Transformation
The basic idea underlying the star transformations is to define a unique signature of a word by
replacing letters in a word with a special placeholder character (*) and keeping a minimum number
of characters to identify the word uniquely [ 15, 20, 21]. For an English language dictionary D
of size 60,000 words, we observed that we needed at most two characters of the original words

Get Lossless Compression Handbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.