CHAPTER 14 / COMPRESSION OF UNICODE FILES
The important point is how a given compressor behaves on different versions of the same file,
rather than the relative performance of the compressors on each file; the differences between
compressors are well known.
Data compression results are traditionally given in output bits per input byte, or bits per byte,
with a recent move to give performance in bits per character (bpc). As Unicode no longer retains
the identity between bytes and characters, the procedure in this paper is to present all results in
bits per character, related to the 16-bit units of the UCS-2 files. Thus ASCII files are shown as
bits per byte, while Unicode results are always in bits per Unicode character. The compression of
UTF-8 files, considered as arbitrary files without respect to their encoded information, is a quite
different matter, which is left until Section 14.6.
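The bits-per-character measure described above can be sketched as follows. This is only an illustration: the function names and the file sizes in the usage note are hypothetical and do not come from the chapter, and a UCS-2 file is assumed to hold exactly two bytes per character.

```python
def bits_per_character(compressed_bytes: int, n_characters: int) -> float:
    """Compressed output bits divided by the number of input characters."""
    if n_characters <= 0:
        raise ValueError("n_characters must be positive")
    return 8 * compressed_bytes / n_characters


def ucs2_units(data: bytes) -> int:
    """Number of 16-bit units in a UCS-2 file (assumed: two bytes per unit)."""
    if len(data) % 2 != 0:
        raise ValueError("UCS-2 data must have an even number of bytes")
    return len(data) // 2


def as_bits_per_byte(bpc_ucs2: float) -> float:
    """Convert bits per 16-bit unit to bits per input byte by halving."""
    return bpc_ucs2 / 2
```

For example, a hypothetical 20,000-byte UCS-2 file holds 10,000 characters; if it compresses to 7,500 bytes, `bits_per_character(7500, 10000)` gives 6.0 bits per character, which `as_bits_per_byte` converts to 3.0 bits per byte. For an ASCII file the character count equals the byte count, so the two measures coincide.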
14.5 COMPARISONS
Results for the different compressors and file formats are shown in Table 14.4. In all cases the
results are simply given as bits per character, averaged over each corpus. (The compression for
UCS-2 often looks poor, but remember that it is given in bits per two bytes and the value must
be halved for comparison with the traditional "bits per byte".)
• The "split" files are not successful, because the division destroys too much contextual
  information.
• The constant-order PPM compressor shows the expected degradation on the UCS-2 files.
• Unix Compress also gives poor performance on the UCS-2 files.
• The two 8-bit LZ-77 compressors, GZIP and LZU-8, both give a small deterioration in
  UCS-2 compared with UTF-8.
• The 16-bit LZ-77 compressor LZU-16 gives quite different results for the different "endian"
  files, emphasizing the need to get the alignment correct, even in the full 16-bit character
  mode.
Table 14.4 Compression Results, in Bits per Character
(The second line of each pair names the compressor's algorithm family.)

                                      UCS-2       UCS-2
Compressor          Corpus    UTF-8   Big-endian  Little-endian  Split
COMP2 order-4       ASCII     2.38    2.95        2.95           2.38
Constant order PPM  Unicode   5.47    5.96        6.06           6.12
BZIP                ASCII     2.30    2.31        2.31           2.31
Burrows-Wheeler     Unicode   5.03    5.24        5.39           5.79
Compress            ASCII     3.66    4.55        4.55           3.72
LZ-78 or LZW        Unicode   7.87    7.91        7.30           7.30
GZIP                ASCII     2.67    3.25        3.25           2.68
LZ-77               Unicode   5.69    6.29        6.32           6.13
LZU-8               ASCII     3.24    4.30        4.30           3.24
LZ-77, 8-bit mode   Unicode   6.76    7.61        7.72           7.39
LZU-16              ASCII     4.21    3.39        3.65           4.20
LZ-77, 16-bit mode  Unicode   8.60    6.54        7.97           9.04
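A comparison of this kind can be approximated with standard tools. The sketch below is not the procedure used to produce Table 14.4. It assumes Python's `zlib` (DEFLATE, the method used by GZIP) and `bz2` (bzip2, a Burrows-Wheeler compressor) as stand-ins for two of the table's compressors, and it assumes the text lies within the Basic Multilingual Plane so that UTF-16 without a byte-order mark is equivalent to UCS-2. The names `ENCODINGS`, `COMPRESSORS`, and `compare` are illustrative, and the results will not reproduce the published figures.

```python
import bz2
import zlib

# Assumed encodings: "utf-16-be"/"utf-16-le" emit no byte-order mark, so for
# BMP-only text they yield exactly two bytes per character, as in UCS-2.
ENCODINGS = {
    "UTF-8": "utf-8",
    "UCS-2 big-endian": "utf-16-be",
    "UCS-2 little-endian": "utf-16-le",
}

# Stand-in compressors, not the ones measured in the table.
COMPRESSORS = {
    "zlib (DEFLATE)": lambda data: zlib.compress(data, 9),
    "bz2 (Burrows-Wheeler)": lambda data: bz2.compress(data, 9),
}


def compare(text: str) -> dict:
    """Return {(compressor, encoding): bits per character} for one text."""
    if not text:
        raise ValueError("text must not be empty")
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("text must stay within the BMP to be valid UCS-2")
    results = {}
    for cname, compress in COMPRESSORS.items():
        for ename, codec in ENCODINGS.items():
            encoded = text.encode(codec)
            # Always divide by the character count, never the byte count.
            results[(cname, ename)] = 8 * len(compress(encoded)) / len(text)
    return results


if __name__ == "__main__":
    sample = "Ελληνικό κείμενο για δοκιμή. " * 200  # illustrative text only
    for (cname, ename), bpc in compare(sample).items():
        print(f"{cname:24s} {ename:20s} {bpc:6.2f} bpc")
```

Dividing by the character count for every encoding keeps the figures comparable across columns, which is the convention the table follows.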
