Understanding Compression by Aleks Haecky, Colton McAnlis

Foreword

When I first began programming, I had no idea what data compression was nor why it mattered. Luckily, my Apple II Plus computer came with 0.000048 GB of memory (48 KB), which was quite a lot in 1979, and was enough to let me explore programming and computer graphics without realizing that my programs and data were constantly being compressed and decompressed behind the scenes in order to reduce their size in memory. Thanks, Woz!

After programming for a few years, I had discovered:

  • Data compression took time and could slow down my software.
  • Changing my data organization could make the compressed data smaller.
  • There was a bewildering variety of complicated data compression algorithms.

This led to the realization that compression was not a rigid black box; rather, it was a flexible tool that greatly influenced the quality of my software and could be manipulated in several ways:

  • Changing compression algorithms could make my software run faster.
  • Pairing my data organization with the right compression algorithm could make my data smaller.
  • Choosing the wrong data organization or algorithm could make my data larger (and/or run slower).

Ah! Now I knew why data compression mattered. If things weren’t fitting into memory or were decompressing too slowly, I could slightly change my data organization to better fit the compression algorithm. I’d simply put numbers together in one group, strings in another, build tables of recurring data types, or truncate fractions into integers. I didn’t need to do the hard work of evaluating and adopting new compression algorithms if I could fit my data to the algorithm.
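
To make that idea concrete, here is a minimal Python sketch (not from the book; the record data is made up) that compares zlib's output size for the same records stored two ways: interleaved, and grouped by field. Grouping similar values together tends to give the LZ stage longer, more regular matches, though the exact numbers depend entirely on the data.

    import zlib

    # Hypothetical game records: a short name plus two small numeric fields each.
    records = [("player%03d" % i, (i * 7) % 100, (i * 13) % 100) for i in range(1000)]

    # Layout 1: interleaved, with each record's mixed fields next to each other.
    interleaved = "".join("%s,%d,%d;" % r for r in records).encode()

    # Layout 2: grouped, with all names together, then all of field A, then all of field B.
    grouped = "|".join([
        ",".join(name for name, _, _ in records),
        ",".join(str(a) for _, a, _ in records),
        ",".join(str(b) for _, _, b in records),
    ]).encode()

    # Same information, same lossless compressor (zlib, i.e., LZ77 plus Huffman);
    # only the data organization differs.
    print("interleaved:", len(zlib.compress(interleaved)), "bytes")
    print("grouped:    ", len(zlib.compress(grouped)), "bytes")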

Then, I began making video games professionally, and most of the game data was created by not-so-technical artists, designers, and musicians. It turned out that math was not their favorite topic of discussion, and they were less than excited about changing the game data so that it would take advantage of my single go-to compression algorithm. Well, if the data organization couldn’t be improved, that left choosing the best compression algorithm to pair up with all of this great artistic data.

I surveyed the various compression algorithms and found there were a couple of broad categories suitable for my video game data:

Lossless
  • De-duplication (LZ)
  • Entropy (Huffman, Arithmetic)

Lossy
  • Reduced precision (truncation or decimation)
  • Image/video
  • Audio

For text strings and binary data, I used LZ to compress away repeating duplicate data patterns. Pixel data went through lossy vector quantization (VQ) to map pixels to a color palette. Audio data went through lossy decimation and linear predictive coding (LPC) to reduce the bits per second. The output of any of those compressors could then go into a lossless Huffman compressor for additional statistical entropy compression, if the CPU was fast enough.
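
That chain, a lossy reduction followed by a lossless pass, is easy to sketch. The Python fragment below is only an illustration: the pixel data is synthetic, a crude keep-the-top-four-bits quantizer stands in for real vector quantization, and zlib (LZ77 plus Huffman coding) stands in for the final lossless stage.

    import random
    import zlib

    random.seed(0)

    # Synthetic 8-bit "pixel" data: a repeating gradient with a little noise.
    pixels = bytes(min(255, (x % 256) + random.randint(0, 3)) for x in range(64 * 64))

    # Lossy step (a stand-in for vector quantization): keep only the top 4 bits
    # of each pixel, mapping every value onto one of 16 levels.
    quantized = bytes(p >> 4 for p in pixels)

    # Lossless step: zlib squeezes out the redundancy that remains.
    print("raw pixels, zlib-compressed:      ", len(zlib.compress(pixels)), "bytes")
    print("quantized pixels, zlib-compressed:", len(zlib.compress(quantized)), "bytes")

Whether that final lossless pass is worth it depends on how much CPU time decompression can afford, which is exactly the trade-off described above.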

During the 1980s and 1990s, I worked on about 30 games, most of which used these compression algorithms along with simple data build tools that performed limited optimizations of the data organization.

But then, around the year 2000, the situation became more complex. An ongoing arms race developed between the tools that generate data and the systems that display and analyze it, and the resulting flood of data has had consequences for software performance, storage size, network congestion, and the efficient pairing of compression algorithms with data organization.

This data flood has been partially offset by larger storage (Blu-ray discs, terabyte hard drives, cloud storage), faster multicore CPUs, new lossless compression algorithms such as BWT, ANS, and PAQ, as well as dramatic improvements in lossy codecs for image, video, and audio data. However, data sizes are growing faster each year and dwarf the slower improvements in network bandwidth, compression algorithms, and storage capacity.

Which brings us to the present and why this book matters.

How can a programmer learn which algorithms to use on their data and which data changes will help or hinder a particular algorithm? What would really help is an overview of the major data compression algorithms to guide developers through the myriad choices now available. Most developers don’t need to wade through all the theory and math details required to implement these algorithms; instead, they need a road map of the strengths and weaknesses of these algorithms, and how to take advantage of them for specific use cases.

I’ve greatly enjoyed implementing, using, and watching the evolution of data compression algorithms over the past 37 years. I hope this book will help demystify data compression and provide a starting point for software engineers to learn about compression algorithms and help them make better software.
