Chapter 1. Let’s Not Be Boring
Welcome to the first chapter of a book about a niche section of computing. We’re supposed to set the stage here for the entire book (that’s what the publisher says), and really hook the reader (that would be you). We’re expected to talk about history, a handful of basics, and anything else that we can do to try and ease you into the topic of compression as gently (but interestingly) as possible. Without math. Because math is hard.1
But let’s be real, that’s boring for you to read, and for us to write.
So here’s what we’re going to do, instead. This book is about compression. And compression is all about the most compact representation of data. So, we’re going to run through this introductory stuff in the shortest, most compressed form possible.
First, we’re going to talk about buckets. Then, we’re going to introduce you to this rebel named Claude Shannon, who pretty much ruined our lives while simultaneously creating every important thing that you love about computers. Finally, we are going to reveal to you the one essential thing you need to know about data compression. And without going out of our way (hardly!), we’ll make clear how compression pays off in better, cheaper, and faster apps.
Do we have a deal?
The Five Buckets of Compression Algorithms
Data compression algorithms cover a really, really big space. Fortunately, these algorithms fall into a few buckets, which makes things a lot easier to understand. To throw the words at you, they are variable-length codes, statistical compression, dictionary encodings, context modeling, and multicontext modeling. Each of these five high-level buckets contains a horde of algorithm variations, which is a good thing: each variation differs slightly in its intended input data, performance, memory constraints, and output sizes. Picking the correct variant means running tests on your data with the candidate encoders to find the one that works best.
Now, you can use these buckets together, because some buckets contain algorithms whose entire purpose is to transform the data so that another bucket can be more efficient at compressing it.
For you to be viewed as a compression guru, you need to understand the buckets, how they fit together, and what types of variants to use from which bucket for your own data sets.
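Testing doesn’t have to be a big production, either. Here’s a minimal sketch of the idea, using the general-purpose codecs that ship with Python’s standard library; the file name is just a hypothetical stand-in for your own data:

```python
import bz2
import lzma
import zlib

# Load whatever data you care about; "my_data.bin" is a hypothetical placeholder.
with open("my_data.bin", "rb") as f:
    sample = f.read()

# Compress the same bytes with three general-purpose codecs and compare sizes.
# Different data will rank these encoders differently, which is the whole point.
for name, codec in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    print(f"{name}: {len(codec.compress(sample))} bytes (from {len(sample)})")
```

Real-world testing also measures encode and decode time and memory use, but even a crude size comparison like this tells you a lot about your data.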
Let’s get started.
Claude Shannon Is Infuriating!
Back in the 1940s, a mathematician named Claude Shannon published several papers detailing research he had done for the military during World War II and, later, at Bell Labs.
Claude was a pretty smart guy (and very good at math). Before he left the University of Michigan in 1936, he’d racked up bachelor’s degrees in engineering and mathematics. He then went on to do a bunch of crazy post-graduate stuff at the Massachusetts Institute of Technology, and his master’s thesis, “A Symbolic Analysis of Relay and Switching Circuits”, became the foundation of modern electrical switch-based computing.
In 1948, Shannon published “A Mathematical Theory of Communication,” which detailed how best to encode the information a sender wants to transmit, thus inventing the entire field of information theory. Messages can be encoded in many ways (think “alphabet” or “Morse code”), but for every message there is a most efficient way to encode it, where “efficient” means using the fewest possible letters or symbols (or bits, or units of information). What “fewest” boils down to depends on the information content of the message. Shannon invented a way of measuring the information content of a message and called it information entropy.
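If you’d rather see that idea as code than as a formula, here’s a tiny sketch (our toy example, not Shannon’s notation) that measures the entropy of a message, treating each character as a symbol:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Average bits per symbol needed to encode `message`,
    assuming each symbol is coded independently of its neighbors."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A message with skewed symbol frequencies needs fewer bits per symbol.
print(shannon_entropy("AAAAABBBCC"))  # ~1.49 bits/symbol, versus 8 bits per ASCII character
```

The lower that number, the less information the message carries, and the further a compressor can squish it.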
Data compression is a practical application of Shannon’s research, which asks, “How compact can we make a message before we can no longer recover it?”2
So wait...why is he infuriating?
Well, although we can thank Mr. Shannon for helping to create the modern computers on which this book is being typed (and on which you’re most likely reading it), his work on information theory is precisely the thing we’re trying to defeat. You can look at data compression as a rebellion against information entropy. Every compression algorithm computer scientists write tries to disprove Claude Shannon’s research and compress the data further than its measured entropy. We scrape and pull and steal any bits we can from a message, to make it as small as possible, each time trying to break below Shannon’s definition of entropy and get to a new level of information understanding. Millions upon millions of hours of engineering time over the past 60 years have been dedicated solely to creating algorithms that defeat (or cleverly sneak around) a concept created by this brilliant man.
The Only Thing You Need to Know about Data Compression
OK, here’s what you need.
Data compression works via two simple ideas:
- Reduce the number of unique symbols in your data (smallest possible “alphabet”).
- Encode more frequent symbols with fewer bits (fewest bits for most common “letters”).
Boom. Done. That’s it.
Sixty years of compression research boiled down to two bullet points. Every single algorithm in data compression focuses on doing one of these two things. It transforms the data to be more compressible by shuffling or reducing the number of symbols, or it takes advantage of the fact that some symbols are more common than others, and encodes more common symbols with fewer bits.
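To see both bullets at work in one toy example (ours, not any particular codec’s), here’s a sketch that hand-assigns shorter bit patterns to the more common symbols of a tiny message:

```python
from collections import Counter

message = "ABAAABBC"          # hypothetical data: 'A' is common, 'C' is rare
print(Counter(message))       # Counter({'A': 4, 'B': 3, 'C': 1})

# A prefix-free, variable-length code: the most frequent symbol gets the fewest
# bits. A real encoder would derive this automatically (for example, Huffman coding).
codes = {"A": "0", "B": "10", "C": "11"}

fixed_bits = len(message) * 8                        # 8 bits per symbol, as in ASCII
variable_bits = sum(len(codes[s]) for s in message)  # frequent symbols cost less
print(fixed_bits, "->", variable_bits)               # 64 -> 12 bits
```

Everything else in this book is, one way or another, a more sophisticated spin on that swap.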
What makes applied data compression so complex is that there’s a gazillion ways to do these two things, depending on the kind of data you have. You’ll need to take the following considerations into account:
- Different data needs to be treated differently. Words in a book and floating-point numbers, for example, respond to very different algorithms.
- Some data can be transformed first to make it more compressible.
- Data might be skewed. For example, temperature data taken in summer might be skewed toward high temperatures; that is, it might contain a lot more high temperatures than near-freezing ones.
Your challenge as a programmer is to figure out the best way, or combination of ways, for compressing any block of data that a user throws at your application. And your challenge as a content developer is to figure out how to throw data at your users and not break their bank accounts.3
That, my dear adventurer, is what the rest of this book is about. It’s your field guide to understanding what in the compression world is worthy of your attention, and how the algorithms work conceptually, so that you can choose the right ones and apply them to your super awesome social/mobile/web/media application data.
A World Built on Data Compression
Let’s be clear about this: the computing world that you live in, right now, is built entirely on the back of data compression algorithms.
Yup. Every piece of it.
Every web page, image, song, cat video, streaming Internet movie, selfie, video game download, microtransaction, and OS update works only because of compression algorithms. In fact, you can’t throw a single bit of data around the Internet without running into some compressed content.
What’s so amazing about data compression technology is that it’s responsible for some of the largest changes in personal computing over the past 40 years, and no one knows about it.
For example, do you download or stream music instead of buying a CD? If so, you have compression to thank.
Music compression
See, in the mid-1990s, a joint working group (a bunch of smart people from different companies) unveiled the MP3 file format. This new audio format changed the nature of audio on computers. Until that point, the WAV file format had been the dominant, accepted format for creating, storing, and transferring audio data. Everyone used it, but the files were irresponsibly huge. A three-minute song weighed in at roughly 30 MB and could take well over an hour to download on the dial-up modems of the day.4 Forget about streaming!
The invention of MP3 meant that anyone could get a full-length, three-minute song as about 1–3 MB of data at impressive audio quality levels.5 Users could even plop CDs into their computers and convert an entire album to the MP3 format to listen to digitally.
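Those numbers are easy to sanity-check. Here’s the back-of-the-envelope arithmetic, assuming CD-quality audio (44.1 kHz, 16-bit samples, two channels) and a typical 128 kbps MP3; the exact bit rate is our assumption, not a property of the format:

```python
seconds = 3 * 60                         # a three-minute song

# Uncompressed CD-quality audio: samples/second * bytes/sample * channels * seconds.
wav_bytes = 44_100 * 2 * 2 * seconds     # about 31.8 million bytes

# A 128 kbps MP3 spends a fixed budget of bits for every second of audio.
mp3_bytes = 128_000 // 8 * seconds       # about 2.9 million bytes

print(f"WAV: ~{wav_bytes / 1e6:.1f} MB, MP3: ~{mp3_bytes / 1e6:.1f} MB")
```

That tenfold-or-better reduction is the entire story of why MP3 changed everything.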
This combination of smaller file size and good quality gave birth to one of the biggest consumer innovations of our time: Napster. This service made it possible for people to trade MP3s with one another, free of cost. Of course, this opened up a massive legal problem: folks would buy a CD, convert the audio to MP3, and then share it with their friends, who never had to pay for the original disc. As you can imagine, the companies that made money off CD sales were infuriated, and they did everything in their power to shut down the Napster service, which they eventually succeeded in doing.
And so, the late 1990s/early 2000s were riddled with legal battles and governmental policy changes attempting to stop this kind of music sharing. There was even legislation proposed that would make the use of the MP3 format illegal.
Apple, rather than fighting this new digital phenomenon, decided to build a product around it. In 2001, it launched the iPod, a portable device dedicated to storing and playing MP3 files. A couple of years later came the iTunes Store, where customers could legally purchase digital music for personal use.6
Today digital music distribution has become the new normal, with a plethora of companies trying to find better ways to sell music to you.
The massive success of the iPod product eventually led to the development and release of the iPhone device, changing the face of personal computing forever. (But that’s a different story.)
Image compression
Let’s cycle back in time a bit further, to the birth of the Internet. In the late 1970s, when the first connections of what would become the Internet were being strung together, the amount of data sent was pretty minimal. The small number of users primarily sent and received text, or images created entirely out of characters, as demonstrated in Figure 1-1.7
The issue at hand was that real image information, stored in 24-bits-per-pixel format, was entirely too hefty for early connection modems. So, of course, the compression gurus went right at it. To test their new image compression algorithms, they needed a corpus of images. Being part of a male-dominated industry, they might have had a bias toward gentlemen’s magazines for source material; thus, they ended up with the now famous Lena image (see Figure 1-2), a picture of Lena Söderberg from the pages of the November 1972 issue of Playboy magazine.
When they unveiled the results of their research, they used a cropped, PG-13 version of the image in their paper, and provided the original version for others to test their own compression algorithms on, as well. For a long time, Lena was the gold-standard corpus image for testing the majority of image compression algorithms. Thankfully, since then, less controversial image corpora have been created. (The Kodak company’s image test suite is our personal favorite.) However, Lena is often still included as a litmus test in many image compression papers today.
Video compression
Fast-forward to 2005 and the launch of YouTube, a website where users could upload any video they recorded, for free, for everyone to see.
Until this point, the dominant way of sending around video had been the MOV format, which in its early incarnations was little more than a series of JPEG images strung together in order. Unsurprisingly, the files were insanely large. So, the idea that you could just load a web page and watch a video was mind-boggling.
Genome mapping
In 2008, in an attempt to tackle disease and human mortality, scientists started mapping and testing human genomes at a large scale. A single genome sequence represents an enormous amount of data: more than 14 GB just to describe the makeup of a human. These data sizes were larger than most systems were able to handle (and cloud computing hadn’t become a big thing yet).
Compression, once again, came to the rescue. Researchers were able to find that BWT8 was the most efficient way to store DNA information in a compressed form, and they could even perform operations on it without having to decompress it first.
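We’ll save the details for Chapter 8, but here’s a naive sketch of the forward transform to make the idea concrete; this is a toy construction for short strings, not how genome-scale tools actually build it:

```python
def bwt(text: str, eos: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort every rotation of the string and
    keep the last column. The transform is reversible, and its output tends to
    cluster identical characters together, which later compression stages exploit."""
    text += eos                                    # unique end-of-string marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

# Repetitive, DNA-like input produces runs of identical characters in the output.
print(bwt("GATTACAGATTACA"))
```

That clustering is what makes the transformed data cheaper to encode, and clever indexing schemes built on top of it are what let researchers work with the data without decompressing it first.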
By 2014, researchers had created one of the fastest protein folders on the planet, combining scalable cloud computing and compressed data transfer between host computers.
Compression and the economy
So, you see, compression has been at the heart of many massive changes in computing technology and culture. The reason for this lies in simple economic theory: compressed files are smaller files. Meaning, it takes less time to transfer them, and it costs less to do so, as well. Distributors pay less to distribute, and customers pay less to consume. In a modern world in which computing time is literally money, compression represents the most economically viable way to shorten the gap between content distributors and content consumers.
1 Yep, that’s the last time we’ll say this.
2 It’s important to note that, according to modern information theory, there is a point at which removing any more bits makes it impossible to uniquely recover your data stream. So, our compression goal is to remove as many bits as possible to get to that point, and then remove no more.
3 Because data plans are metered and horribly expensive in most parts of the world.
4 Check out “The Web Back in 1996–1997” for a historical detour.
5 Note that MP3 is a lossy data compression format; that is, some information is lost during compression. We briefly talk about this type of data compression later in the book.
6 The first portable MP3 player was launched in 1997 by SaeHan Information Systems, and AT&T set up the first streaming service.
7 ASCII art was actually invented by creative folks with typewriters.
8 We’ll explore the Burrows–Wheeler transform in depth in Chapter 8.