Preface

Data compression is everywhere, and it’s as utterly essential for modern computing as it was when one megabyte was a lot and data was transferred in kilobits per second. In a sense, we have come full circle, from antique computers with limited memory and bandwidth, to mobile devices with limited memory and costly data plans.

Fortunately, there are tools, APIs, and packages that can compress your data for you. Understanding how they work will help you to choose the right compression for your data, which will directly translate into happier users, cost savings, and more revenue for you.

Data compression is built on math, and let’s face it, for most mere mortals, math is hard. Like, really hard. Like, it used to be one of the hardest things about being a programmer. Imagine Claude Shannon, the father of information theory, who was really good at math, hacking away at some chalkboard, scribbling rows and rows of crazy complex equations.

What’s even crazier is that modern programmers don’t really need to know math. Any eight-year-old kid in the world can jump online, work through a tutorial, and publish their own web page or application before ever enrolling in an algebra class.

And this is, we believe, why the field of data compression has been stagnant over the past 20 years. Despite the fact that two billion[1] people are on mobile devices, regularly struggling with limited memory and poor Internet connectivity, data compression remains a semi-stagnant technology. Because programmers don’t know math.

Because math is hard.

You see, compression isn’t really about data. The early founders of data compression weren’t thinking about data. They were thinking about statistics. They were looking for, and found, different ways to manipulate the probability distributions of symbols in data sets, and to exploit those patterns to produce smaller data sets that contained the same information.

As computing became more common and less mathematical, the average programmer needed to know less and less about statistics and other advanced math. And so, even though the early 2000s brought the largest technology boom in computing history, the entire field of data compression has seen maybe two or three real advances.

Because compression is hard.

Because it’s built on math.

Now, let’s be fair and practical here. Today, most programmers and content developers don’t need to know advanced math or understand how compression works, because they can just grab a decent data compression library, throw their data at it, and ship it out into the wild blue yonder.

However, moving forward, this is not going to be enough. Predictions are that by 2025, five billion humans will be using computers and transferring data over the Internet. Considering that data production has gone through the roof, we’re about to have too much data, with carriers that can’t transfer it fast enough, and data warehouses too small to hold it all. Of course, one solution is faster, better compression, using innovative algorithms that have yet to be invented.

Using math.

Which is hard.

The other solution is to teach anyone who will listen how compression works. So, instead of grabbing some random compression tool, you can choose the very best compression and get your data to your users in the most efficient manner.

That’s where this book comes from. It’s an attempt to take the vast, enormous craziness that is the science of data compression and reduce it to something that mere mortals can understand and apply to their daily data needs. We will try to explain the fundamentals of data compression with tables, diagrams, and data flows—and as little math as possible. Much like Colt’s YouTube video series Compressor Head, this book hopes to teach compression to pretty much anybody who’s survived high school—even if you’re not a programmer.

But let’s be honest: if you want to really understand this stuff, you’ll need to do the mental gymnastics. Compression, like riding a bicycle, is difficult until your mind goes “grok,” and then it all makes perfect sense. But you must stick it out and work through the examples to get there.

Just to be clear, our goal is not to make you a compression expert. That would require pretty heavy mathematics (which is hard!). Our goal is to help you understand how compression algorithms work. Sometimes, this will mean using the proper terminology; sometimes it will mean using wrong but far more descriptive terminology. We’re not trying to prepare you for water-cooler conversations with other compression folk. We want to give you enough information to make the right business decisions about compression.

Finally, and honestly, data compression is really cool. Well, we think so, and we hope that you’ll think so, too, as you delve into this book.

We had fun writing this book, and we hope you’ll have fun digging into the science of data compression.

How to Read This Book

Like any good story, this book answers all the W questions. What is data compression, and why do you want to know about it? When was data compression invented? Who are the people who dedicated their lives to eliminating a few more bits? Where in your product development cycles should you care about the size of your data? And most importantly, how does it all work to save you bits, money, and your users’ data plans?

This being an actual, printed volume of pages organized in sequential order, we strongly suggest you start at the beginning and work your way through each chapter. You see, each chapter builds on the previous one, not just historically, but also by introducing terminology and evolving the algorithms. We built the book to be read in order, and the easiest way through the material is to follow that path.

How to Read This Book Backwards

If you are the kind of person whose eyes sparkle more at the sight of money than intricate algorithms, you may read this book backwards. Let yourself be thoroughly convinced that data compression is the most amazing thing since sliced bread (with butter!), and then, energized by those convictions, tackle the work of understanding how compression actually works (because understanding it will result in, you guessed it, more money). Ready?

Chapter Synopsis

Let’s face it, chapter synopses are boring, and we promised you that, like William Goldman in The Princess Bride, we’d stick to just the juicy bits. So, while we ask you not to skip Chapter 2, feel free to skip this synopsis. Skim the table of contents to find out about, well, the contents of this book, or just go ahead and read the book, since you are already holding it in your hands. But if you want to dip your toes into the topic before taking the plunge, here is a little warning about what you are getting yourself into.

Chapter 1, Let’s Not Be Boring

In this chapter we tell you everything you need to know if you don’t have time to actually read this book. We can divide compression algorithms into five buckets: variable-length codes, statistical compression, dictionary encodings, context modeling, and multicontext modeling. Claude Shannon invented a way of measuring the information content of a message and called it Information Entropy. The whole point of compression is to encode data in the fewest possible bits. The foundation of the whole Internet is data compression. You should compress all your data. That’s it. You can stop reading now.

Chapter 2, Do Not Skip This Chapter

Don’t skip this chapter because it lays out the fundamentals, such as how to represent the whole world in zeros and ones, an introduction to Information Theory, and Entropy as the Minimum Bits Needed to Represent a Number.
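To give you an early taste (this sketch is ours, not the book’s), here is how you might compute the Shannon entropy of a message in Python—that is, the average number of bits each symbol really needs:

    import math
    from collections import Counter

    def shannon_entropy(data):
        # Average bits per symbol needed to encode `data`.
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(shannon_entropy("AAAAABBBC"))  # ~1.35 bits per symbol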

Chapter 3, Breaking Entropy

According to Claude Shannon, Entropy puts a limit on how small you can make a data set. Compression is all about breaking this limit by exploiting two properties of real data: ordering and relationships between symbols.
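For a quick flavor of what “exploiting ordering” can mean (our example, not the book’s): a rising sequence of numbers looks incompressible symbol by symbol, but the differences between adjacent values are small and repetitive:

    numbers = [100, 101, 102, 104, 104, 105]
    # Keep the first value, then store only the difference to the previous one.
    deltas = [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]
    print(deltas)  # [100, 1, 1, 2, 0, 1] -- tiny, repeated values that encode cheaply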

Chapter 4, Variable-Length Codes

You’ll learn how to string together 0s and 1s to make unique, variable-length codewords, and then assign the shortest ones to the most probable symbols in the data set. And you’ll meet Peter Elias.
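As a tiny, hand-rolled illustration (the codewords below are ours, chosen only to be prefix-free):

    # Frequent symbols get short codewords; rare ones get longer codewords.
    vlc = {"A": "0", "B": "10", "C": "110", "D": "111"}

    def encode(text):
        return "".join(vlc[ch] for ch in text)

    print(encode("AAAB"))  # "00010" -- 5 bits instead of the 8 a fixed 2-bit code needs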

Chapter 5, Statistical Encoding

One size never fits all, and statistical encoders create custom VLCs that are optimized for particular data sets. You’ll build a Huffman tree using sticky notes, explore arithmetic coding, and meet Jarek Duda, who usurped them both by introducing asymmetric numeral systems (ANS).
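If you’d rather preview it in code than sticky notes, here is a compact Huffman-table builder (our sketch, not the book’s exercise):

    import heapq
    from collections import Counter

    def huffman_codes(data):
        # Heap entries: [weight, tiebreaker, tree]; a tree is a symbol or a (left, right) pair.
        heap = [[w, i, sym] for i, (sym, w) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            heapq.heappush(heap, [lo[0] + hi[0], i, (lo[2], hi[2])])
            i += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("AAAAABBBC"))  # {'C': '00', 'B': '01', 'A': '1'}

Note how “A”, the most frequent symbol, ends up with the shortest codeword.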

Chapter 6, Adaptive Statistical Encoding

Real data streams change, and adaptive encoders optimize themselves according to the local properties of the data they are processing.
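A minimal way to picture this (our toy model, assuming a 256-symbol byte alphabet):

    from collections import Counter

    class AdaptiveModel:
        # Probability estimates update as symbols arrive, so the model tracks the stream.
        def __init__(self):
            self.counts = Counter()
            self.total = 0

        def probability(self, symbol):
            # Add-one smoothing: symbols never seen still get a nonzero probability.
            return (self.counts[symbol] + 1) / (self.total + 256)

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    model = AdaptiveModel()
    for byte in b"mississippi":
        # An encoder would code `byte` in about -log2(probability) bits, then adapt.
        model.update(byte)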

Chapter 7, Dictionary Transforms

If you can’t compress your data any further as is, you can preprocess it into a more compressible form by finding “words” and encoding how they repeat in the data set, before applying statistical compression.
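A toy version of the idea (ours; real LZ77/LZ78 variants are much cleverer about what counts as a “word”):

    def dictionary_transform(text):
        dictionary, tokens = {}, []
        for word in text.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
            tokens.append(dictionary[word])
        return dictionary, tokens

    d, t = dictionary_transform("to be or not to be")
    print(d)  # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
    print(t)  # [0, 1, 2, 3, 0, 1] -- the repeats are now cheap little indices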

Chapter 8, Contextual Data Transforms

There are as many different transforms as there are data sets, but in modern computing there are a couple of big ones that matter the most: run-length encoding and delta encoding. What they have in common is that they consider what has come before to determine what is going to happen next.
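Run-length encoding, for example, fits in a few lines of Python (our sketch):

    from itertools import groupby

    def rle(data):
        # Store each symbol once, together with the length of its run.
        return [(symbol, len(list(run))) for symbol, run in groupby(data)]

    print(rle("AAAABBBCCD"))  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]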

Chapter 9, Data Modeling

Multicontext encoders take into account the last few observed symbols in order to identify the ideal number of bits for encoding the current symbol. Think of it as creating a brand-new Huffman tree each time you read in a new symbol. The fancy words include Markov chains, prediction by partial matching, and tries. Use them to impress your friends.
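Here is the smallest possible flavor of the idea (our sketch of an order-1 model, where the “context” is just the previous symbol):

    from collections import Counter, defaultdict

    contexts = defaultdict(Counter)
    text = "banana"
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1

    print(contexts["a"])  # Counter({'n': 2}) -- after 'a', 'n' is a near certainty,
    print(contexts["n"])  # Counter({'a': 2}) -- so both can be encoded in almost no bits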

Chapter 10, Switching Gears

This chapter takes a short excursion into media-specific compression and opens the door to becoming practical about your data.

Chapter 11, Evaluating Compression

For every data stream, there’s a different compression algorithm. And for every use case, there’s a...different compression algorithm. Learn about all the things you need to consider before picking what’s best for your needs.

Chapter 12, Compressing Image Data Types

If you’re an application developer, chances are that the bulk of your content is image data. Save bits in images by exploiting patterns and cleverly omitting information in a way that the human brain won’t notice. We’ll look at PNG, JPG, GIF, and WebP.

Chapter 13, Serialized Data

Serialized content is the second most common data format you’ll be sending around in your networked applications. Kick JSON and XML in favor of binary serialization formats.
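To see why (a quick size check of ours; the exact byte counts depend on your field names and value ranges):

    import json
    import struct

    values = (1234, 5678, 9012)
    as_json = json.dumps({"x": values[0], "y": values[1], "z": values[2]}).encode()
    as_binary = struct.pack("<3H", *values)  # three little-endian unsigned 16-bit integers

    print(len(as_json), len(as_binary))  # 33 bytes vs. 6 bytes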

Chapter 14, Lossy Data Compression

Actually, that’s a different book.

Chapter 15, Making the World a Little Smaller

Why you must care about data compression to grow your audience, your business, and your bottom line.

[1] This is the number as of 2015. If you’re reading this book some time in the future, it will be different. Also, thanks for reading the book! Also, how did you survive the robo-apocalypse?
