Chapter 1. In The Beginning…
Cloud computing has come a long way.
Just a few years ago there was a raging religious debate about whether people and projects would migrate en masse to public cloud infrastructures. Thanks to the success of providers like AWS, Google, and Microsoft, that debate is largely over.
In the “early days” (three years ago), managing a web-scale application meant doing a lot of tooling on your own. You had to manage your own VM images, instance fleets, load balancers, and more. It got complicated fast. Then, configuration management tools like Chef, Puppet, Ansible, and Salt caught up to the problem and things got a little bit easier.
A little later (approximately two years ago) people started to really feel the pain of managing their applications at the VM layer. Even under the best circumstances it takes a brand new virtual machine at least a couple of minutes to spin up, get recognized by a load balancer, and begin handling traffic. That’s a lot faster than ordering and installing new hardware, but not quite as fast as we expect our systems to respond.
Then came Docker.
Just In Case…
If you have no idea what containers are or how Docker helped make them popular, you should stop reading this paper right now and go here.
So now the problem of VM spin-up times and image versioning has been seriously mitigated. All should be right with the world, right? Wrong.
Containers are lightweight and awesome, but they aren’t full VMs. That means that they need a lot of orchestration to run efficiently and resiliently. Their execution needs to be scheduled and managed. When they die (and they do), they need to be seamlessly replaced and re-balanced.
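To make that orchestration job concrete, here is a minimal sketch of the kind of reconciliation loop an orchestrator runs, assuming a toy in-memory model (the names and the simulated health check are illustrative, not any real system’s API): compare the observed state of the cluster to the desired state and correct the difference.

```python
import random

DESIRED_REPLICAS = 3  # how many copies of the container we want running


def reconcile(running):
    """One pass of an orchestrator's control loop.

    `running` is a list of container IDs believed to be alive. We detect
    which ones have died and replace whatever is missing so the cluster
    converges back on the desired count.
    """
    # Hypothetical health check: some containers have died since last pass.
    alive = [c for c in running if random.random() > 0.2]
    # Replace the dead ones immediately.
    while len(alive) < DESIRED_REPLICAS:
        alive.append(f"container-{random.randrange(10**6)}")
    return alive


state = [f"container-{i}" for i in range(DESIRED_REPLICAS)]
for _ in range(5):  # the real loop never stops
    state = reconcile(state)
    assert len(state) == DESIRED_REPLICAS
```

Notice that the loop never tries to keep any individual container alive; it only keeps the *count* right. That is the design-for-failure principle in miniature.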
This is a non-trivial problem.
In this book, I will introduce you to one of the solutions to this challenge—Kubernetes. It’s not the only way to skin this cat, but getting a good grasp on what it is and how it works will arm you with the information you need to make good choices later.
Who I Am
Full disclosure: I work for Google.
Specifically, I am the Director of Global Cloud Support and Services. As you might imagine, I very definitely have a bias towards the things my employer uses and/or invented, and it would be pretty silly for me to pretend otherwise.
That said, I used to work at their biggest competitor—AWS—and before that, I wrote a book for O’Reilly on Cloud Computing, so I do have some perspective.
I’ll do my best to write in an evenhanded way, but it’s unlikely I’ll be able to completely stamp out my biases for the sake of perfectly objective prose. I promise to keep the preachy bits to a minimum and keep the text as non-denominational as I can muster.
If you’re so inclined, you can see my full bio here.
Finally, you should know that the words you read are completely my own. This paper does not reflect the views of Google, my family, friends, pets, or anyone I now know or might meet in the future. I speak for myself and nobody else. I own these words.
So that’s me. Let’s chat a little about you…
Who I Think You Are
For you to get the most out of this book, I need you to have accomplished the following basic things:
Have spun up at least three instances in somebody’s public cloud infrastructure—it doesn’t matter whose. (Bonus points if you’ve deployed behind a load balancer.)
Have read and digested the basics about Docker and containers.
Have created at least one local container—just to play with.
If any of those things are not true, you should probably wait to read this paper until they are. If you don’t, then you risk confusion.
Containers are really lightweight. That makes them super flexible and fast. However, they are designed to be short-lived and fragile. I know it seems odd to talk about system components that are designed to not be particularly resilient, but there’s a good reason for it.
Instead of making each small computing component of a system bullet-proof, you can actually make the whole system a lot more stable by assuming each compute unit is going to fail and designing your overall process to handle it.
All the scheduling and orchestration systems gaining mindshare now—Kubernetes or others—are designed first and foremost with this principle in mind. They will kill and re-deploy a container in a cluster if it even thinks about misbehaving!
This is probably the thing people have the hardest time with when they make the jump from VM-backed instances to containers. You just can’t have the same expectation for isolation or resiliency with a container as you do for a full-fledged virtual machine.
The comparison I like to make is between a commercial passenger airplane and the Apollo Lunar Module (LM).
An airplane is meant to fly multiple times a day and ferry hundreds of people long distances. It’s made to withstand big changes in altitude, the failure of at least one of its engines, and seriously violent winds. Discovery Channel documentaries notwithstanding, it takes a lot to make a properly maintained commercial passenger jet fail.
The LM, on the other hand, was basically made of tin foil and balsa wood. It was optimized for weight and not much else. Little things could (and, during design and construction, did) easily destroy it. That was OK, though. It was meant to operate in a near vacuum and under very specific conditions. It could afford to be lightweight and fragile because it only operated under very orchestrated conditions.
Any of this sound familiar?
VMs are a lot like commercial passenger jets. They contain full operating systems—including firewalls and other protective systems—and can be super resilient. Containers, on the other hand, are like the LM. They’re optimized for weight and therefore are a lot less forgiving.
In the real world, individual containers fail a lot more than individual virtual machines. To compensate for this, containers have to be run in managed clusters that are heavily scheduled and orchestrated. The environment has to detect a container failure and be prepared to replace it immediately. The environment has to make sure that containers are spread reasonably evenly across physical machines (so as to lessen the effect of a machine failure on the system) and manage overall network and memory resources for the cluster.
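To make the “spread reasonably evenly” idea concrete, here is a minimal sketch of a spreading placement policy (a toy illustration under my own assumptions, not Kubernetes’ actual scheduling algorithm): each container goes to the machine currently running the fewest containers, so losing any one machine takes out the smallest possible slice of the workload.

```python
def place(containers, machines):
    """Assign each container to the least-loaded machine (spreading)."""
    load = {m: [] for m in machines}
    for c in containers:
        # The machine with the fewest containers wins the next placement.
        target = min(load, key=lambda m: len(load[m]))
        load[target].append(c)
    return load


placement = place([f"web-{i}" for i in range(7)],
                  ["node-a", "node-b", "node-c"])
# Seven containers across three machines land 3/2/2, so a single
# machine failure costs at most three of the seven copies.
```

A real scheduler also weighs memory, CPU, and network pressure on each machine, but the spreading instinct is the same.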
It’s a big job, and well beyond the abilities of normal configuration management tools like Chef and Puppet.