Managing virtual machines gets cumbersome at scale, so it probably won’t surprise you that Google hit this problem pretty early on, nearly ten years ago in fact. If you’ve ever had to manage more than a few dozen VMs, you know the pain firsthand. Now imagine managing and coordinating millions of them.
At that scale, you start to rethink the problem entirely, and that’s exactly what happened. If your plan for scale is to have a staggeringly large fleet of identical things that can be interchanged at a moment’s notice, does it really matter if any one of them fails? Just mark it as bad, clean it up, and replace it.
Through that lens, the challenge shifts from configuration management to isolation, orchestration, and scheduling. A failure of one computing unit cannot take down another (isolation), resources should be reasonably well balanced geographically to distribute load (orchestration), and you need to detect and replace failures nearly instantaneously (scheduling).
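To make the "mark it as bad, clean it up, and replace it" idea concrete, here is a minimal sketch in Python. It is not how Google's systems work; the `Unit` class and the `reconcile` function are hypothetical names invented for illustration, and a real scheduler would also handle placement, health probing, and much more.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical ID source for new units


@dataclass
class Unit:
    """A hypothetical interchangeable compute unit (a VM or a container)."""
    id: int = field(default_factory=lambda: next(_ids))
    healthy: bool = True


def reconcile(fleet, desired_size):
    """Drop unhealthy units and add fresh ones until the fleet is back at size."""
    # Mark bad units and clean them up by keeping only the healthy ones.
    survivors = [u for u in fleet if u.healthy]
    # Replace what was lost; range() is empty if nothing needs replacing.
    replacements = [Unit() for _ in range(desired_size - len(survivors))]
    return survivors + replacements


fleet = [Unit() for _ in range(5)]
fleet[1].healthy = False  # simulate two failures
fleet[3].healthy = False
fleet = reconcile(fleet, desired_size=5)
print([u.id for u in fleet])
```

No unit is repaired in place. Because every unit is identical, the failed ones are simply discarded and new ones take their place, which is the shift in thinking described above.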
Around the same time, engineers working at companies with similar scaling problems started playing around with smaller units of deployment, using cgroups (which limit the resources a group of processes can consume) and kernel namespaces (which restrict what those processes can see) to create process separation. Over time, these efforts produced what we now commonly refer to as containers.
Google necessarily had to create a lot of orchestration and scheduling software to handle isolation, ...