Chapter 3. Case Study 2: Diskless

From 2012 through 2018, Technical Infrastructure (TI) teams rolled out a Google-wide change to production: to remove local disk storage for all jobs and move toward Diskless compute nodes, aka cloud disks.1 Such resource disaggregation reduced cost through improved server platform availability, tail latency, and disk utilization. This move to “prod without disk,” the vision of Technical Leads Eric Brewer, Luiz Barroso, and Sean Quinlan, was principally motivated by the following considerations:

  • The performance of a spinning disk was growing at a slower rate than that of CPU, SSD, or networking. Over time, the amount of storage “trapped” behind an interface increased faster than the speed of that interface.

  • Network-attached storage let compute migrate across machines without losing access to its data, which greatly accelerated the physical maintenance of machines. Storage-specific hardware eliminated the barrier to adopting network-attached storage.

  • Separating compute and storage devices improved tail latency. Previously, hundreds of jobs competed for limited disk quota and bandwidth at the same time. In a Diskless world, you scheduled I/O with quotas and issued parallel reads; that is, you sent three read requests in parallel, used the first response that came back, and canceled the others (“best of three”), as sketched in the example after this list.

  • Independently provisioning compute and storage on different cycles improved datacenter total cost of ownership (TCO) and the ability to scale.

  • On shared ...
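
The “best of three” hedged read mentioned above can be illustrated with a short Go sketch. The readReplica function, its simulated latencies, and the replica count are hypothetical placeholders standing in for a real network read against a storage replica; the point is the pattern of issuing parallel requests, taking the first success, and canceling the rest.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // readReplica is a hypothetical stand-in for one network read against a
    // single replica of a block; latencies are simulated for the example.
    func readReplica(ctx context.Context, replica int) ([]byte, error) {
        latency := time.Duration(10*(replica+1)) * time.Millisecond
        select {
        case <-time.After(latency):
            return []byte(fmt.Sprintf("data from replica %d", replica)), nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    // hedgedRead issues the same read against several replicas in parallel,
    // returns the first successful response, and cancels the slower requests.
    func hedgedRead(ctx context.Context, replicas int) ([]byte, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // cancel the losers once a winner returns

        type result struct {
            data []byte
            err  error
        }
        results := make(chan result, replicas) // buffered so late finishers never block
        for r := 0; r < replicas; r++ {
            go func(r int) {
                data, err := readReplica(ctx, r)
                results <- result{data, err}
            }(r)
        }

        var lastErr error
        for i := 0; i < replicas; i++ {
            res := <-results
            if res.err == nil {
                return res.data, nil // first success wins
            }
            lastErr = res.err
        }
        return nil, lastErr
    }

    func main() {
        data, err := hedgedRead(context.Background(), 3) // "best of three"
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s\n", data)
    }

The buffered channel lets the canceled reads finish without leaking goroutines, and the shared context is what actually tears down the slower in-flight requests once the fastest replica responds.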
