Preface
This book was born out of a production-readiness initiative I began running several months after I joined Uber Technologies as a site reliability engineer (SRE). Uber’s gigantic, monolithic API was slowly being broken into microservices, and at the time I joined, there were over a thousand microservices that had been split from the API and were running alongside it. Each of these microservices was designed, built, and maintained by an owning development team, and over 85% of these services had little to no SRE involvement, nor any access to SRE resources.
Hiring SREs and building SRE teams is an absurdly difficult task, because SREs are probably the hardest type of engineer to find: site reliability engineering is still a relatively new field, and SREs must be experts (at least to some degree) in software engineering, systems engineering, and distributed systems architecture. There was no way to quickly staff every development team with its own embedded SRE team, and so my team (the Consulting SRE Team) was born. Our directive from above was simple: find a way to drive high standards across the 85% of microservices that had little to no SRE involvement.
Our mission was simple, and the directive was vague enough to give me and my team considerable freedom to define a set of standards that every microservice at Uber could follow. Coming up with high standards that could apply to every single microservice running within this large engineering organization was not ...