Mid-sized organizations often find that a relatively small number of engineers must develop and run a relatively large number of diverse features.
SoundCloud has grown into exactly that situation. With each new feature added to the original monolithic Ruby on Rails code base, adding the next feature became more difficult. So around 2012, we began a gradual move to a microservices architecture. SoundCloud engineers have talked a lot about the various challenges that needed to be tackled for such a move to succeed.¹ In this chapter, we explore lessons learned from reliably running hundreds of services at SoundCloud with a much smaller number of engineers.
In 2012, SoundCloud happened to hire a couple of former Google SREs. Although dramatically smaller in scale, SoundCloud was moving toward technological patterns not so different from what larger internet companies had been doing for a while. By extension, it was an obvious move to also run those systems in the same way Google does. We tried “SRE by the book,” except that back then there was no actual book.
What is the smallest reasonable size of an SRE team? Because SREs ought to be on-call, the team needs to be large enough for at least one on-call rotation. Following the best practices for on-call ...