Chapter 6. Operating Platforms
Rare things become common at scale.
Jason Cohen1
No matter how well you build a platform, the systems it depends on are complex, so it will inevitably have operational issues. As useful as the product mindset is for platforms, product-focused teams can underinvest in operations when times are good—they move fast and deliver lots of great features, but pile up operational debt along the way. A successful application team might be able to get away with this, because their contributions to the business’s top line are rewarded with extra headcount, which makes it possible to stay ahead of the debt. But that’s not the situation most platform teams are in.
Platforms create their value through leverage, and one aspect of leverage is efficiency—supporting substantially more scale without needing to hire more people into the platform team. However, as this chapter’s introductory quote suggests, this is in conflict with the fact that systems often run into new problems just because of scale, particularly operationally. This means constant-sized teams supporting scaling platforms can wind up in “operational hell,” where neglected operational problems start having ongoing acute business impact, eroding customer trust. As the system is handling critical load at scale, it can take months to remediate the acute impact and years to address the core issues, and all the while new product features are stalled.
To avoid this, platform teams need to routinely invest ...
Get Platform Engineering now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.