Chapter 12. Safely Operating at Scale

Who

  • Engineers

  • Enabling Team Leads

  • Technical Leads

Why

Once an application that your users care about is running on a cloud platform, you need to consider its reliability. After all, it doesn’t matter how good the application is if your users can’t depend on it. This is why we advocate a user-centric approach: will your customers care that the product has exceptional design and great features if they can’t trust that it works?

At Google, we consider reliability an engineering discipline equal in importance to software engineering itself. This work, however, is typically undertaken by engineers who demonstrate a particular proclivity toward large-scale architectural and systems thinking. Furthermore, we treat all these aspects of engineering as an intrinsic cost of delivering our services, so the work is intentionally long-running. This regular cadence of improvement and reduction of technical debt ensures the longevity of our services. We believe this is a critical element of any cloud transformation journey.

How

Once you decide that you must care about the reliability of a specific service, the first critical step is to measure and assess the user experience of that service. There are myriad ways to quantify reliability. On the web, the simplest starting point is to measure the error rate and latency as seen by your users.
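To make that starting point concrete, here is a minimal sketch in Python of computing an error rate and a latency percentile from a batch of request records. The `Request` type and the `error_rate` and `latency_percentile` helpers are hypothetical names introduced for illustration, not part of any library. The sketch also assumes that only HTTP 5xx responses count as user-visible errors and that a nearest-rank percentile is good enough; your own service may need different definitions.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """One request as observed at the edge (hypothetical record shape)."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request, in milliseconds


def error_rate(requests: Iterable[Request]) -> float:
    """Fraction of requests that failed. Assumption: 5xx means failure."""
    reqs = list(requests)
    if not reqs:
        return 0.0
    errors = sum(1 for r in reqs if r.status >= 500)
    return errors / len(reqs)


def latency_percentile(requests: Iterable[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, with pct in the range (0, 100]."""
    latencies = sorted(r.latency_ms for r in requests)
    if not latencies:
        raise ValueError("no requests to measure")
    # Nearest-rank method: the smallest value with at least pct% of
    # observations at or below it.
    rank = max(1, math.ceil(pct / 100 * len(latencies)))
    return latencies[rank - 1]


# Made-up sample data, for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(500, 30.0),
    Request(200, 410.0),
    Request(503, 2000.0),
]

print(error_rate(sample))              # 0.4
print(latency_percentile(sample, 50))  # 120.0
print(latency_percentile(sample, 95))  # 2000.0
```

Percentiles are used here rather than the mean because a handful of very slow requests can leave the average looking healthy while a meaningful share of users has a poor experience.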

The ...
