Chapter 4. Culture, Governance, and Organization

As previously discussed, striving to be Always On demands strategic decision making from the highest levels of an organization. It requires adopting a modern approach to governance and a new operating model.

Mature CIO organizations establish business service classification review boards to determine the appropriate service tier and the end-to-end SLO required for a specific mission-critical business service. They consider factors such as service criticality, continuous reliability expectations, development time, operational expenses, and the financial consequences of potential outages.

When organizations discuss promoting the need for reliable services, they often encounter doubts about whether it is feasible from both business and technical perspectives. It is vital, then, to foster a culture of continuous operations, better known today as site reliability engineering (SRE), a framework that brings together business, architects, application owners, development, and operations teams.

Once the need for Always On has been socialized across the organization, site reliability engineers evaluate and approve proposals to implement Always On for a nominated business service. They conduct a fit analysis to ensure alignment with existing Always On patterns and guiding principles, accelerating the delivery of customer-focused reliable services and improving decision making.

In this chapter, I briefly describe SRE and chaos engineering, concepts that are instrumental not only in operating mission-critical services but also in testing, validating, and building confidence in their reliability.

Site Reliability Engineering

Site reliability engineering, pioneered by Google, is an engineering discipline that focuses on applying software engineering principles and practices to operations. In an Always On environment, SREs are tasked with establishing and adhering to end-to-end SLOs, which often entails striking a balance between development velocity (such as new deployments, changes, and new application features) and reliability, all while maintaining a customer-centric mindset. SREs implement automation, observability, testing, and incident response strategies to ensure that applications meet these objectives and continuously improve over time.

The journey toward adopting SRE starts with a cultural shift in the organization, emphasizing the importance of a blameless postmortem culture. This approach focuses on learning from incidents and improving systems rather than assigning blame. By fostering collaboration among development, operations, and other cross-functional teams, it becomes easier to dismantle silos and align everyone’s efforts toward common objectives. Ultimately, an SRE culture cultivates a mindset of continuous improvement, enabling teams to proactively detect and address potential reliability concerns at an early stage.

When managing Always On services, SRE teams must have a comprehensive understanding of both the functioning and potential failure points of the end-to-end application flow and underlying infrastructure. This means that the SRE team responsible for life-cycle operations should be actively involved in the design, implementation, testing, and validation stages. This involvement ensures prompt feedback and corrective measures if the design fails to meet all the guiding principles for Always On.

SRE is a broad topic, and I recommend exploring the literature that Google regularly maintains for a deeper understanding. For the scope of this report, however, I will focus on the following specific areas:

  • Build to manage

  • Error budgets

  • Observability and proactive service management

  • Declarative and continuous deployment

  • Graceful location scope–based updates

Let’s explore each of these in more detail.

Build to Manage

Build to manage refers to the practice of designing and developing systems with manageability, observability, and reliability as integral components from the very beginning. It encompasses a collection of manageability features incorporated within the application as part of the development and release process.

The build-to-manage approach deserves an extensive and detailed examination. While I will highlight only some of its crucial aspects here, the following related concepts and strategies are essential to understand:1

Log format and catalog

Composing well-structured log messages is crucial to capture pertinent and consistent information at runtime. Logs should effectively convey the who, when, where, and what, along with a severity ranking and a well-defined timestamp (see the structured-logging sketch at the end of this list).

Deployment correlation

By utilizing deployment markers, it is possible to indicate deployment activities on the same chart or timeline that displays reliability metrics. This enables SREs to visually correlate reliability issues with a recent deployment of a new version, making it easier to identify the cause of any issues.

Runbooks and knowledge base

A knowledge base serves as a central repository for storing information and runbooks relevant to troubleshooting or resolving issues related to an incident. Entries may include troubleshooting runbooks and the steps taken by SREs during a resolution process. Ideally, however, runbooks should be automated and executable by first responders, rather than remaining static documentation.

Concurrent versioning

Multi-instance deployments enable running distinct versions of an application in separate location scopes, facilitating canary testing and coordinated database schema changes. For this to work safely, teams can leverage feature flags, a technique that allows SREs and developers to turn specific features of an application on or off without changing the codebase (a minimal sketch follows this list).
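To make the feature-flag idea concrete, here is a minimal Python sketch. It assumes flags are supplied through environment variables and uses hypothetical tax-calculation functions; production systems typically rely on a dedicated flag service so flags can be flipped at runtime per location scope:

    import os

    def is_enabled(flag_name: str, default: bool = False) -> bool:
        # Read the flag from an environment variable; a real flag service
        # would allow changing it at runtime without redeploying.
        value = os.environ.get(f"FEATURE_{flag_name.upper()}", str(default))
        return value.lower() in ("1", "true", "yes", "on")

    def calculate_tax_v1(amount: float) -> float:
        return round(amount * 0.10, 2)   # existing behavior (hypothetical)

    def calculate_tax_v2(amount: float) -> float:
        return round(amount * 0.12, 2)   # new code path being dark-launched (hypothetical)

    def handle_checkout(amount: float) -> float:
        # The new tax engine ships in the same build but stays off until the
        # flag is enabled for a given location scope.
        if is_enabled("new_tax_engine"):
            return calculate_tax_v2(amount)
        return calculate_tax_v1(amount)

    print(handle_checkout(100.0))   # 10.0 unless FEATURE_NEW_TAX_ENGINE is set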
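Similarly, the following minimal sketch illustrates a structured log format that captures the who, when, where, and what, along with a severity and a well-defined timestamp. The field names and the JSON encoding are illustrative assumptions; a real log catalog standardizes these fields across teams:

    import json
    import logging
    import sys
    from datetime import datetime, timezone

    class JSONFormatter(logging.Formatter):
        """Render each record as a structured JSON log line."""

        def format(self, record):
            entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),  # when
                "severity": record.levelname,                         # severity ranking
                "service": getattr(record, "service", "unknown"),     # where
                "user": getattr(record, "user", "system"),            # who
                "action": getattr(record, "action", None),            # what
                "message": record.getMessage(),
            }
            return json.dumps(entry)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger("payments")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # The who, when, where, and what are captured as structured fields.
    logger.info("order accepted",
                extra={"service": "payments", "user": "alice", "action": "create_order"})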

Error Budgets

SRE incorporates the concept of error budgets to balance the goal of frequent, agile, and rapid development against the need to maintain reliability. Error budgets provide a quantitative measure to manage risk and make informed decisions about stability, feature development, and deployment velocity.

An error budget is derived from the end-to-end SLO. For example, if a service has an end-to-end SLO of 99.99%, it means that it has a 0.01% allowable downtime, which is the error budget. This is then used to drive decisions and trade-offs between different teams in an organization. For example, if a service is well within its error budget, the development team may decide to push new features more aggressively. Alternatively, if the service is approaching or exceeding its error budget, the development team might slow down new releases and focus on improving stability and reliability instead.
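As a worked example, the following Python snippet converts a 99.99% end-to-end SLO into a concrete downtime budget. The rolling 30-day window and the observed downtime figure are assumptions for illustration; window lengths vary by organization:

    # Convert an end-to-end SLO into a concrete error budget for a rolling
    # 30-day window.
    SLO = 0.9999
    WINDOW_DAYS = 30

    window_minutes = WINDOW_DAYS * 24 * 60              # 43,200 minutes
    error_budget_minutes = window_minutes * (1 - SLO)

    print(f"Allowed downtime: {error_budget_minutes:.1f} minutes per {WINDOW_DAYS} days")
    # Allowed downtime: 4.3 minutes per 30 days

    # Budget consumed so far drives release decisions: plenty left -> ship faster;
    # nearly spent -> slow risky changes and focus on reliability work.
    observed_downtime_minutes = 2.5                      # hypothetical measurement
    budget_remaining = 1 - observed_downtime_minutes / error_budget_minutes
    print(f"Error budget remaining: {budget_remaining:.0%}")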

By utilizing error budgets, SREs take responsibility (and accountability) for upholding the various SLOs in place.

Observability and Proactive Service Management

Observability is a key aspect of managing Always On applications, as it provides the necessary visibility into application behavior and performance. Observability practices include collecting and analyzing logs, metrics, and traces from applications, infrastructure components, and end-to-end transactions to identify trends, detect anomalies, and uncover the root causes of issues.

Traditionally, operations teams may only monitor infrastructure services and respond to failures, but SREs are paranoid: they monitor all aspects of a business service’s reliability and performance, both internally and externally. By monitoring, trending, and correlating every aspect of the business service, abnormalities can be detected before they lead to incidents and problems. This represents a significant shift from reactive to proactive service management; if a user or business client reports an issue that the team is unaware of, SREs consider their mission a failure.
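One common proactive technique is alerting on how fast the error budget is being consumed rather than on individual failures. The following sketch assumes a request-success-rate SLI; the thresholds and window sizes are illustrative, not prescriptive:

    # Compare how fast the error budget is burning against the SLO, so the
    # team is paged on trends before users report an outage.
    def burn_rate(good_requests: int, total_requests: int, slo: float = 0.9999) -> float:
        if total_requests == 0:
            return 0.0
        error_rate = 1 - good_requests / total_requests
        allowed_error_rate = 1 - slo
        return error_rate / allowed_error_rate   # 1.0 == burning budget exactly on pace

    # Example: in the last hour, 99.97% of requests succeeded.
    rate = burn_rate(good_requests=999_700, total_requests=1_000_000)
    if rate > 14.4:      # illustrative fast-burn threshold for a one-hour window
        print(f"PAGE: burn rate {rate:.1f}x, budget exhausted in days, not weeks")
    elif rate > 1.0:
        print(f"WARN: burn rate {rate:.1f}x above sustainable pace")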

Declarative and Continuous Deployment

GitOps is a modern approach to managing infrastructure and application deployments using Git as the single source of truth. By defining infrastructure and deployment configurations declaratively as code, using tools like Crossplane, Terraform, and Ansible, and storing them in a version-controlled repository, GitOps enables automated, self-documented, and auditable continuous workload deployment.

GitOps empowers SREs to streamline deployments by writing code once and reusing it for multiple deployments. Deploying a new application or updating an existing one simply involves updating the repository. This approach helps ensure consistency across multi-active environments and reduces the risk of human error while performing changes across location scopes.
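The following toy reconciliation loop illustrates the GitOps pattern, with the Git repository and the target environment simulated as in-memory dictionaries and hypothetical image names. In practice, controllers such as Argo CD or Flux perform this convergence continuously against real clusters:

    import hashlib
    import json

    # Desired state would normally be read from manifests in a version-controlled
    # repository; actual state would be queried from the cluster or cloud API.
    desired_state = {
        "checkout-service": {"image": "registry.example.com/checkout:1.4.2", "replicas": 6},
    }
    actual_state = {
        "checkout-service": {"image": "registry.example.com/checkout:1.4.1", "replicas": 6},
    }

    def fingerprint(spec: dict) -> str:
        return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

    def reconcile_once():
        for name, spec in desired_state.items():
            if fingerprint(actual_state.get(name, {})) != fingerprint(spec):
                print(f"drift detected for {name}; applying desired state")
                actual_state[name] = dict(spec)   # in reality: apply via the platform API
            else:
                print(f"{name} is in sync")

    reconcile_once()   # in production this loop runs continuously, not once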

Graceful Location Scope–Based Updates

Location scope–based updates, or “one region at a time,” involve the gradual execution of planned changes or application releases by SREs to ensure zero downtime, similar to blue-green and canary deployment methods.

Before releasing an application or performing a maintenance job, SREs de-advertise a location scope (such as a cloud region) from the global traffic management pool so it can shut down gracefully without causing errors to clients. Sometimes, SREs must also shut down components in a specific order, stopping dependents before the services they depend on, which allows them to work on their tasks without impacting SLOs or losing in-transit processes and data.

Once maintenance is completed or a new version of the application has been successfully pushed, all affected applications are verified and tested before the serving location is reintegrated into the global traffic management pool. These steps are replicated to other regions, one at a time, effectively allowing organizations to consistently deploy new application features and perform maintenance tasks during working hours and peak times.
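A simplified orchestration of this “one region at a time” flow might look like the following sketch. The region names and helper functions (deadvertise, verify, and so on) are hypothetical stand-ins for real traffic-management, deployment, and verification APIs, which vary by platform:

    import time

    REGIONS = ["eu-west-1", "us-east-1", "ap-southeast-2"]

    def deadvertise(region): print(f"removing {region} from the global traffic pool")
    def deploy(region, version): print(f"deploying {version} to {region}")
    def verify(region) -> bool: print(f"running smoke and SLO checks in {region}"); return True
    def readvertise(region): print(f"returning {region} to the global traffic pool")
    def rollback(region): print(f"rolling back {region}")

    def rollout(version: str):
        for region in REGIONS:
            deadvertise(region)          # drain the location scope gracefully
            deploy(region, version)
            if not verify(region):       # never reintegrate an unverified region
                rollback(region)
                readvertise(region)
                raise RuntimeError(f"rollout halted after failure in {region}")
            readvertise(region)
            time.sleep(1)                # in practice: a soak period before the next region

    rollout("checkout:1.4.2")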

To ensure the availability of a business service, SREs need confidence in its reliability. In the next section I discuss chaos engineering, a practice that is essential to test and validate end-to-end SLOs.

Chaos Engineering

Chaos engineering is a proactive and principled practice of conducting reliability experiments on a business service to strengthen its ability to withstand unpredictable and unforeseen conditions and to uncover issues that might not be detected during preproduction testing. This is achieved by intentionally injecting faults and errors into a component to observe how it behaves under adverse conditions. Examples of failure injections include node/pod failure, packet drops, cloud zone and region failures, latency, I/O delays, process kill, disk fill, certificate expiry, and clock changes, among others.

Contrary to what the name might imply, these experiments are carefully designed and orchestrated and follow a rigorous method, as shown in Figure 4-1. The process begins with understanding the system end to end and identifying potential weaknesses, followed by formulating a hypothesis based on these observations.

Figure 4-1. Chaos engineering methodology

Subsequently, a tailored experiment is planned and executed against the system. As previously discussed, SREs must have a comprehensive understanding of the entire technology stack and of the end-to-end application flow. This will help them to effectively experiment on all components.

It is important, however, to begin by simulating realistic scenarios, injecting likely failures and bugs, and then steadily escalating complexity. For example, if latency has been an issue in the past, SREs can intentionally introduce faults that cause latency to occur. The results of the experiment are then measured and compared with the hypothesis to either confirm or refute it. By doing so, SREs can identify immediate weaknesses in the system and develop strategies to mitigate or eliminate them before focusing on more complex failure scenarios.
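As an illustration, the following sketch runs a simplified latency-injection experiment: state a hypothesis about the steady-state metric, inject additional latency, measure, and compare. The service call and thresholds are simulated here; real experiments target live dependencies through a chaos tool or fault-injection proxy:

    import random
    import statistics

    # Hypothesis (assumed for illustration): p95 latency stays under 250 ms
    # even with 100 ms of injected latency.
    HYPOTHESIS_P95_MS = 250

    def call_service(injected_latency_ms: float = 0.0) -> float:
        base = random.uniform(20, 80)              # simulated normal response time
        return base + injected_latency_ms

    def run_experiment(samples: int = 500, injected_latency_ms: float = 100.0) -> None:
        latencies = [call_service(injected_latency_ms) for _ in range(samples)]
        p95 = statistics.quantiles(latencies, n=20)[-1]
        verdict = "confirmed" if p95 <= HYPOTHESIS_P95_MS else "refuted"
        print(f"observed p95 = {p95:.0f} ms -> hypothesis {verdict}")

    run_experiment()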

It’s essential that impact is contained to avoid cascading failures and to minimize the blast radius. This can be achieved by targeting only a select group of services, possibly in a cloud region that has been de-advertised by the SREs, or by ensuring that dependent services are gracefully disconnected. The use of feature flags and canary releases can also help limit the impact of the experiment to a small group of users. It is also important to have workable rollback plans in place in case an experiment goes wrong.

These experiments are typically conducted during designated “game days” to assess a service’s behavior in a controlled setting. As the organization matures and gains experience, experiments should gradually be conducted in production to ensure the application’s robustness when it matters. Experiments can then be automated and executed continuously, or run as part of continuous delivery pipelines.

SREs can use chaos engineering tools offered by cloud providers, such as AWS Fault Injection Simulator and Azure Chaos Studio, to experiment on cloud-native managed services. In addition, cloud-agnostic tools like Gremlin, Reliably, Chaos Toolkit, and LitmusChaos can perform more intricate experiments that span multiple clouds and platforms.

Observability tools must be well configured to capture relevant results and insights, enabling architectural improvements, bug reports, or feature requests. To that end, SREs must maintain open communication throughout the process, fostering a culture that appreciates the benefits of introducing controlled, short-term risks to enhance overall long-term reliability.

By running experiments and identifying issues proactively, organizations can remediate architectural weaknesses early on, improving service reliability and stability. This in turn provides confidence and evidence that their mission-critical services are meeting their end-to-end SLOs.

1 Ingo Averdunk, “Build to Manage,” IBM Cloud Garage and Solution Engineering (GSE), December 23, 2019.
