Chapter 4. Deployment and Release

All the participants I surveyed are making production deployments at least daily. In this chapter we’ll look at the techniques they use to achieve this release tempo.

In every single organization, the engineer who makes a change takes ownership of moving that change into production. They are also accountable for ensuring that the change does not cause production defects.

Single-Piece Flow

For smaller codebases owned by a single team, such as microservices, each change that lands on master ideally sits in staging only briefly, just long enough for an engineer to make any last spot checks, before being promoted to production by that same engineer.

Several participants shared a strong preference for single-piece flow, a concept from Lean Manufacturing where batch sizes are reduced down to the single item that’s actively being worked on. Teams apply this concept in software by avoiding multiple changes batching up in staging.

Release Buses

A larger, monolithic codebase makes it much harder to achieve single-piece flow. It has such a broad scope that different teams own different areas (this diffused ownership is, in my mind, a good working definition of a monolith). At any one time, changes will be landing from multiple teams, and they’ll be arriving at a rapid pace, since a large number of engineers are all targeting their changes at the same monolithic codebase.

Organizations handle this scenario by batching production changes up into a release candidate. One engineer referred to a Release Bus approach and described it as follows: every hour, an automated system identifies changes that have landed in staging but have not yet been promoted to production.1 These changes constitute the “passengers” on the next release bus, which is getting ready to head off to production. The system identifies the engineers who own these changes and asks them all to confirm that their respective changes are good to go to production by performing whatever spot checks are necessary in the preproduction environment where that bus has already been deployed. If any engineer spots a problem, the entire release is abandoned and the bus is sent back to the depot. If all engineers give the thumbs up, the bus is deployed into production, and engineers are notified so that they can ensure there are no production issues.
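
The level of automation varies, but the core loop is straightforward. The sketch below shows roughly what such an orchestrator might look like; the helpers for finding staged changes, notifying owners, collecting approvals, and deploying are hypothetical stand-ins for whatever CI/CD and chat integrations a team already has, passed in as callables.

    import time

    def run_release_bus(find_unpromoted_changes, notify, wait_for_approvals, deploy,
                        interval_seconds=3600):
        """Hourly loop: batch staged changes into a 'bus', then promote or abandon it."""
        while True:
            # Identify changes that have landed in staging but not yet in production.
            passengers = find_unpromoted_changes()
            if passengers:
                owners = {change["owner"] for change in passengers}
                notify(owners, "Your change is on the next release bus; please spot-check it.")
                if wait_for_approvals(owners, timeout_seconds=1800):
                    deploy(passengers)
                    notify(owners, "Bus deployed to production; please watch for issues.")
                else:
                    # Any objection (or a missing thumbs up) sends the bus back to the depot.
                    notify(owners, "Release bus abandoned; no changes were promoted.")
            time.sleep(interval_seconds)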

The organization that described the Release Bus system to me has made a large investment in automation. Other participants reported a similar approach, but orchestrated by an engineer rather than by automation, as part of a rotating Release Raccoon role. Once a day, this engineer would identify the batch of changes for the next release bus, coordinate with engineers and testers to validate that the bus is good to go, and then orchestrate the bus’s journey into production. The delightful Release Raccoon nomenclature comes from a blog post by the Amplify team, although the etymology is murky.

Coordinating Production Changes

Regardless of their investment in automation, every participant reported manual coordination and orchestration from time to time around production deployments.

An engineer might want to request a temporary pause on deployments while they investigate a production issue. As stated in Chapter 2, some teams will on occasion want to declare master as unstable (and thus not deployable). There are also situations where a change in one service depends on another change being deployed first, even though engineers agree that this sort of release coupling should be avoided as much as possible.

Participants have various mechanisms to manage this coordination. The most common is communication over shared chat channels, often augmented with bots that contribute context such as deployments and alerts, along with low-friction remediation, an approach sometimes referred to as ChatOps.
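
To make the ChatOps idea concrete, here is a minimal sketch of the kind of bot behavior described above. The service names, channel name, and print-based chat client are hypothetical; a real bot would post through a chat platform’s API or webhooks.

    def post_to_channel(channel, text):
        # Stand-in for a real chat integration (for example, a webhook call).
        print(f"[{channel}] {text}")

    def on_deploy(service, version, engineer):
        # Contribute context: everyone in the channel sees what changed and when.
        post_to_channel("#deploys", f"{service} {version} deployed by {engineer}")

    def on_alert(service, summary):
        # Pair the alert with a low-friction remediation hint.
        post_to_channel("#deploys", f"ALERT {service}: {summary}. Reply 'hold {service}' to pause deploys.")

    on_deploy("orders-api", "build-1432", "sam")
    on_alert("orders-api", "error rate above 2%")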

Participants with a large number of engineers invest significantly in custom release tooling, which includes coordination capabilities. For example, at the Food Delivery Service, engineers have the ability to “thumbs up” a specific build within a release dashboard, as well as to request a hold on production deploys for a service (with a note explaining why).

Controlled Rollout

A faster release tempo means less time to test changes before they are put in front of users. You might think this means a higher likelihood of production defects, but research has in fact shown the opposite—deploying more frequently has a positive relationship with both a lower change-failure rate and a lower mean time to recovery (MTTR).2

Nevertheless, all participants do have mechanisms in place to reduce or mitigate the risk of a change causing a production defect, by allowing fine-grained control over how a change is rolled out to users in production. I collectively refer to these mechanisms as Controlled Rollout.

In Continuous Delivery there is a distinction between the technical act of deploying a build artifact and the user-facing act of releasing a feature to users. There are techniques to control rollout at both levels.

Incremental Deployment

At a low level, the deployment of a specific version of an artifact can be performed incrementally, using techniques like blue/green deployment (sometimes called red/black deployment, because naming things is hard), rolling deployment, and canary deployment.

You need some form of incremental deployment in order to perform a deployment without downtime. All participants are deploying to production very frequently, and incurring downtime as part of each deployment is not an option. Therefore, they all use some form of incremental deployment. Engineers at the Financial Services Startup can directly control that incremental deployment, as a way to manage the impact of a risky change. However, this is fairly unusual. For most participants the actual act of deploying a new build is an all-or-nothing operation as far as the engineer deploying is concerned, with no fine-grained control over the rollout.
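
To illustrate the basic shape of these techniques, here is a minimal sketch of a rolling deployment, assuming hypothetical replace_instance and is_healthy callables; in practice an orchestrator such as Kubernetes performs this loop on the team’s behalf.

    def rolling_deploy(instances, new_version, replace_instance, is_healthy, batch_size=2):
        """Replace instances a few at a time, checking health between batches."""
        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]
            for instance in batch:
                replace_instance(instance, new_version)
            # Halt the rollout if the new version looks unhealthy, leaving the
            # remaining instances on the old version.
            if not all(is_healthy(instance) for instance in batch):
                raise RuntimeError(f"Health check failed after batch {batch}; rollout halted")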

Decoupling Deployment from Release

It’s possible to deploy the implementation of a feature without exposing that feature to users. Feature flagging is the technique that enables this decoupling of deployment from release. An engineer can deploy a half-finished feature into production, but hide it from users behind a feature flag, a mechanism that decides at runtime whether a given feature should be enabled for a user, based on some configuration.
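
A minimal sketch of such a runtime check, assuming a hypothetical in-memory flag configuration; real systems typically read flag state from a database or a flag-management service and cache it:

    FLAGS = {
        "new_checkout": {"enabled": True, "allowed_users": {"alice", "beta-tester-7"}},
    }

    def is_enabled(flag_name, user_id):
        flag = FLAGS.get(flag_name, {})
        return flag.get("enabled", False) and user_id in flag.get("allowed_users", set())

    # The half-finished feature is deployed, but only flagged-in users see it.
    print(is_enabled("new_checkout", user_id="alice"))    # True
    print(is_enabled("new_checkout", user_id="someone"))  # False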

Once the feature is complete, they can use that same feature flag to manage a controlled rollout of that feature. They might decide to initially expose it to 5% of users (a canary release), or they can opt to expose it to a specific cohort of users (an A/B test).3
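
A common way to implement the percentage-based variant is to hash the user ID into a bucket, so that each user gets a stable decision as the rollout percentage grows. The sketch below assumes the same kind of hypothetical flag setup as above:

    import hashlib

    def in_rollout(flag_name, user_id, percentage):
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # map each user into one of 100 stable buckets
        return bucket < percentage

    # Expose the feature to roughly 5% of users (a canary release).
    print(in_rollout("new_checkout", user_id="alice", percentage=5))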

All participants report that feature flagging is an important part of their Continuous Delivery practice, for two reasons. First, feature flags allow engineers to develop larger features incrementally—an engineer can integrate half-finished work into master, allowing one big, risky change to be sliced into multiple small, safer changes. Second, feature flags provide the safety net of controlled rollout, allowing risky changes to flow quickly into production with less risk of users being exposed to defects.

Correlating Cause and Effect

Engineers are responsible for rolling out production changes—and checking for any negative impacts from those changes—at all participating companies. This means they keep an eye on dashboards showing production metrics for some time after deploying a build or rolling out a feature.

In order to figure out whether a change has a negative impact, an engineer needs to be able to correlate the observed impact (say, an increase in error rates) with a change (rolling out a feature). In other words, they need to be able to connect cause and effect. The most obvious way to do this is with temporal correlation—I see that error rates increased at 10:24 am, and I know that I rolled out a code change at 10:23 am. Environments with a rapid deployment tempo make this correlation more challenging. If I see a production issue and there has been one deployment in the last few hours, then I have a place to start looking. If there have been 10 deployments in the last hour, my job is a little harder.

Incremental rollouts bring further challenges when it comes to correlating cause and effect. After rolling out a risky change to a canary population (5% of users, let’s say), an engineer needs some way to compare and contrast metrics for that canary population versus the general population. Rather than solving this correlation problem in a general way—which would require a large technical investment—most participants achieve this correlation via proxy attributes. For example, the Healthcare Provider and the Food Delivery Service both roll out risky changes to a canary market, rather than a random sample of their user base. An engineer would roll out a change to all users in Denver, let’s say, and then keep an eye on whether metrics for users in Denver are changing relative to the metrics in other cities.
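
The comparison itself can be simple. Below is a minimal sketch using made-up per-market error rates; a real check would query the metrics system and usually apply more statistical care than a single threshold.

    error_rates = {
        "denver": [0.021, 0.024, 0.030],   # canary market: the risky change is live here
        "austin": [0.011, 0.012, 0.010],
        "seattle": [0.013, 0.011, 0.012],
    }

    def mean(values):
        return sum(values) / len(values)

    canary = mean(error_rates["denver"])
    baseline = mean([rate for market, rates in error_rates.items()
                     if market != "denver" for rate in rates])

    if canary > 1.5 * baseline:
        print(f"Canary error rate {canary:.3f} is well above baseline {baseline:.3f}; investigate")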

Moving Fast with Safety

We’ve seen that participants achieve the most rapid release tempo by maintaining a continuous flow of small, independent changes into production. This requires a set of practices and techniques, as well as discipline, but the outcomes are worthwhile. The same tools that allow a team to make small, incremental changes also reduce the risk associated with a feature release, and greatly improve the team’s ability to react to a bad change when it does occur.

1 I assume that the Release Bus naming is a play on the traditional Release Train approach, where an extremely large batch of changes accumulates over a multiweek period, with a cut-off date at which the “train leaves the station” and no further changes are allowed into that batch.

2 Accelerate, Chapter 2.

3 Feature flagging enables a bunch of additional controlled release patterns. The Managing Feature Flags report from O’Reilly is a good resource for more details.
