Chapter 4. You Must Be This Tall

The phrase “You must be this tall to ride” commonly appears on signs at amusement parks and carnivals to let people know the minimum height requirement for certain rides. The signs are not meant for gatekeeping, but for safety. Martin Fowler used this metaphor in his article about prerequisites for microservices architecture. In the same fashion, you can think of this chapter as the “you must be this tall” sign that can help you figure out whether it’s safe for your team to get on the very fast ride of continuous deployment. In particular, I will describe a list of safety-focused practices that teams should implement before switching to a fully automated pipeline.

Sending each commit to production without manual intervention has the potential to break things, of course. Critical defects slipping past inadequate quality gates can cost businesses some serious money and scare stakeholders right back into overcomplicating the release process (and into putting heavy gatekeeping on production). That’s why it is our responsibility as software professionals to carefully evaluate whether our teams are ready, and if they aren’t, to place continuous deployment within the context of a bigger journey of continuous delivery maturity. The goal of this journey should be to build up a technical and organizational foundation that allows people of all levels of experience to participate in a fast-paced deployment life cycle.

The purpose of this chapter is not to explain how to prevent developer error. We should accept that developers are human beings who will have bad days and that mistakes will happen. Our focus should be on catching errors early and fixing them quickly rather than trying to achieve absolute developer perfection, which is an unmeetable (and unfair) standard. We can strive instead to build a safety net of practices and automation that makes it OK to fail and easy to recover. That’s why we will talk about continuous feedback tools such as frequent integration, thorough testing, code scanning, and observability. Proficiency with these safety-focused practices is what we should be judging our teams’ performance by, instead of the occasional mistake.

Now that I have put forward this disclaimer, let’s take a closer look at the practices themselves. Most of them are well-established practices of continuous delivery for which we’ll just do a small refresher, while others have emerged more recently and are especially relevant for continuous deployment. This list will inevitably be incomplete, and new innovative techniques might pop up after this book is released. Nevertheless, let’s use it as a start.

Cross-Functional, Autonomous Teams

Siloed teams are typically organized around specific roles or disciplines, such as development, testing, or design. A cross-functional team, on the other hand, consists of members with a range of skills and should include all the roles that are required to deliver the product as a whole (see Figure 4-1). This could include infrastructure skills, frontend skills, and backend skills, but also testing, security, design, and project management.

Figure 4-1. Siloed teams (left) versus cross-functional teams (right)

Cross-functional teams have several benefits compared to teams that are siloed, with closer collaboration between roles, increased speed, and less organizational friction being the main ones. When a product team is truly cross-functional, it also has several qualities that enable continuous deployment.

Fast Decision Making

The biggest enabler for fast-paced delivery is that cross-functional teams can act spontaneously and be flexible, without the need to involve others in their decisions. Their decision-making process is self-contained, so they can quickly adapt to changing requirements without having to reach across team boundaries to implement a change.

A team continuously deploying small increments to production needs to be equipped to handle the rapid pace of change that comes with a very short feedback loop. With continuous deployment, new code updates are released to users on a daily basis, or even multiple times per day. This requires a high level of adaptability: the team must be able to react quickly to changes in direction and to get fixes out fast.

A team that depends on outside support or approval can quickly find itself overwhelming (or annoying) its external collaborators if it starts to deploy many times a day. The more the team succeeds in its goal to release frequently, the more it will upset the outsiders, which is not what the team wants.

Implementation Autonomy

A truly cross-functional team should contain all the engineering skills required to build and deploy the application. Continuous deployment (or even continuous delivery) doesn’t do well with backend-only, frontend-only, or infrastructure-only teams, because deploying often can surface the complex interdependence between all these software components. Changes to one part of the system often require corresponding changes to another part in order to roll out safely. And when small changes are applied individually to production, it is essential to avoid constant blockers introduced by team boundaries, which would cause large amounts of work in progress to get stuck.

Feature flags and the expand and contract pattern, for example, are especially common coding techniques under continuous deployment, and they require developers to update provider and consumer systems in quick succession so that they can guarantee the stability of the production environment. If teams were siloed based on tech stack or specific parts of the system, coordinating these changes would become difficult: teams naturally have different backlogs with different priorities, and they would often be stuck waiting for each other rather than working on features.
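As a minimal illustration, here is a sketch of work in progress hidden behind a feature flag. The flag store, flag name, and payment functions are invented for this example; most teams would use a managed flag provider rather than an in-memory one.

    # Minimal feature-flag sketch; names and the in-memory flag store are
    # illustrative stand-ins for a real flag provider.

    class FeatureFlags:
        """In-memory stand-in for a real flag provider."""

        def __init__(self, flags: dict | None = None):
            self._flags = flags or {}

        def is_enabled(self, name: str) -> bool:
            return self._flags.get(name, False)


    def pay_with_legacy_provider(amount: float) -> str:
        return f"charged {amount:.2f} via legacy provider"


    def pay_with_new_provider(amount: float) -> str:
        return f"charged {amount:.2f} via new provider"  # still under development


    def checkout(amount: float, flags: FeatureFlags) -> str:
        # The new path can be merged and deployed at any time; it stays invisible
        # to users until the flag is switched on at runtime.
        if flags.is_enabled("new-payment-provider"):
            return pay_with_new_provider(amount)
        return pay_with_legacy_provider(amount)


    print(checkout(10.0, FeatureFlags()))                                # legacy path
    print(checkout(10.0, FeatureFlags({"new-payment-provider": True})))  # new path

Because unfinished work ships dark, both the provider and consumer changes can land on main in quick succession without waiting on another team’s release schedule.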

Overall, the ideal team practicing continuous deployment is one that contains all the roles and technical skills necessary to take care of a vertical slice of the business: a product team. This works especially well in microservices-oriented architectures where products are technically as well as organizationally isolated from each other.

Having cross-functional teams does not mean every team should look the same, or that there is no room for specialized teams anymore. In fact, in their book Team Topologies,1 Matthew Skelton and Manuel Pais recognize four categories of teams, some of which are highly specialized:

Stream-aligned team

As the name suggests, a stream-aligned team is aligned with a particular stream of work (or domain). This type of team is responsible for delivering value to customers quickly through new user-facing features and improvements.

Enabling team

An enabling team supports stream-aligned teams by providing specialized expertise and tools. For example, it may help other teams adopt new practices or technologies, or solve complex problems.

Complicated-subsystem team

This type of team works on complex subsystems that require deep technical knowledge; for example, ones that require expertise in calculation or mathematics.

Platform team

A platform team creates and maintains a set of internal products and tools that other teams can use to accelerate their work, usually ones that can be used in a self-service fashion.

Each of these teams will have a certain specialization, whether that is a business domain, a particular technology, or a complex type of problem. And all of them could maintain their own products too, whether they are externally facing or internally facing. Those products are all potential candidates for continuous deployment. The requirement for practicing continuous deployment on those products, however, is that within that team’s specialization (whichever it may be), there are all the capabilities required to develop them and deploy them. This means full autonomy and mastery of the chosen software stack, from end to end.

For example, in one of this book’s case studies, Tom Vollerthun from OTTO describes in detail how his company transitioned the QA role from being held by one gatekeeping team to letting individual QA engineers be members of product teams. This was a key enabler for the adoption of continuous deployment at OTTO, and I encourage you to read his case study to understand how the company achieved it.

Frequent Integration

As I discussed in Chapters 1 and 2, integrating code frequently is the backbone of continuous integration, and it is a baseline for continuous delivery and deployment too. In practice, it means adding our code changes to the team’s shared, mainline branch at least once a day (or multiple times a day). Frequent integration keeps change deltas small and manageable, which is a principle that becomes imperative to follow when there is no manual verification in a preproduction environment.

Hoarding thousands of changed lines on our development machines or in a branch, only to send them to production all at once, can generate chaos and disrupt users, stakeholders, and the rest of the team. Therefore, it is fundamental that all team members are aligned on a code commit etiquette that keeps changes small and integrates them frequently into main.

A good commit etiquette with continuous deployment should also include tools to make the version history easy to understand: squashing interdependent commits together, rebasing frequently to allow for fast-forward merges, and meaningful commit messages with a task identifier and coauthors. All of these small actions allow the team to keep clarity on which code changes are bundled with which production deployment.

The main tool to improve our understanding of what is deployed, however, is frequent integration, first and foremost. There are two main ways in which modern software teams achieve this: (very) short-lived branches and trunk-based development (TBD). Both are compatible with a continuous deployment strategy, although TBD is definitely a favorite due to its simplicity.

Short-lived branches

With short-lived branches, a developer can simply create a new branch off of main, make their changes, and then submit a pull request (PR) to merge the changes directly back when they feel their work is self-contained enough. Once the changes have been reviewed and approved, they can be merged into the main branch and go to production. This offers an optional checkpoint for performing code reviews. It is important to note that for a branch to be called “short-lived,” the development of a big feature will normally outlast the shelf life of any single open branch, so developers will need to merge back into main multiple times during development. This is why short-lived branches should still be used in conjunction with other techniques for hiding work in progress that don’t rely on version control, such as feature toggles.

Not all branches are created equal. It’s important to remember that short-lived branches are intended to be used for small, focused tasks and should not be used for long-term development efforts or whole features. Ideally, they shouldn’t live for more than a day. Short-lived branches are in direct contrast with feature branches, for example, although the two might look the same in our version control systems.

Short-lived branches versus feature branches

Long-lived feature branches are an antipattern in continuous integration. They are typical of development workflows such as Gitflow, which aims to segregate the changes of entire initiatives until they are ready to go to production. Such workflows introduce a tight coupling between the release process and the functionality of the version control system, and they discourage the use of more modern techniques like feature flags.

It must be acknowledged that models like Gitflow work remarkably well for open source projects on collaboration platforms such as GitHub, where developers collaborate over longer periods and communication is asynchronous by nature. However, in a cohesive team with real-time communication channels, long feature branches bring more overhead than value. In fact, they are actively disruptive. They encourage code drifting significantly from production and let it accumulate into big, painful batches that lead to messy merges and tricky releases.

Many teams successfully use short-lived branches with continuous deployment, but this requires a great deal of discipline and continuous integration maturity so that they do not degenerate into long-lived feature branches. All it takes is for one developer to succumb to inertia and forget to integrate for a day or two, and a branch will accumulate enough changes that it cannot be called “short-lived” anymore.

Using branches as part of everyday coding in the team makes it easy to do the wrong thing (accidentally hoard changes), and makes it hard to do the right thing (consciously integrate often). That is why many teams that look to encourage good practices use a different paradigm: trunk-based development.

Trunk-based development

TBD is a methodology in which all developers work on a single branch, typically known as trunk or main. This approach is in contrast to other models relying on separate branches, where developers keep their changes away from main and merge them periodically.

In TBD, developers are encouraged to commit small, incremental changes that keep the codebase green and deployable at all times. This allows for an even shorter feedback loop, as changes are available to all other developers as soon as they are committed. With continuous deployment, they will also be available to users minutes later.

Another key benefit of TBD is that it reduces the complexity of the team’s version control. With branches of any kind, it can be difficult to track changes and merge them back into main due to conflicts and delays. By working on a single branch, developers can avoid these issues and focus on building new things.

TBD is still controversial in some communities, but it is practiced successfully by many teams, including most of the ones I have been lucky enough to work with. This is what the DORA researchers shared about it in 2018:

Our research also found that developing off trunk/master rather than on long-lived feature branches was correlated with higher delivery performance. Teams that did well had fewer than three active branches at any time, their branches had very short lifetimes (less than a day) before being merged into trunk and never had “code freeze” or stabilization periods. It’s worth re-emphasizing that these results are independent of team size, organization size, or industry.

Even after finding that Trunk-based Development practices contribute to better software delivery performance, some developers who are used to the “GitHub Flow” workflow remain skeptical. This workflow relies heavily on developing with branches and only periodically merging to trunk.2

Notwithstanding this, there are some challenges to using TBD. The whole team adding to the same branch can make it more difficult to work on multiple tasks concurrently and to coordinate activities among multiple developers. Also, the burden of ensuring that developers won’t step on one another’s toes (or one another’s lines of code) needs to be addressed during the planning of day-to-day work. But this is not necessarily a negative: when working with branches, developers still risk making overlapping changes; they just wouldn’t notice these changes until merge time, when context is stale and the contested lines might have diverged even more significantly. One could say that TBD helps merging issues surface earlier, when they are easier to fix.

When practiced together with continuous deployment, TBD means every single code commit will be immediately deployed to production. This makes for the most straightforward implementation of a one-piece, continuous flow of changes. As discussed in Chapter 1, this concept from Lean manufacturing is what makes continuous deployment so powerful: it eliminates waste and batching from the path to production. I would argue that for this reason, the combination with TBD is the purest implementation of continuous deployment, although the use of very short-lived branches remains a good compromise where this is not an option.

In the case study on digital bank N26 in Part V, you can read about such a situation: unable to do TBD due to regulation constraints, N26 engineers use microbranches and PRs to provide proof of peer review and to ensure that no arbitrary changes are made to the system by single developers. However, they couple this process with pair programming and mob programming so that the code review happens live and integration into main can be expedited.

This brings us to the next topic: reviewing code.

Frequent Code Reviews

Code reviews are essential, as they provide a crucial point of human feedback on the design, correctness, and completeness of the code. Under continuous deployment, this channel is also the only form of human feedback in the entire path to production. That makes code reviews especially meaningful: they become the only tool ensuring that every line of code is checked by more than one set of eyes before it reaches production.

No matter how well tested a piece of functionality is, if the developer who wrote it has misunderstood the requirements, they will write incorrect tests to go along with an equally incorrect implementation. No matter how many elaborate code scanning tools we have, only a human can detect whether code respects functional requirements and whether it respects team agreements regarding design and structure. There are many such code design principles that go beyond trivial linting rules, and books upon books have been written about them. For example, code needs to be well partitioned, be unsurprising to read, belong to the right level of abstraction, and be conceptually aligned to its architecture. After all, if we were writing code only to be understood by machines and not by other people, we might as well ditch all of our design books and highly abstracted programming languages and go back to the ancient assembly spaghetti that our grandmothers3 had to deal with.

Putting such an emphasis on code reviews might seem at odds with the other messaging in this book so far. As we discussed in Chapter 1, we are striving to completely remove manual bottlenecks from the path to production. Aren’t code reviews an example of a manual bottleneck where changes might accumulate and get stuck? And didn’t we just look in the preceding section at the benefits of TBD over long feature branches and PRs? How are we supposed to perform code reviews without PRs?

Pull requests

It is worth mentioning that by keeping their branches small, a lot of teams also create very tiny PRs, which lead to quick code reviews that don’t disrupt the continuous flow of code to production all that much. In those teams, all developers need to be very engaged with the code review process so that they can minimize the time spent waiting by their colleagues who want to integrate. A lot of engineers work this way, and they manage to keep their wait times reasonably low and achieve a somewhat smooth workflow.

Still, I think we can do even better than that.

Something I noticed is that over the years we have come to collectively associate the review of an open PR with the only time and place for code to be reviewed. I would like to challenge this concept. There is another practice in the eXtreme Programming toolbox that offers an alternative to PRs as the engine for code reviews: pair programming.

Pair programming

Pair programming is a very old practice, almost as old as programming itself:

Betty Snyder and I, from the beginning, were a pair. And I believe that the best programs and designs are done by pairs, because you can criticize each other, and find each other’s errors, and use the best ideas.

Jean Bartik, one of the very first programmers4

Pair programming regained popularity in the early days of Agile, although it seems to have sadly fallen out of fashion, as many companies have forgotten to make it part of their “Agile transformation.” But as practices like TBD and continuous deployment become increasingly popular, pair programming is worth reevaluating, as it can offer more safety than ordinary code reviews through PRs.

With pair programming, all production code is developed by a pair of developers sharing a keyboard and screen; virtual ones in the case of remote pairing. As the pair work through a task, they switch roles between typing and reasoning about the code design. As each member of the pair has to verbalize their assumptions and design ideas, they continuously debate the implementation and the requirements, therefore performing a continuous code review.

A second set of eyes is on the code before and during the writing process, not just after the fact. This can be more helpful than a review at PR time because it offers a much bigger (and earlier) window of opportunity to amend design errors or clarify misunderstandings of the requirements. It also happens to avoid the social awkwardness of requesting big changes after a colleague has done a lot of work on a PR, which is a further barrier to code quality (and, sadly, one that I have seen get in the way many times).

Pair programming can also speed up the implementation of features and the resolution of bugs because more than one brain is available to tackle problems as they come up. It also speeds up integration because it removes the bottleneck of having to find available reviewers, who might need to context-switch in order to unblock their colleagues. Due to this continuous and more engaging code review process, the final design of the code is usually of higher quality and requires less rework, saving a lot of time.

The main objection to pair programming is usually along the lines of “it takes twice the work hours to implement the same thing!” But I find that to be inaccurate in most cases, because it fails to consider all the time saved elsewhere. Even if the objection were accurate and a continuous code review process really were that much more expensive, I would argue it is still an investment worth considering. We are deploying every commit to production, after all, and we aim to keep a high level of safety from human error in the process. Speed and agility always require an investment.

Personally, I have used pair programming as a code review tool in almost all of my teams, and most developers I worked with found it to be a great help in delivering products, onboarding new team members, and maintaining a shared sense of code ownership.

Psychological safety

Regardless of whether your team uses PRs or pair programming, you should ensure that code reviews are a detailed and frequent process if you plan to adopt continuous deployment. It is the responsibility of all senior team members to create a space where all colleagues, especially junior ones, feel empowered to give honest feedback and ask difficult questions. The definition of “good code” can be personal, but it doesn’t mean it shouldn’t be debated and negotiated by the team every day. Briefly upsetting someone’s feelings is never pleasant, but it is better than the alternative: a circle of ruinous empathy where everyone is patting one another on the back at the expense of the product’s stability in production.

Automated Code Analysis

We have talked about the importance of more than one set of eyes looking at code, but that doesn’t mean that catching common oversights and mistakes cannot be automated. This is where code analysis tools can play an important role, also enhancing the safety of continuous deployment. With the help of automation, developers and their pairs (or PR reviewers) can stop worrying about finding low-level issues that might easily be overlooked, and instead can focus on the bigger picture: for example, how the changes fit in with the existing architecture, how they should be released, and whether they satisfy requirements.

Static code analysis tools can analyze code without actually executing it, and they are usually quite fast, so they can be integrated as an early step of the pipeline to catch common mistakes, or even run in IDEs and pre-commit hooks. They can identify all sorts of common problems, such as bugs, security vulnerabilities, and resource utilization issues, as well as enforce coding standards early.

Many open source code analysis tools are available that support a wide variety of programming languages. There might be some up-front setup and configuration to be done, but most of them are fairly straightforward to keep using afterward. I find that in the vast majority of cases, the reasons to include them outweigh the reasons not to.

In short, static code analysis tools are excellent at preventing bugs that originate from inattention and common programming mistakes. However, there are two features that I want to especially call out as useful in a continuous deployment scenario: security vulnerability scanning and performance analysis.

Some of the human errors with the direst consequences for popular applications are related to security and performance. They are also among the hardest to spot, as automated tests usually look for regressions in behavior rather than in the cross-functional characteristics of the software. As developers work in small increments, it is quite easy to be forgetful and introduce a resource leakage that will only cause problems in an environment experiencing heavy load, such as production. It can be equally easy to forget to sanitize our inputs correctly in every commit, opening the system up for yet another problem that will only be evident once it is in front of unknown, untrusted users. Automated code scanning alleviates those concerns and can give peace of mind to developers and stakeholders alike when considering everything that could go wrong with a constant stream of changes.
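As an illustration of the kind of oversight a security scanner catches, here is a hedged Python sketch: a tool such as Bandit would flag the string-built query as a possible SQL injection vector, while the parameterized version passes. The table and column names are made up for the example.

    # The kind of mistake a static security scanner is designed to flag;
    # the users table and its columns are invented for illustration.
    import sqlite3

    def find_user_unsafe(conn: sqlite3.Connection, username: str):
        # String-built SQL: a scanner would report a possible injection here.
        query = "SELECT id, email FROM users WHERE username = '" + username + "'"
        return conn.execute(query).fetchall()

    def find_user_safe(conn: sqlite3.Connection, username: str):
        # Parameterized query: user input never becomes part of the SQL text.
        return conn.execute(
            "SELECT id, email FROM users WHERE username = ?", (username,)
        ).fetchall()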

Test Automation

As this is the 21st century, it should go without saying that test automation is preferable to manual regression testing before each deployment. It is faster, more efficient, more consistent, and cheaper. Automated tests can be run quickly and repeatedly, without the need for human intervention, so they are not subject to human error or variation in the way that manual tests are. Software testing of every commit is the textbook example of a repetitive and exact task that is perfectly suited to the endless patience of a computer, and it has no business being performed by human hands. Human creativity and attention should be reserved for challenging assumptions and pushing the system in unexpected ways, not repeatedly verifying the same features over and over.

Automating any regression tests that used to be manual is something that should be at the top of the to-do list of any company by now, but it should be taken especially seriously by teams looking to adopt continuous deployment.

We shouldn’t continuously deploy code that doesn’t have good test coverage. As Michael Feathers writes in Working Effectively with Legacy Code, code without tests is just as bad as (and can be considered) legacy code:

To me, legacy code is simply code without tests. [...] Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.5

Indeed, it doesn’t matter how pretty our code looks: without a pipeline backed by thorough automated tests, we can’t stop regressions from being deployed to production. Absent or neglected test coverage might have been somewhat more tolerable with changes stopping in preproduction to be manually verified, but it becomes reckless when the gate to production is wide open and no manual verification is possible. Later in his book, Feathers goes on to say that there are two ways to make changes in a software system: “Cover [with tests] and Modify,” or “Edit and Pray.” It shouldn’t need saying that if we are using continuous deployment with the Edit and Pray approach, we are going to have to pray extra hard.

Now that we have that disclaimer out of the way, we can talk about what kinds of tests are necessary. After all, there are many types of automated tests in a developer’s toolbox, and they come with all sorts of levels of abstraction and granularity: unit tests, integration tests, acceptance tests, component tests, visual regression tests, contract tests, journey tests…just to name a few (I couldn’t possibly cover them all in this section, or it would become its own book). Beyond unit tests, which are generally well understood, the terminology has been historically fuzzy throughout the industry, with competing definitions for several types of tests. If you lock two developers in a room and show them the same test code, you will probably get three different names for it.

However, after working in a few teams, I started to realize terminology doesn’t really matter as long as the whole team agrees to and sticks to the same definition. Each team member should be aware of the types of tests used in their team, their level of abstraction, when and where they are appropriate, and the boundaries of their system under test. The team should periodically update its testing strategy and renegotiate which coverage is needed as its product grows.

The layers of tests to use will vary from application to application and from tech stack to tech stack. Which types of tests to add, and how many, is a matter of opinion and might be unique to each team, but I find that the most helpful rule of thumb is to follow the well-known testing pyramid model.

The testing pyramid model

The testing pyramid is a visual metaphor that describes the different categories of testing that should be performed on a system. It is a pyramid shape because the idea is that there should be lots of low-level tests, such as unit tests, and far fewer high-level tests, such as end-to-end tests. The tests at the bottom of the pyramid can be numerous, as they have a high granularity (individual classes or functions), run very quickly, and are easy to write. On the other hand, the tests at the top of the pyramid are comprehensive and valuable, but they also run a lot more slowly and require elaborate setups, so we should use them only to validate the most valuable behavior of the system rather than all of its details.

In Figure 4-2, you can see some examples of what a testing pyramid might look like for two different types of applications: a REST API and a one-page application frontend.

Figure 4-2. Two examples of testing pyramids
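To make the contrast between layers concrete, here is a hedged pytest sketch of two tests from different levels of the pyramid for a REST API. The discount rule, the custom marker, and the health endpoint are invented for illustration.

    # Two layers of the pyramid, sketched with pytest; all names are illustrative.
    import urllib.request

    import pytest

    def apply_discount(price: float, percent: float) -> float:
        return round(price * (1 - percent / 100), 2)

    def test_apply_discount_unit():
        # Bottom of the pyramid: runs in milliseconds, so thousands of these are fine.
        assert apply_discount(100.0, 20) == 80.0

    @pytest.mark.end_to_end  # custom marker, assumed to be registered in pytest.ini
    def test_service_is_healthy_end_to_end():
        # Top of the pyramid: needs a deployed environment, so keep only a handful.
        with urllib.request.urlopen("https://staging.example.com/health") as resp:
            assert resp.status == 200

The first test can exist in the hundreds or thousands; the second requires a running environment, so only a few of its kind should exist.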

In the past, only the unit tests at the bottom of the pyramid were written during the implementation phase by the developers themselves. The laborious task of writing high-level coverage has been historically left to QA roles as a productivity optimization of the manual testing work they would have been doing anyway. This would usually happen after the development phase.

As we discussed in Chapter 2, this approach to test automation is not sustainable in modern teams using continuous deployment. When code never sits waiting in an artifact repository or in preproduction, the time window for adding high-level tests “later” is lost. Therefore, it is imperative that the team shifts testing to the left and that each layer of the testing pyramid is updated during the development phase itself.

Working with automated tests at each level of the testing pyramid is something that every team member who writes code should feel comfortable doing, regardless of their seniority. Only when this is true can each change go to production safely, no matter who produced it. With immediate production deployments, writing good tests might even be more important than writing good code.

The Swiss cheese model

Another good model for test coverage that can complement the classic testing pyramid is the Swiss cheese model. This model was first proposed by James T. Reason, and while it was originally applied to domains such as aviation safety, engineering, and healthcare, it is excellent at representing software testing as well.

In this model, all layers of the testing pyramid can be thought of as different slices of Swiss cheese. The holes in each slice represent different weaknesses or missed coverage in the testing layers. A defect might pass through one or two layers, but it can later be caught by another one with slightly different coverage, as shown in Figure 4-3. The bugs that make it all the way through the slices of cheese are the ones that the users get to experience in production.

Figure 4-3. The Swiss cheese model

By examining the characteristics of each layer (e.g., speed, flakiness), we can reason about what is the appropriate amount of coverage. For example, tests that are appropriate to write in the first, more detailed layer might be way too granular for slower and more expensive layers, and would be redundant as well.

The Swiss cheese model can also be helpful for making decisions about areas where there is an unavoidable overlap between the layers. More overlap means more protection in case the coverage is changed incorrectly in one of the other layers, but it also means a higher cost of maintenance, as changing the functionality will require updating more layers of testing.

Test-first

Continuous deployment requires writing tests during the implementation phase of code, but that in itself is not a new concept. Unit tests have been integrated into the development life cycle for a while, especially with the introduction of the “test-first” principle and test-driven development (TDD). You can read about TDD in much more depth in Test Driven Development: By Example,6 but for now I’ll summarize how it works.

As shown in Figure 4-4, TDD consists of three phases: writing a failing test, writing the minimum amount of code required to make that test pass, and finally refactoring the code once you are protected by the test.

Figure 4-4. The TDD life cycle
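A minimal sketch of one loop of the cycle, using pytest; the loyalty-points rule is invented for illustration and is not taken from any particular product.

    # One red/green/refactor loop, sketched with pytest.

    # 1. Red: the test is written first and fails, because the function does not exist yet.
    def test_one_point_per_ten_units_spent():
        assert loyalty_points(amount_spent=35) == 3

    # 2. Green: write the simplest code that makes the test pass.
    def loyalty_points(amount_spent: int) -> int:
        return amount_spent // 10

    # 3. Refactor: with the test green, rename, extract, or simplify freely;
    #    the test keeps guarding the behavior while the structure improves.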

Practicing TDD is another one of my default recommendations whenever a team is considering automated deployments. The reasoning is simple. Writing failing tests before writing the code to turn them green is a reliable way to make sure that tests actually get written and coverage remains sturdy. This might seem obvious due to the rules specifically forbidding writing any production code without a corresponding test, but there is another, more subtle advantage to test-first over test-after that acts as a positive influence on test coverage.

By writing the tests before the code, the tests act as the very first consumer of the code’s API. This forces developers to think deeply about the contracts of their classes and functions and the way they interact with one another, before they even think of their implementation. As a result, code that is written test-first is inherently testable and modular. On the other hand, writing tests after implementation is finished doesn’t always work well: a developer might find that they have a hard time injecting the necessary mocks or setting up the system for the test to execute when they didn’t design their code with testability in mind. They might have produced code that is very dense, which leads to complicated tests that need to perform a lot of assertions or setup. This added difficulty might discourage developers from writing tests after the fact, or it might lead to not covering the functionality as thoroughly as needed. Though it might seem counterintuitive, writing tests first is easier than writing them afterward.

With continuous deployment, all work in progress should be hidden under feature toggles or the expand and contract pattern so that we can commit it at any point as long as it compiles and passes the tests. With the addition of the quick TDD loop, what follows is that the codebase should always be, at most, one failing test away from being committable, and therefore deployable to production.

Outside-in

TDD is a very useful practice for designing software with unit tests leading the way, but it covers just the “unit” layer of the testing pyramid, or the Swiss cheese. You might be wondering, where do higher-level tests fit into this process?

It turns out that it is easy to also incorporate higher-level tests in a test-first workflow. This process was described in Growing Object-Oriented Software, Guided by Tests:

When we’re implementing a feature, we start by writing an acceptance test, which exercises the functionality we want to build. While it’s failing, an acceptance test demonstrates that the system does not yet implement that feature; when it passes, we’re done. When working on a feature, we use its acceptance test to guide us as to whether we actually need the code we’re about to write—we only write code that’s directly relevant. Underneath the acceptance test, we follow the unit level test/implement/refactor cycle to develop the feature.7

Figure 4-5 illustrates the process.

Figure 4-5. Outside-in TDD

Failing high-level tests that are written before implementation can act as a guide for developers, giving feedback on the completeness of their feature and letting them know when the code they have implemented is sufficient. However, high-level tests might stay red for a long time, sometimes much longer than what is a desirable interval between code commits. It is OK to mark them as ignored before committing in-progress work so that they won’t fail the pipeline, to reenable them locally while developing, and to check them in as active only once they are green. Of course, this implies that the incomplete code is well hidden and will not impact any existing functionality.
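As a sketch of how this can look with pytest, the acceptance test below is written first and temporarily skipped so that in-progress commits keep the pipeline green, while the unit-level cycle continues underneath. The HTTP client and the gift card behavior are invented for the example.

    # Outside-in sketch: a skipped acceptance test guiding the work, plus the
    # unit-level cycle underneath. Names and the API client are hypothetical.
    import pytest

    @pytest.mark.skip(reason="WIP: gift-card payments not finished; run locally while developing")
    def test_customer_can_pay_with_gift_card():
        # High-level test exercising the whole feature from the outside.
        response = api_client().post("/orders", json={"payment": "gift-card"})  # hypothetical client
        assert response.status_code == 201


    class GiftCard:
        def __init__(self, balance: int):
            self.balance = balance

        def charge(self, amount: int) -> None:
            if amount > self.balance:
                raise ValueError("insufficient balance")
            self.balance -= amount


    def test_gift_card_balance_cannot_go_negative():
        # Underneath the acceptance test, the usual test/implement/refactor cycle continues.
        card = GiftCard(balance=10)
        with pytest.raises(ValueError):
            card.charge(20)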

In my experience, using a combination of the test-first and outside-in principles when we worked with continuous deployment was a major part of what allowed our teams to feel confident about the test coverage of our application. Each new line of code we added was leaving behind a trail of unit and high-level tests. Similarly, every bug we fixed was first proven through a failing automated test that we could make green and that would prevent the bug from popping up again. This approach was making our safety net for regressions sturdier and sturdier as our products evolved.

What about legacy?

Not every team has the luxury of working on greenfield codebases where they can progressively build up their test coverage as they build up their code. Furthermore, it is a tricky decision to make as to whether a legacy or inherited application can do well under continuous deployment, and if so, at which point in time. Also, relying only on TDD to add coverage opportunistically might not be enough, as some legacy areas of the code might remain untouched for a long time or be difficult to refactor. In scenarios such as these, test coverage often needs to be worked on up front for the system to be safely changed. When the application code is very tangled, even making openings for the sake of adding tests can impact unrelated areas that haven’t yet been covered.

My rule of thumb here is to implement a high-level test suite first, which can poke and prod the system from the outside and treat it like a black box, instead of attempting any code untangling to add tests at the unit level. This should be enough to verify that any business-critical functionality is well protected and allows you to refactor some openings later on.

Such a test suite can be created even if we don’t necessarily understand all the system’s features, which might be buried under a mountain of convoluted code that is the result of years and years of requirement changes. With the approach that Michael Feathers describes as “characterization testing,” we can use the tests themselves to poke and prod the system and challenge our assumptions about how it works.

With characterization testing, we can write tests that trigger a behavior with the input we want to test, but then go on to make “dummy” assertions that we know will fail, such as asserting against null values. The failure message will reveal the actual result of the operation, which will allow us to go back and amend our test to make it green. Then we can go on to the next test, until we have exhausted all the different types of input we think the system might receive in the real world. This process leaves behind an executable specification of what the production system currently does, and can protect it from unintended changes later on (even when its behavior might be counterintuitive).
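Here is a minimal sketch of that workflow; the legacy function and its shipping rules are invented stand-ins for real tangled code.

    # Characterization-test sketch: pin down what the legacy code actually does.

    def legacy_shipping_fee(order_total, country):  # imagine this buried in legacy code
        fee = 7 if country != "US" else 5
        if order_total > 100:
            fee = 0
        return fee

    def test_characterize_shipping_fee_for_small_international_order():
        # Step 1: assert something deliberately wrong (e.g., None) and run the test.
        #   AssertionError: assert 7 == None   <- the failure reveals the real value
        # Step 2: replace None with the observed value to lock in current behavior.
        assert legacy_shipping_fee(order_total=30, country="DE") == 7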

Characterization testing can be helpful in preparing a legacy application, if not for continuous deployment then at least for safe refactoring and adding features.

Zero-Downtime Deployments

Perhaps one of the most obvious items on this list, but one worth mentioning just in case, is zero-downtime deployments. Zero-downtime deployments are a prerequisite for teams that want to deploy very often. We definitely don’t want our users to see a maintenance window message multiple times per day, which is what happens if we just tear down our infrastructure and rebuild it with the new version, as shown in Figure 4-6.

Figure 4-6. Deployment with downtime

There are several techniques to avoid a deployment window and achieve zero downtime, the most well-known being blue/green deployments and rolling deployments.

Blue/green deployments

Blue/green deployment is a technique that relies on using two identical production environments, referred to respectively as “blue” and “green.” During a deployment, the new version of the application is initially deployed to the idle environment (say, green), while the current version keeps serving live traffic from the other (blue). Once the green environment has been proven to work as expected and is ready to go live, incoming traffic is rerouted and the new version is live. Both the blue and green stacks are up and running during deployment, and the traffic entry point simply switches between the two; see Figure 4-7.

Each company might implement a blue/green setup a little bit differently. Most spin up a new environment right before deployment, while others leave both environments always running for extra safety, mainly so that they are able to roll back at any time (although the double infrastructure can get quite expensive). Which stack is referred to as “blue” and which is “green” can vary as well. In some cases they have fixed names, while in others the names are swapped upon deployment. Implementation details do not matter for our purposes, though, and we can simply refer to blue/green deployment as any setup that alternates between identical fleets of production servers.

Figure 4-7. Blue/green deployment
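As an illustration, here is a hedged sketch of the traffic switch, assuming an AWS Application Load Balancer whose listener forwards to either a blue or a green target group. The ARNs are placeholders, and real setups would normally script this step inside the deployment pipeline rather than in ad hoc code.

    # Minimal blue/green switch sketch, assuming an AWS Application Load Balancer
    # with one target group per environment. ARNs are placeholders.
    import boto3

    LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/shop/123/456"
    GREEN_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/shop-green/789"

    def switch_traffic_to_green() -> None:
        elbv2 = boto3.client("elbv2")
        # Repoint the listener's default action at the green fleet; the blue fleet
        # keeps running, so rolling back is just another switch.
        elbv2.modify_listener(
            ListenerArn=LISTENER_ARN,
            DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TARGET_GROUP_ARN}],
        )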

One advantage of blue/green deployment is that it allows for rapid rollback in the event of a problem with the new version of the application. If there are issues with the green environment, traffic can simply be routed back to the blue environment, minimizing downtime and minimizing the impact on users.

A blue/green deployment relies on keeping two different versions of the application up and running, at least during a small overlap window. This guarantees there will always be at least one running version of the application available to serve traffic, which removes the downtime gap.

However, it is important to note that this overlap imposes some overhead on developers. When making a change, they need to keep each new version N of the application able to run alongside version N – 1. This is especially true if we want to truly maintain the ability to roll back.

Maintaining N – 1 compatibility means developers need to be especially careful, for example, when applying database schema evolutions, changing the contract between backend and frontend, or changing the contract with any other external components.
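For example, renaming a database column in an N – 1-compatible way typically follows the expand and contract pattern across several deployments. The sketch below is illustrative only: the schema is made up, it uses raw SQLite statements for brevity, and a real project would use its migration tool of choice.

    # Expand-and-contract sketch for renaming users.surname to users.last_name
    # while versions N and N-1 run side by side. Schema and SQL are illustrative.
    import sqlite3

    def deployment_1_expand(conn: sqlite3.Connection) -> None:
        # Add the new column; the old version keeps writing surname, so nothing breaks.
        conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")
        conn.execute("UPDATE users SET last_name = surname")  # backfill existing rows

    def deployment_2_migrate_reads_and_writes() -> None:
        # Application code now writes both columns and reads last_name.
        # Both this version and the previous one still work against the same schema.
        pass

    def deployment_3_contract(conn: sqlite3.Connection) -> None:
        # Only once no running version touches surname is it safe to drop it.
        conn.execute("ALTER TABLE users DROP COLUMN surname")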

Rolling deployments

A rolling deployment (or rolling update) is a technique for updating the application cluster by replacing the current instances with fresh instances containing the latest version. As the new instances become healthy, the old ones can be gradually phased out, as shown in Figure 4-8. This technique is commonly used in setups where the application is running on a container cluster, such as ECS or Kubernetes.

Figure 4-8. A rolling update

This approach is less expensive than blue/green deployment, as it only needs one running stack of the production infrastructure. It is most commonly used with container-based deployments, although it can be done with old-fashioned virtual machines as well.

Like blue/green deployments, this approach also suffers from the N – 1 compatibility problem: for a short amount of time, instances of the new version will coexist with the old, so it is important to ensure that the contract with any external systems remains compatible with both versions.

Canary deployments

Before we start describing canary deployments, it is worth mentioning that this is an area where terminology can get confusing: canary deployments are sometimes referred to as canary releases and vice versa. Deployments and releases are distinct events, especially in the case of continuous deployment, where teams routinely decouple releases and deployments through feature toggles (as I explained in Chapter 3). In this section, I will talk about canary deployments only, which means rolling out new instances with the newest version of the code and configuration. We will talk about “canary releases” in Chapter 12, where I will show you how to perform progressive rollouts of a visible feature (ideally at runtime and through the use of a feature flag that doesn’t require a new deployment).

A canary deployment can be seen as an increment on zero-downtime deployments. Its goal goes even beyond providing zero downtime: it also allows for validating the new version of the application with a subset of traffic before rolling the update out to all users; see Figure 4-9.

This is achieved by deploying a subset of the instances with the new version (the canary), and then making the rollout to the rest of the infrastructure conditional on how well the new version performs. The comparison is automated by collecting metrics for both the new and old versions and comparing them against each other. This capability is offered by tools such as Spinnaker.

Figure 4-9. A canary deployment
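The decision logic can be illustrated with a simplified sketch: compare the canary’s error rate against the baseline fleet and decide whether to promote or roll back. The thresholds and the metrics source here are hypothetical; tools such as Spinnaker’s automated canary analysis do this far more rigorously, over many metrics and with proper statistics.

    # Simplified canary verdict: promote only if the canary is not meaningfully
    # worse than the baseline. Thresholds and inputs are illustrative.

    def error_rate(errors: int, requests: int) -> float:
        return errors / requests if requests else 0.0

    def canary_verdict(baseline: dict, canary: dict, tolerance: float = 0.005) -> str:
        baseline_rate = error_rate(baseline["errors"], baseline["requests"])
        canary_rate = error_rate(canary["errors"], canary["requests"])
        if canary_rate <= baseline_rate + tolerance:
            return "promote"
        return "rollback"

    print(canary_verdict({"errors": 12, "requests": 10_000}, {"errors": 9, "requests": 2_000}))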

This strategy gives more accurate feedback than a simple rolling or blue/green deployment, where automated checks often consist of just a simple health check or a smoke test. A canary deployment can detect things that are much more interesting, such as a significant difference in application error rate or performance issues.

Canary deployments automated around application metrics can be an extremely powerful tool to check that new versions of the code don’t have an unforeseen impact on production. This kind of thorough, extensive feedback can be a prerequisite in large companies that want to adopt continuous deployment but are fearful of the impact on performance or other critical cross-functional requirements.

However, canary deployments can have some major downsides. They can be quite complicated to set up, and they require the metrics they rely on to be meaningful and stable. Additionally, collecting the necessary data for a statistical analysis to yield accurate results could take a long time, slowing down the deployment process and creating a bottleneck.

Given these issues, I would consider canary deployments to not be a “must have” for continuous deployment in most small to medium-sized organizations. Personally, I have used simpler deployment strategies in all the teams where we practiced continuous deployment. We did not feel the necessity for sophisticated canary deployments, as our test coverage, feature toggles, and observability and alarms were comprehensive enough to keep us safe.

That said, canary deployments are definitely an interesting technique that can put stakeholders at ease when new deployments are seen as high risk, and I’m excited to see them evolve and be adopted by more and more companies.

Deployment strategies and manual steps

It is worth noting that some teams perform manual steps within blue/green or rolling deployments as a QA tool and/or a feature release tool—for example, by performing only a partial deployment to production and then executing some manual verification steps before completing it.

This type of workflow fits reasonably well with continuous delivery, but it is not compatible with continuous deployment. When the rate of commits arriving in production is much greater, adding human intervention in the middle of deployments awkwardly builds up a huge queue of changes to sort through. Manual intervention also makes the new version of the application in production behave like preproduction: deployments themselves become a queuing point where we wait for testing or experimentation to be done before the “final final” deployment step. Partial deployments with manual activity around them are still a gate to production.

I would discourage such an approach, and I believe that runtime feature flags and automated testing are much better suited for verifying changes without coupling deployments to QA and product experimentation.

Antipattern example: Blue/green as a QA tool

I have come across teams relying on manual blue/green deployments to test new versions on the idle environment before it receives traffic. This involves accessing the temporary environment at a private URL and performing regression testing to ensure that everything works as expected. Only when the team has determined that the new version looks good will it switch live traffic over to it. However, this process should be automated if the team wants to switch to continuous deployment, perhaps by replacing it with automated regression testing in lower environments, automated smoke tests, or canary deployments.

Antipattern example: Partial deployments as a canary release tool

Similarly, teams may be tempted to manually control rolling deployments as a form of A/B test to validate new features. They include the new feature in the next version of the application but only roll it out to a few instances. If the feature performs well with the exposed traffic subset, stakeholders will decide to roll it out to the rest of the infrastructure.

As I mentioned, this sort of manual and partial rollout of changes is not compatible with continuous deployments to production. The deployment should be fully automated, with the user-facing A/B testing process being replaced by feature flags. Using feature flags for user feedback allows for more fine-grained control over which users should see the feature than a partial deployment. For example, they allow selecting a subset of users by percentage of traffic or even by region, rather than an arbitrary number of requests arriving at specific instances. More importantly, if a feature doesn’t perform well, there’s no need to perform a rollback, and waiting for user feedback won’t awkwardly hold up other code changes from being rolled out.
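A sketch of the kind of targeting rule a runtime flag makes possible is shown below; the bucketing scheme, flag name, and rollout rule are invented for illustration.

    # Percentage- and region-based flag targeting sketch; names are illustrative.
    import hashlib

    def in_rollout(user_id: str, flag: str, percentage: int) -> bool:
        # Hash user + flag so each user lands in a stable bucket per flag.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percentage

    def show_new_search(user_id: str, region: str) -> bool:
        # E.g., 10% of users, but only in a pilot region; the rule can be widened
        # or switched off at runtime, with no rollback or redeployment involved.
        return region == "DE" and in_rollout(user_id, "new-search", 10)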

I would recommend that any team thinking of switching to continuous deployment replace manual deployment steps with a combination of feature flags and automated tests before opening the gate to production. In a continuous deployment pipeline, the deployment to production should always be fully automated.

Observability and Monitoring

However sophisticated your code reviews, testing, and deployment strategies may be, production issues can still occur. Sometimes this can happen a long time after the latest deployment, as the necessary conditions for problems to surface can appear randomly.

That is why it is fundamental that developers are able to get quality information about the status of production, and that this information is highly visible on information radiators available to the whole team.

Observability refers to the ability to monitor and understand the behavior of the deployed system by examining its outputs, such as logs, metrics, and traces. It represents the fundamental ability to ask new questions about the running system, affording an exploratory way to understand it. It also allows teams to identify and diagnose problems, as well as to get insights on how the system is functioning (or failing) under different conditions.

If you are in doubt as to what you should monitor, at least on the technical side, Google provides an excellent starting point in its SRE book, which describes four golden signals:

  • Latency, or the time it takes for the system to service a request.

  • Traffic, a measure of how much demand (e.g., HTTP requests, incoming messages, transactions per second) is being placed on the system.

  • Errors, or the rate of errors, especially in comparison to overall traffic.

  • Saturation, or how much of your system’s “capacity” is being used. This could translate to memory and CPU usage, current instances as opposed to your scaling limit, or hard drive fullness.

On the frontend side, you should also keep an eye on the evolution of the following Core Web Vital metrics over time:

Largest contentful paint (LCP)

This is a measure of when the largest element on the page is rendered, which is an indicator of overall load speed as perceived by the user.

Cumulative layout shift (CLS)

When an element changes position from one frame to the next, that is a layout shift. Layout shifts should be kept to a minimum, as they can disrupt the user experience in many ways.

Interaction to next paint (INP)

This is a measure of latency for all click, tap, and keyboard interactions with a page and, in particular, the longest one observed. It is an indication of the perceived responsiveness of the page.

In addition to purely technical metrics, the team should make sure to also collect data for business-relevant metrics that reflect the application’s domain; for example, the number of searches performed, conversion rates, click-through rates, and bounce rates.

The generation of outputs such as logs, metrics, and traces should be built into every increment of functionality added to the system, for two reasons: to get visibility as early as the very first deployment, and because adding it after the fact can require a redesign of the code, leading to wasteful rework.
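As a hedged example of what building these outputs in can look like, the sketch below instruments the four golden signals using the prometheus_client library; the metric names, queue gauge, and fake request handler are invented for illustration.

    # Instrumenting the four golden signals with prometheus_client; the handler
    # and metric names are illustrative stand-ins for real application code.
    import random
    import time

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUEST_LATENCY = Histogram("request_latency_seconds", "Time to serve a request")  # latency
    REQUESTS_TOTAL = Counter("requests_total", "Requests received")                    # traffic
    ERRORS_TOTAL = Counter("errors_total", "Requests that failed")                     # errors
    QUEUE_DEPTH = Gauge("work_queue_depth", "Jobs waiting to be processed")            # saturation

    @REQUEST_LATENCY.time()
    def handle_request() -> None:
        REQUESTS_TOTAL.inc()
        try:
            time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
        except Exception:
            ERRORS_TOTAL.inc()
            raise

    if __name__ == "__main__":
        start_http_server(8000)  # metrics exposed at :8000/metrics for scraping
        while True:
            QUEUE_DEPTH.set(random.randint(0, 5))
            handle_request()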

All of this information can easily become overwhelming and noisy for developers, which is why it’s also important that the most crucial signals are condensed into easy-to-read dashboards, examples of which are shown in Figures 4-10 and 4-11. It shouldn’t take more than a glance for a developer who is familiar with the system to determine whether it is operating normally. If issues are detected, more detailed information such as individual logs and traces should be available to be searched for debugging purposes in separate spaces.

Figure 4-10. An example dashboard with business metrics
Figure 4-11. An example dashboard with technical metrics

A lot of innovation is happening in the observability space, with tools such as Datadog, Splunk, Prometheus, Grafana, and New Relic proliferating and seeing more and more adoption in recent years.

Alerts

Keeping an eye on dashboards during day-to-day work is crucial with continuous deployment, but developers cannot be expected to keep their eyes glued to their Datadog tab 100% of the time. That’s why developers should be notified of abnormalities proactively rather than having to rely on their own vigilance alone. This can be achieved through the use of alerts.

Most observability tools offer alerts that notify developers through a variety of channels—Slack notifications, SMS, phone calls, carrier pigeons, and so on—when specific metrics start behaving weirdly. This is a must-have with frequent deployments.

Alerts can be based on a variety of different factors, such as the system’s performance, usage patterns, or the appearance of specific log messages or errors. By configuring them on key indicators, teams can be notified of issues as soon as they occur, allowing them to take proactive steps to address the issue before it becomes critical.

Information versus noise

A lot of alerts and monitors can become overwhelming if not configured properly. When there are too many monitors, or when alerts go off constantly because they are flaky, developers can quickly learn to tune them out and lose interest, potentially ignoring critical issues.

Alerts should be few and meaningful rather than noisy and redundant. For example, on the technical side you might want to alert on only a few key metrics, such as spikes in application errors, out-of-control latency, or insufficient healthy instances. But don’t disregard business-facing metrics. The sudden absence of certain types of requests, for example, can signal that users are having trouble completing a particular flow (e.g., the latest commit has somehow turned the checkout button invisible).

Some teams create new alerts for every new metric and keep thresholds low so that the alerts fire more often. That might seem like a comprehensive approach, but it is not helpful. Alerts that cry wolf can be even worse than no alerts at all: having no alerts at least keeps the team paying attention, because everyone knows there is a gap in the information available, whereas bad alerts offer a false sense of security by virtue of their existence, despite being ignored most of the time.

Bad information is worse than no information. That is why a team working with continuous deployment should get into the habit of refactoring its observability and alerts with the same attention it reserves for its application code, its automated tests, and its pipeline.

The Datadog team provides a good heuristic on its blog for what separates “meaningful” alerts from noise: alerting on symptoms rather than causes, a principle that is also covered in Google’s SRE book:

Pages [as in “paging someone”] are extremely effective for delivering information, but they can be quite disruptive if overused, or if they are linked to poorly designed alerts. In general, a page is the most appropriate kind of alert when the system you are responsible for stops doing useful work with acceptable throughput, latency, or error rates. Those are the sort of problems that you want to know about immediately.

The fact that your system stopped doing useful work is a symptom—that is, it is a manifestation of an issue that may have any number of different causes. For example: if your website has been responding very slowly for the last three minutes, that is a symptom. Possible causes include high database latency, failed application servers, Memcached being down, high load, and so on. Whenever possible, build your pages around symptoms rather than causes. [...]

Paging on symptoms surfaces real, oftentimes user-facing problems, rather than hypothetical or internal problems. Contrast paging on a symptom, such as slow website responses, with paging on potential causes of the symptom, such as high load on your web servers. Your users will not know or care about server load if the website is still responding quickly, and your engineers will resent being bothered for something that is only internally noticeable and that may revert to normal levels without intervention.8

Stakeholder Trust

In this chapter, we talked a great deal about the technical prerequisites for working safely under continuous deployment. I think it’s necessary to close the chapter with a reflection on the impact of the human factor.

Opening the gate to production to all commits is a trust exercise between stakeholders and their team. As developers who work on the system daily, we have intimate knowledge of all the safety measures we put in place in order to prevent Bad Things™ from happening to production. With continuous deployment, we are the ones who remain in control of quality gates: after all, we are implementing and configuring the automation that will act on our behalf. But our stakeholders cannot say the same. All they might see, from their perspective, is a loss of their chance to provide input and to block dangerous changes before it’s too late. They have no visibility into the meticulousness of the layers and layers of automation that make their approval unnecessary, and they have to trust our word as engineers. We are effectively asking them to relinquish their only power over the path to production. Given this big ask, we should strive to be empathetic to any concern they put forward. Even if we have done an excellent job at implementing a perfect technical foundation, the cultural one might still be the trickiest after all.

Yet, enabling continuous deployment is a team effort, and having our stakeholders on board is necessary for the team to make the most of this practice. Confident stakeholders will also be more patient with any teething pains as the team gets used to this new way of working. So let’s talk about how to make them enthusiastic about (rather than fearful of) automated deployments.

How Do We Convince the Boss?

As a consultant, I have had to do my fair share of convincing in the teams where we were close to continuous deployment, but not quite there yet. In my experience, this convincing is best done when little convincing is left to do.

None of the practices discussed in this chapter are needed exclusively by continuous deployment: each of them can be implemented independently and is a more than justifiable investment on its own. They will undoubtedly improve the quality of the application even with a manual gate to production still in place.

Therefore, I would encourage my fellow engineers to put them in place regardless of whether continuous deployment is the final goal, as they will still lead to a more robust implementation of continuous delivery.

Once the team has reached a high level of continuous delivery maturity, painstaking manual testing of every detail will start to feel redundant rather than necessary. At that point, in my experience, even stakeholders will learn to find manual testing annoying. That is when it is easiest to suggest going one step further without triggering strong reactions. The suggestion might even be welcomed, as its only consequence will be removing redundant work.

In my experience, this approach not only removes most of the “negotiating” from these conversations, but it also helps the team ensure that it is indeed ready, as it has evaluated how much it still relies on human eyes over automation. Your boss might even appreciate that continuous deployment readiness is a clear and concrete goal that the team can adopt to guide its continual improvement.

This raises the question, “So how do we know when we are ready?”

As you will see in the case studies in Part V, some bold companies such as AutoScout24 make the decision to adopt continuous deployment from day one, as soon as they shift to a modern production ecosystem with microservices, feature flags, and so on. However, if your company is a bit more hesitant, the next section might give you some useful pointers.

When Are We Ready?

We have covered a lot of practices here, and it might be tempting to think that each of them has to be gold-plated to perfection before even considering the removal of human steps from our pipelines. I would like to discourage my readers from that line of thinking. As we discussed in Chapter 2, one of the benefits of continuous deployment is that, once enabled, it puts any and all quality gates to a very thorough test. As code goes to production more and more often, any gaps in our processes will expose themselves rapidly, and they can be addressed by the team as they come up. If we wait for our safety nets to be absolutely perfect, we might never end up taking the leap. Doing the painful thing earlier and more often lets the practices refine themselves naturally.

Whether the time is right or not is a difficult question, and ultimately one that each team needs to answer based on the circumstances it finds itself in. That’s why I will answer this question with another question—several, in fact. These are some things I would suggest that you consider so that you can come up with your own conclusions:

  • Is my team aware of all the practices discussed in this chapter?

  • Have we implemented each practice in this chapter? If yes, to what degree of sophistication? And if not, do we have a good reason why we don’t need it?

  • For each practice we have implemented, is every team member working with it confidently rather than ignoring it or circumventing it?

  • If we implemented continuous deployment tomorrow, what type of code defect would worry me the most (e.g., performance impact, security vulnerability, regression on a specific feature)? What type of defect would worry my stakeholders the most? Is the protection against them manual or automated today?

  • If we implemented continuous deployment tomorrow, is there any particular signal from the production system I would especially keep an eye on? Do we have easy-to-access metrics giving us visibility of it today? If there is a degradation, do we already have alerting for that signal?

  • Is there a significant number of defects that are currently only caught by checking changes manually? If yes, what do they have in common? What kind of automation would be necessary to catch them earlier?

  • Given our technical practices, does the manual gate to production feel like a lifesaver today, or does it feel redundant and like an inconvenience? Does every member of the team feel the same? Do our stakeholders feel the same?

These are just some of the questions I like to ask when evaluating whether continuous deployment is the right choice at a particular point in time. Even for those who don’t plan to implement it anytime soon, the process of finding the answers to these questions might lead to a more thorough understanding of the team’s quality strategy and the robustness of its system.

Summary

In this chapter, we talked about some of the practices that our teams should implement before switching to continuous deployment. Some requirements are cultural and organizational, such as stakeholder trust and cross-functional, autonomous teams with a habit of frequent integration and code reviews. The majority of other requirements are technical: zero-downtime deployments, a pipeline with several layers of automated tests, observability, and alerts.

These are not investments that are valuable for continuous deployment only. Rather, they are good practices that stand on their own. This means that they can be implemented as improvements in isolation and still result in great outcomes for the team’s software delivery life cycle. The decision to switch to continuous deployment can be made (or reversed) later at no loss.

This foundation of practices can ensure that removal of the final gate to production will be as painless as possible.

1 Matthew Skelton and Manuel Pais, Team Topologies: Organizing Business and Technology Teams for Fast Flow (Portland, OR: IT Revolution Press, 2019).

2 Nicole Forsgren et al., Accelerate: Building and Scaling High Performing Technology Organizations (Portland, OR: IT Revolution Press, 2018), p. 91.

3 Although this is not well known, programming used to be a job held by women. In fact, many of the very first programmers were female. To learn more, visit https://oreil.ly/cQb11.

4 “Jean Bartik, ENIAC’s Programmers,” Computer History Museum, 2011, video, https://oreil.ly/4S38P.

5 Michael Feathers, Working Effectively with Legacy Code (Boston: Pearson, 2004), p. 16.

6 Kent Beck, Test Driven Development: By Example (Boston: Addison-Wesley, 2002).

7 Steve Freeman and Nat Pryce, Growing Object-Oriented Software, Guided by Tests (Boston: Addison-Wesley, 2009), p. 7.

8 Alexis Lê-Quôc, “Monitoring 101: Alerting on what matters,” Datadog, 2016, https://oreil.ly/M3Wzn.
