SOFTWARE SHOULD BE DESIGNED, WRITTEN, AND DEPLOYED IN SMALL BATCHES. Doing so is good for developers, the product, and operations, too.
The batch size is the unit at which work products move between stages in a development process. For software, the easiest batch to see is code. Every time an engineer checks in code, he is batching up a certain amount of work. There are many techniques for controlling these batches, ranging from the tiny batches needed for continuous deployment to more traditional branch-based development, where all of the code from multiple developers working for weeks or months is batched up and integrated together.
It turns out that there are tremendous benefits from working with a batch size radically smaller than traditional practice suggests. In my experience, a few hours of coding is enough to produce a viable batch and is worth checking in and deploying. Similar results apply in product management, design, testing, and even operations. This is actually a hard case to make, because most of the benefits of small batches are counterintuitive.
The sooner you pass your work on to a later stage, the sooner you can find out how that next stage will receive it. If you’re not used to working in this way, it may seem annoying to get interrupted so soon after you were “done” with something, instead of just working it all out by yourself. But these interruptions are actually much more efficient when you get them soon, because you’re that much more likely to remember what you were working on. And as we’ll see in a moment, you may also be busy building subsequent parts that depend on mistakes you made in earlier steps. The sooner you find out about these dependencies, the less time you’ll waste having to unwind them.
This is easiest to see in deployment. When something goes wrong with production software, it’s almost always because of an unintended side effect of some piece of code. Think about the last time you were called upon to debug a problem like that. How much of the time you spent debugging was actually dedicated to fixing the problem, compared to the time it took to track down where the bug originated?
Amongst many Yahoo! properties, including the largest ones, which have quite dialed-in ops teams, Flickr’s MTTD (Mean Time To Detect) was insanely low because of this. Since only a handful of lines change in any given deploy, changes that do cause regressions or unexpected performance issues are quickly identified and fixed. And of course, the MTTR (Mean Time To Resolve) is much lower as well, because the number of changes needed to fix or roll back is not only finite, but also small.
An example of this is integration risk, which we use continuous integration (http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html) to mitigate. Integration problems happen when two people make incompatible changes to some part of the system. These come in all shapes and sizes. You can have code that depends on a certain configuration that’s deployed on production. If that configuration changes before the code is deployed, the person who changes it won’t know he’s introduced a problem. That code is now a ticking time bomb, waiting to cause trouble when it’s deployed.
When the explosion comes, it’s usually operations that bears the brunt. After all, it would never have happened without that change in configuration (never mind that it also wouldn’t have happened without that new code being written, either). New code is generally perceived as valuable forward progress. Configuration changes are a necessary overhead. Reducing the odds of them colliding makes everyone’s life better. This is counterintuitive. It seems like having more releases will lead to increased odds of things going wrong. As we’ll see, that’s not actually correct. Slowing down the release process doesn’t actually reduce the total number of changes—it just combines them into ever-larger batches.
In my experience, this is the most counterintuitive effect of small batches. Most organizations have their batch size tuned so as to reduce their overhead. For example, if QA takes a week to certify a release, it’s likely that the company does releases no more than once every 30 or 60 days. Telling such a company that it should work in a two-week batch size sounds absurd—the company would spend 50% of its time waiting for QA to certify the release! But this argument is not quite right. This is something so surprising that I didn’t really believe it the first few times I saw it in action. It turns out that organizations get better at the things they do very often. So, when we start checking in code more often, releasing more often, or conducting more frequent design reviews, we can actually do a lot to make those steps dramatically more efficient.
Of course, that doesn’t necessarily mean we will make those steps more efficient. A common line of argument is: if we have the power to make a step more efficient, why don’t we invest in that infrastructure first, and then reduce the batch size as we lower the overhead? This makes sense, and yet it rarely works. The bottlenecks that large batches cause are often hidden; it takes work to make them evident, and even more work to invest in fixing them. When the existing system is working “good enough,” these projects inevitably languish.
These changes pay increasing dividends, because each improvement now directly frees up somebody in QA or operations while also reducing the total time required for the certification step. Those freed-up resources might be able to spend some of that time helping the development team actually prevent bugs in the first place, or just take on some of their routine work. That frees up even more development resources, and so on. Pretty soon, the team can be developing and testing in a continuous feedback loop, addressing micro-bottlenecks the moment they appear. If you’ve never had the chance to work in an environment like this, I highly recommend you try it. I doubt you’ll go back.
Let me show you what this looked like for the operations and engineering teams at IMVU (http://www.imvu.com/). We had made so many improvements to our tools and processes for deployment that it was pretty hard to take the site down. We had five strong levels of defense:
Each engineer had his own sandbox that mimicked production as closely as possible (whenever it diverged, we’d inevitably find out in a “Five Whys” [http://startuplessonslearned.com/2008/11/five-whys.html] shortly thereafter).
We had a comprehensive set of unit, acceptance, functional, and performance tests, and practiced test-driven development (TDD) across the whole team. Our engineers built a series of test tags, so you could quickly run a subset of tests in your sandbox that you thought were relevant to your current project or feature.
One hundred percent of those tests ran, via a continuous integration cluster, after every check-in. When a test failed, it would prevent that revision from being deployed.
When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. This would deploy the change incrementally, one machine at a time. That process would continually monitor the health of those machines, as well as the cluster as a whole, to see if the change was causing problems. If it didn’t like what was going on, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong.
We had a comprehensive set of Nagios alerts that would trigger a pager in operations if anything went wrong. Because Five Whys kept turning up a few key metrics that were hard to set static thresholds for, we even had a dynamic prediction algorithm that would make forecasts based on past data and fire alerts if the metric ever went out of its normal bounds.
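One simple way to build a dynamic bound of this kind (a sketch of the general idea, not the specific algorithm we used) is to flag a metric whenever it strays more than a few standard deviations from its recent history:

```python
from statistics import mean, stdev

def out_of_bounds(history, current, k=3.0):
    """Return True if `current` is more than k standard deviations
    away from the mean of the recent `history` samples."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is anomalous
    return abs(current - mu) > k * sigma
```

In practice you would feed this a sliding window of recent samples per metric, so the "normal bounds" track daily and weekly rhythms instead of a single static threshold.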
So, if you had been able to sneak over to the desk of one of our engineers, log in to his machine, and secretly check in an infinite loop on some highly trafficked page, here’s what would have happened. Somewhere between 10 and 20 minutes later, he would have received an email with a message that read something like this:
Thank you so much for attempting to check in revision 1234. Unfortunately, that is a terrible idea, and your change has been reverted. We’ve also alerted the whole team to what’s happened and look forward to you figuring out what went wrong.
Best of luck,
(OK, that’s not exactly what it said, but you get the idea.)
The goal of continuous deployment is to help development teams drive waste out of their process by simultaneously reducing the batch size (http://startuplessonslearned.com/2009/02/work-in-small-batches.html) and increasing the tempo of their work. This makes it possible for teams to get—and stay—in a condition of flow for sustained periods. This condition makes it much easier for teams to innovate, experiment, and achieve sustained productivity, and it nicely complements other continuous improvement systems, such as Five Whys, which we’ll discuss later in this chapter.
One large source of waste in development is double-checking. For example, imagine a team operating in a traditional waterfall development system, without continuous deployment, TDD, or continuous integration. When a developer wants to check in code, or an ops staff member thinks he’s ready to push a release, this is a very scary moment. He has a choice: do it now, or double-check to make sure everything still works and looks good. Both options are attractive. If he proceeds now, he can claim the rewards of being done sooner. On the other hand, if he causes a problem, his previous speed will be counted against him. Why didn’t he spend just another five minutes making sure he didn’t cause that problem? In practice, how people respond to this dilemma is determined by their incentives, which are driven by the culture of their team. How severely is failure punished? Who will ultimately bear the cost of their mistakes? How important are schedules? Does the team value finishing early?
But the thing to notice in this situation is that there is really no right answer. People who agonize over the choice reap the worst of both worlds. As a result, people will tend toward two extremes: those who believe in getting things done as fast as possible, and those who believe that work should be carefully checked. Any intermediate position is untenable over the long term. When things go wrong, any nuanced explanation of the trade-offs involved is going to sound unsatisfying. After all, you could have acted a little sooner or a little more carefully—if only you’d known what the problem was going to be in advance. Viewed through the lens of hindsight, most of those judgments look bad. On the other hand, an extreme position is much easier to defend. Both have built-in excuses: “Sure there were a few bugs, but I consistently overdeliver on an intense schedule, and it’s well worth it,” or “I know you wanted this done sooner, but you know I only ever deliver when it’s absolutely ready and it’s well worth it.”
These two extreme positions lead to factional strife, which is extremely unpleasant. Managers start to make a note of who’s part of which faction and then assign projects accordingly. Got a crazy last-minute feature? Get the Cowboys to take care of it—and then let the Quality Defenders clean it up in the next release. Both sides start to think of their point of view in moralistic terms: “Those guys don’t see the economic value of fast action, they only care about their precious architecture diagrams,” or “Those guys are sloppy and have no professional pride.” Having been called upon to mediate these disagreements many times in my career, I can attest to just how wasteful they are.
However, they are completely logical outgrowths of a large-batch-size development process that forces developers to make trade-offs between time and quality, using the old “time, quality, money: pick two” fallacy (http://startuplessonslearned.com/2008/10/engineering-managers-lament.html). Because feedback is slow in coming, the damage caused by a mistake is felt long after the decisions that caused it were made, making learning difficult. Because everyone gets ready to integrate with the release batch around the same time (there being no incentive to integrate early), conflicts are resolved under extreme time pressure. Features are chronically on the bubble, about to get deferred to the next release. But when they do get deferred, they tend to have their scope increased (“After all, we have a whole release cycle, and it’s almost done...”), which leads to yet another time crunch, and so on. And of course, the code rarely performs in production the way it does in the testing or staging environment, which leads to a series of hotfixes immediately following each release. These come at the expense of the next release batch, meaning that each release cycle starts off behind.
You can’t change the underlying incentives of this situation by getting better at any one activity. Better release planning, estimating, architecting, or integrating will only mitigate the symptoms. The only traditional technique for solving this problem is to add in massive queues in the form of schedule padding, extra time for integration, code freezes, and the like. In fact, most organizations don’t realize just how much of this padding is already going on in the estimates that individual contributors learn to generate. But padding doesn’t help, because it serves to slow down the whole process. And as all development teams will tell you, time is always short. In fact, excess time pressure is exactly why they think they have these problems in the first place.
So, we need to find solutions that operate at the system level to break teams out of this pincer action. The Agile software movement has made numerous contributions: continuous integration, which helps accelerate feedback about defects; story cards and kanban boards, which reduce batch size; and daily stand-ups, which increase tempo. Continuous deployment is another such technique, one with a unique power to change development team dynamics for the better.
First, continuous deployment separates two different definitions of the term release. One is used by engineers to refer to the process of getting code fully integrated into production. Another is used by marketing to refer to what customers see. In traditional batch-and-queue development, these two concepts are linked. All customers will see the new software as soon as it’s deployed. This requires that all of the testing of the release happens before it is deployed to production, in special staging or testing environments. And this leaves the release vulnerable to unanticipated problems during this window of time: after the code is written but before it’s running in production. On top of that overhead, by conflating the marketing release with the technical release, the amount of coordination overhead required to ship something is also dramatically increased.
Under continuous deployment, as soon as code is written it’s on its way to production. That means we are often deploying just 1% of a feature—long before customers would want to see it. In fact, most of the work involved with a new feature is not the user-visible parts of the feature itself. Instead, it’s the millions of tiny touch points that integrate the feature with all the other features that were built before. Think of the dozens of little API changes that are required when we want to pass new values through the system. These changes are generally supposed to be “side-effect free,” meaning they don’t affect the behavior of the system at the point of insertion—emphasis on supposed. In fact, many bugs are caused by unusual or unnoticed side effects of these deep changes. The same is true of small changes that only conflict with configuration parameters in the production environment. It’s much better to get this feedback as soon as possible, which continuous deployment offers.
Continuous deployment also acts as a speed regulator. Every time the deployment process encounters a problem, a human being needs to get involved to diagnose it. During this time, it’s intentionally impossible for anyone else to deploy. When teams are ready to deploy, but the process is locked, they become immediately available to help diagnose and fix the deployment problem (the alternative—that they continue to generate, but not deploy, new code—just serves to increase batch sizes to everyone’s detriment). This speed regulation is a tricky adjustment for teams that are accustomed to measuring their progress via individual efficiency. In such a system, the primary goal of each engineer is to stay busy, using as close to 100% of his time for coding as possible. Unfortunately, this view ignores the team’s overall throughput. Even if you don’t adopt a radical definition of progress, such as the “validated learning about customers” definition (http://startuplessonslearned.com/2009/04/validated-learning-about-customers.html) that I advocate, it’s still suboptimal to keep everyone busy. When you’re in the midst of integration problems, any code that someone is writing is likely to have to be revised as a result of conflicts. The same is true with configuration mismatches or multiple teams stepping on one another’s toes. In such circumstances, it’s much better for overall productivity for people to stop coding and start talking. Once they figure out how to coordinate their actions so that the work they are doing doesn’t have to be reworked, it’s productive to start coding again.
Returning to our development team divided into Cowboy and Quality factions, let’s take a look at how continuous deployment can change the calculus of their situation. For one, continuous deployment fosters learning and professional development—on both sides of the divide. Instead of having to argue with each other about the right way to code, each individual has an opportunity to learn directly from the production environment. This is the meaning of the axiom to “let your defects be your teacher.”
If an engineer has a tendency to ship too soon, he will tend to find himself grappling with the cluster immune system (http://startuplessonslearned.com/2008/09/just-in-time-scalability.html), continuous integration server, and Five Whys master more often. These encounters, far from being the high-stakes arguments inherent in traditional teams, are actually low-risk, mostly private or small-group affairs. Because the feedback is rapid, Cowboys will start to learn what kinds of testing, preparation, and checking really do let them work faster. They’ll be learning the key truth that there is such a thing as “too fast”—many quality problems actually slow you down.
Engineers who have a tendency to wait too long before shipping also have lessons to learn. For one, the larger the batch size of their work, the harder it will be to get it integrated. At IMVU, we would occasionally hire someone from a more traditional organization who had a hard time letting go of his “best practices” and habits. Sometimes he’d advocate for doing his work on a separate branch and integrating only at the end. Although I’d always do my best to convince such people otherwise, if they were insistent I would encourage them to give it a try. Inevitably, a week or two later I’d enjoy the spectacle of watching them engage in something I called “code bouncing.” It’s like throwing a rubber ball against a wall. In a code bounce, someone tries to check in a huge batch. First he has integration conflicts, which requires talking to various people on the team to know how to resolve them properly. Of course, while he is resolving the conflicts, new changes are being checked in. So, new conflicts appear. This cycle repeats for a while, until he either catches up to all the conflicts or just asks the rest of the team for a general check-in freeze. Then the fun part begins. Getting a large batch through the continuous integration server, incremental deploy system, and real-time monitoring system almost never works on the first try. Thus, the large batch gets reverted. While the problems are being fixed, more changes are being checked in. Unless we freeze the work of the whole team, this can go on for days. But if we do engage in a general check-in freeze, we’re driving up the batch size of everyone else—which will lead to future episodes of code bouncing. In my experience, just one or two episodes is enough to cure anyone of his desire to work in large batches.
Because continuous deployment encourages learning, teams that practice it are able to get faster over time. That’s because each individual’s incentives are aligned with the goals of the whole team. Each person works to drive down waste in his own work, and this true efficiency gain more than offsets the incremental overhead of having to build and maintain the infrastructure required to do continuous deployment. In fact, if you practice Five Whys too, you can build this entire infrastructure in a completely incremental fashion. It’s really a lot of fun.
Continuous deployment is controversial. When most people first hear about continuous deployment, they think I’m advocating low-quality code (http://www.developsense.com/2009/03/50-deployments-day-and-perpetual-beta.html) or an undisciplined Cowboy-coding development process (http://lastinfirstout.blogspot.com/2009/03/continuous-deployment-debate.html). On the contrary, I believe that continuous deployment requires tremendous discipline and can greatly enhance software quality, by applying a rigorous set of standards to every change to prevent regressions, outages, or harm to key business metrics. Another common reaction I hear to continuous deployment is that it’s too complicated, it’s time-consuming, or it’s hard to prioritize. It’s this latter fear that I’d like to address head-on in this chapter. Although it is true that the full system we use to support deploying 50 times a day at IMVU is elaborate, it certainly didn’t start that way. By making a few simple investments and process changes, any development team can be on their way to continuous deployment. It’s the journey, not the destination, which counts. Here’s the why and how, in five steps.
This is the backbone of continuous deployment. We need a centralized place where all automated tests (unit tests, functional tests, integration tests, everything) can be run and monitored upon every commit. Many fine, free software tools are available to make this easy—I have had success with Buildbot (http://buildbot.net). Whatever tool you use, it’s important that it can run all the tests your organization writes, in all languages and frameworks.
If you have only a few tests (or even none at all), don’t despair. Simply set up the continuous integration server and agree to one simple rule: we’ll add a new automated test every time we fix a bug. If you follow that rule, you’ll start to immediately get testing where it’s needed most: in the parts of your code that have the most bugs and therefore drive the most waste for your developers. Even better, these tests will start to pay immediate dividends by propping up that most-unstable code and freeing up a lot of time that used to be devoted to finding and fixing regressions (a.k.a. firefighting).
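To make the rule concrete, suppose a bug report shows that a (hypothetical) parse_price helper crashes on comma-grouped inputs like “$1,200”. Under this rule, the fix ships together with a regression test that pins the behavior down:

```python
def parse_price(text):
    """Parse a price string like '$1,200.50' into cents.

    The fix: strip the thousands separators that previously
    caused float() to raise a ValueError.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    return round(float(cleaned) * 100)

# The regression test added alongside the fix, so the continuous
# integration server guards this code path forever after:
def test_parse_price_handles_thousands_separator():
    assert parse_price("$1,200") == 120000
    assert parse_price("$1,200.50") == 120050
    assert parse_price("3.99") == 399
```

The names here are illustrative; the point is the pairing: every bug fix arrives with a test that would have caught it.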
If you already have a lot of tests, make sure the continuous integration server spends only a small amount of time on a full run: 10 to 30 minutes at most. If that’s not possible, simply partition the tests across multiple machines until you get the time down to something reasonable.
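The partitioning doesn’t need to be clever at first; even round-robin assignment of test files to machines (a minimal sketch) divides the wall-clock time roughly evenly:

```python
def partition_tests(test_files, n_machines):
    """Assign test files to machines round-robin.

    Sorting first makes the assignment deterministic, so every
    build distributes the same tests to the same machines.
    """
    buckets = [[] for _ in range(n_machines)]
    for i, test in enumerate(sorted(test_files)):
        buckets[i % n_machines].append(test)
    return buckets
```

A natural refinement, once you record test timings, is to assign files greedily by historical duration instead of by count.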
For more on the nuts and bolts of setting up continuous integration, see “Continuous integration step-by-step” (http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html).
The next piece of infrastructure we need is a source control server with a commit-check script. I’ve seen this implemented with CVS (http://www.nongnu.org/cvs), Subversion, or Perforce and have no reason to believe it isn’t possible in any source control system. The most important thing is that you have the opportunity to run custom code at the moment a new commit is submitted but before the server accepts it. Your script should have the power to reject a change and report a message back to the person attempting to check in. This is a very handy place to enforce coding standards, especially those of the mechanical variety.
But its role in continuous deployment is much more important. This is the place you can control what I like to call “the production line,” to borrow a metaphor from manufacturing. When something is going wrong with our systems at any place along the line, this script should halt new commits. So, if the continuous integration server runs a build and even one test breaks, the commit script should prohibit new code from being added to the repository. In subsequent steps, we’ll add additional rules that also “stop the line,” and therefore halt new commits.
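A minimal version of such a commit check might look like this sketch, where a halt file records why the line is stopped (the file path and messages are assumptions of the example; your version might instead query the continuous integration server directly):

```python
import os
import sys

# Written by the CI server or deploy tooling when something breaks.
HALT_FILE = "/var/run/production_line_halted"

def check_commit(halt_file=HALT_FILE):
    """Return (allowed, message). Wire this into your SCM's pre-commit hook."""
    if os.path.exists(halt_file):
        with open(halt_file) as f:
            reason = f.read().strip() or "unknown failure"
        return False, "Commit rejected: production line is halted (%s)" % reason
    return True, "ok"

if __name__ == "__main__":
    allowed, message = check_commit()
    if not allowed:
        sys.stderr.write(message + "\n")
        sys.exit(1)  # a nonzero exit makes the SCM reject the commit
```

The same check function can be reused by the deployment script in a later step, so commits and deploys obey one shared notion of whether the line is open.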
This sets up the first important feedback loop that you need for continuous deployment. Our goal as a team is to work as fast as we can reliably produce high-quality code—and no faster. Going any “faster” is actually just creating delayed waste that will slow us down later. (This feedback loop is also discussed in detail at http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html.)
At IMVU, we built a serious deployment script that incrementally deploys software machine by machine and monitors the health of the cluster and the business along the way so that it can do a fast revert if something looks amiss. We call it a cluster immune system (http://www.slideshare.net/olragon/just-in-time-scalability-agile-methods-to-support-massive-growth-presentation-presentation-925519). But we didn’t start out that way. In fact, attempting to build a complex deployment system like that from scratch is a bad idea.
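For concreteness, the essence of such an immune system is a small loop (an illustrative sketch, not our production code; deploy_to, is_healthy, and revert stand in for whatever deploy and monitoring commands your environment provides):

```python
import time

def immune_system_deploy(revision, machines, deploy_to, is_healthy, revert,
                         settle_seconds=60):
    """Deploy `revision` one machine at a time, watching cluster health.

    Rolls everything back and returns False the moment any health
    check fails; returns True only if every machine takes the change.
    """
    deployed = []
    for machine in machines:
        deploy_to(machine, revision)
        deployed.append(machine)
        time.sleep(settle_seconds)  # let metrics settle before judging
        if not all(is_healthy(m) for m in deployed):
            for m in reversed(deployed):  # fast revert, newest first
                revert(m)
            return False  # caller locks deployments until someone investigates
    return True
```

The loop itself is trivial; the real investment is in the health checks it consults, which is why it should be grown incrementally rather than built up front.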
Instead, start simple. It’s not even important that you have an automated process, although as you practice you will get more automated over time. Rather, it’s important that you do every deployment the same way and have a clear and published process for how to do it that you can evolve over time.
For most websites, I recommend starting with a simple script that just rsyncs code to a version-specific directory on each target machine. If you are facile with Unix symlinks (http://www.mikerubel.org/computers/rsync_snapshots/), you can pretty easily set this up so that advancing to a new version (and hence, rolling back) is as easy as switching a single symlink on each server. But even if that’s not appropriate for your setup, have a single script that does a deployment directly from source control.
When you want to push new code to production, require that everyone uses this one mechanism. Keep it manual, but simple, so that everyone knows how to use it. And most importantly, have it obey the same “production line” halting rules as the commit script. That is, make it impossible to do a deployment for a given revision if the continuous integration server hasn’t yet run and had all tests pass for that revision.
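Putting those pieces together, a first-cut deploy step along these lines might look like the following sketch (the releases/ directory layout and the ci_passed callback are assumptions of the example, not a prescribed structure):

```python
import os

def activate_revision(deploy_root, revision, ci_passed):
    """Point the 'current' symlink at an already-rsynced revision directory.

    Refuses to deploy unless continuous integration has passed for that
    revision; rolling back is just calling this again with the old one.
    """
    if not ci_passed(revision):
        raise RuntimeError("CI has not passed revision %s; deploy refused"
                           % revision)
    target = os.path.join(deploy_root, "releases", revision)
    if not os.path.isdir(target):
        raise RuntimeError("revision %s is not synced to %s"
                           % (revision, target))
    tmp_link = os.path.join(deploy_root, "current.tmp")
    current = os.path.join(deploy_root, "current")
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current)  # rename is atomic on POSIX filesystems
```

Because the symlink swap is a single atomic rename, servers never see a half-deployed tree, and the same call serves as both deploy and rollback.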
No matter how good your deployment process is, bugs can still get through. The most annoying variety are bugs that don’t manifest until hours or days after the code that caused them is deployed. To catch those nasty bugs, you need a monitoring platform that can let you know when things have gone awry, and get a human being involved in debugging them.
To start, I recommend a system such as the open source Nagios (http://www.nagios.org/). Out of the box, it can monitor basic system stats such as load average and disk utilization. For continuous deployment purposes, we want to be able to have it monitor business metrics such as simultaneous users or revenue per unit time. At the beginning, simply pick one or two of these metrics to use. Anything is fine to start, and it’s important not to choose too many. The goal should be to wire the Nagios alerts up to a pager, cell phone, or high-priority email list that will wake someone up in the middle of the night if one of these metrics goes out of bounds. If the pager goes off too often, it won’t get the attention it deserves, so start simple.
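Nagios treats any executable as a check plugin: it runs the program and reads the exit status, where 0 means OK, 1 means warning, and 2 means critical. A business-metric check can therefore be very small; in this sketch, the metric source is a callback you would wire up to your own data store (the thresholds and metric name are illustrative):

```python
# Nagios plugin contract: the exit status reports the state.
OK, WARNING, CRITICAL = 0, 1, 2

def check_metric(value, warn_below, crit_below):
    """Return (exit_code, status_line) for a 'bigger is better' metric."""
    if value < crit_below:
        return CRITICAL, "CRITICAL - simultaneous users: %d" % value
    if value < warn_below:
        return WARNING, "WARNING - simultaneous users: %d" % value
    return OK, "OK - simultaneous users: %d" % value

def main(get_simultaneous_users):
    """Call sys.exit(main(...)) from a script Nagios executes."""
    code, line = check_metric(get_simultaneous_users(),
                              warn_below=500, crit_below=100)
    print(line)  # Nagios shows the first line of output in its UI
    return code
```

Static thresholds like these are the right starting point; the dynamic, history-based bounds described earlier are a refinement to add only once a metric proves hard to threshold.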
Follow this simple rule: every time the pager goes off, halt the production line (which will prevent check-ins and deployments). Fix the urgent problem, and don’t resume the production line until you’ve had a chance to schedule a Five Whys meeting for root-cause analysis (RCA), which we’ll discuss next.
So far, we’ve talked about making modest investments in tools and infrastructure and adding a couple of simple rules to our development process. Most teams should be able to do everything we’ve talked about in a week or two, at the most, because most of the work involves installing and configuring off-the-shelf software.
Five Whys gets its name from the process of asking “why” recursively to uncover the true source of a given problem. Five Whys enables continuous deployment when you add this rule: every time you do an RCA, make a proportional investment in prevention at each of the five levels you uncover. Proportional means the solution shouldn’t be more expensive than the problem you’re analyzing; a minor inconvenience for only a few customers should merit a much smaller investment than a multihour outage.
But no matter how small the problem is, always make some investments, and always make them at each level. Because our focus in this chapter is deployment, this means always asking the question, “Why was this problem not caught earlier in our deployment pipeline?” So, if a customer experienced a bug, why didn’t Nagios alert us? Why didn’t our deployment process catch it? Why didn’t our continuous integration server catch it? For each question, make a small improvement.
Over months and years, these small improvements add up, much like compounding interest. But there is a reason this approach is superior to making a large upfront investment in a complex continuous deployment system modeled on IMVU’s (or anyone else’s). The payoff is that your solution will be uniquely adapted to your particular system and circumstances. If most of your headaches come from performance problems in production, you’ll naturally be forced to invest in prevention at the deployment/alerting stage. If your problems stem from badly factored code, which causes collateral damage for even small features or fixes, you’ll naturally find yourself adding a lot of automated tests to your continuous integration server. Each problem drives investments in that category of solution. Thankfully, there’s an 80/20 rule at work: 20% of your code and architecture probably drives 80% of your headaches. Investing in that 20% frees up incredible time and energy that can be invested in more productive things.
Following these five steps will not give you continuous deployment overnight. In its initial stages, most of your RCAs will come back to the same problem: “We haven’t invested in preventing that yet.” But with patience and hard work, anyone can use these techniques to inexorably drive waste out of his development process.
Having evangelized the concept of continuous deployment for the past few years, I’ve come into contact with almost every conceivable question, objection, or concern that people have about it. The most common reaction I get is something like “That sounds great—for your business—but that could never work for my application.” Or, phrased more hopefully, “I see how you can use continuous deployment to run an online consumer service, but how can it be used for B2B software?” Or variations thereof.
I understand why people would think that a consumer Internet service such as IMVU isn’t really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a 16-year-old girl complaining that your new release ruined her birthday party. That’s where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder. But even so, there are key concerns that lead people to conclude that continuous deployment can’t be used in mission-critical situations.
Implicit in these concerns are two beliefs:
Mission-critical customers won’t accept new releases on a continuous basis.
Continuous deployment leads to lower-quality software than software built in large batches.
These beliefs are rooted in fears that make sense. But as is often the case, the right thing to do is to address the underlying cause of the fear (http://www.startuplessonslearned.com/2009/05/fear-is-mind-killer.html) instead of avoiding improving the process. Let’s take each in turn.
Most customers of most products hate new releases. That’s a perfectly reasonable reaction, given that most releases of most products are bad news. It’s likely that the new release will contain new bugs. Even worse, the sad state of product development generally means the new “features” are as likely to be ones that make the product worse, not better. So, asking customers if they’d like to receive new releases more often usually leads to a consistent answer: “No, thank you.” On the other hand, you’ll get a very different reaction if you say to customers, “The next time you report an urgent bug, would you prefer to have it fixed immediately or wait for a future arbitrary release milestone?”
Most enterprise customers of mission-critical software mitigate these problems by insisting on releases on a regular, slow schedule. This gives them plenty of time to do stress testing, training, and their own internal deployment. Smaller customers and regular consumers rely on their vendors to do this for them and are otherwise at their mercy. Switching these customers directly to continuous deployment sounds harder than it really is. That’s because of the anatomy of a release. A typical “new feature” release is, in my experience, about 80% changes to underlying APIs or architecture. That is, the vast majority of the release is not actually visible to the end user. Most of these changes are supposed to be “side-effect free,” although few traditional development teams actually achieve that level of quality. So, the first shift in mindset required for continuous deployment is this: if a change is supposedly “side-effect free,” release it immediately. Don’t wait to bundle it up with a bunch of other related changes. If you do that, it will be much harder to figure out which change caused the unexpected side effects.
The second shift in mindset required is to separate the concept of a marketing release from the concept of an engineering release. Just because a feature is built, tested, integrated, and deployed doesn’t mean any customers should necessarily see it. When deploying end-user-visible changes, most continuous deployment teams keep them hidden behind “flags” that allow for a gradual rollout of the feature when it’s ready. (See the Flickr blog post at http://code.flickr.com/blog/2009/12/02/flipping-out/ for how that company does this.) This allows the concept of “ready” to be much more all-encompassing than the traditional “developers threw it over the wall to QA, and QA approved of it.” You might have the interaction designer who designed it take a look to see if it really conforms to his design. You might have the marketing folks who are going to promote it double-check that it does what they expect. You can train your operations or customer service staff on how it works—all live in the production environment. Although this sounds similar to a staging server, it’s actually much more powerful. Because the feature is live in the real production environment, all kinds of integration risks are mitigated. For example, many features have decent performance themselves but interact badly when sharing resources with other features. Those kinds of features can be immediately detected and reverted by continuous deployment. Most importantly, the feature will look, feel, and behave exactly like it does in production. Bugs that are found in production are real, not staging artifacts.
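The flag mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not Flickr’s or IMVU’s actual implementation: the flag names and percentages are invented, and the only real technique shown is deterministic bucketing, so a user keeps seeing the same version as the rollout percentage is widened.

```python
import hashlib

# Hypothetical flag table: each feature is deployed to production but
# shown only to a configurable percentage of users. Names are invented.
FLAGS = {
    "new_checkout_flow": 5,   # visible to 5% of users
    "redesigned_profile": 0,  # deployed, but hidden from everyone
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the rollout %."""
    rollout = FLAGS.get(flag_name, 0)
    digest = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout

# Because bucketing is deterministic, widening the rollout is just an edit
# to the percentage -- no redeploy, and no user flickers between versions.
```

The key design property is that the code for the feature ships continuously, while visibility is a runtime decision that can be changed (or reverted) instantly.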
Plus, you want to get good at selectively hiding features from customers. That skill set is essential for gradual rollouts and, most importantly, A/B split-testing (http://www.startuplessonslearned.com/2008/12/getting-started-with-split-testing.html). In traditional large batch deployment systems, split-testing a new feature seems like considerably more work than just throwing it over the wall. Continuous deployment changes that calculus, making split-tests nearly free. As a result, the amount of validated learning (http://www.startuplessonslearned.com/2009/04/validated-learning-about-customers.html) a continuous deployment team achieves per unit time is much higher.
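Why split-tests become “nearly free” is easy to see once flag infrastructure exists: an experiment is just a flag with more than one “on” state plus a tally. The sketch below is illustrative only; the experiment and variant names are made up.

```python
import hashlib
from collections import Counter

def assign_variant(experiment: str, user_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign each user to one variant of an experiment."""
    digest = hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Measuring a feature is then barely more work than shipping it:
# count exposures and conversions per variant and compare the rates.
exposures = Counter()
conversions = Counter()

def record(experiment: str, user_id: str, converted: bool) -> None:
    variant = assign_variant(experiment, user_id)
    exposures[variant] += 1
    if converted:
        conversions[variant] += 1
```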
A traditional QA process works through a checklist of key features, making sure each feature works as specified before allowing the release to go forward. This makes sense, especially given how many bugs in software involve “action at a distance” or unexpected side effects. Thus, even if a release is focused on changing Feature X, there’s every reason to be concerned that it will accidentally break Feature Y. Over time, the overhead of this approach to QA becomes very expensive. As the product grows, the checklist has to grow proportionally. Thus, to get the same level of coverage for each release, the QA team has to grow (or, equivalently, the amount of time the product spends in QA has to grow). Unfortunately, it gets worse. In a successful start-up, the development team is also growing. That means more changes are being implemented per unit time as well, which means either the number of releases per unit time is growing or, more likely, the number of changes in each release is growing. So, for a growing team working on a growing product, the QA overhead is increasing quadratically—the product of two linearly growing factors—even though the team is expanding only linearly.
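The arithmetic behind that claim can be made concrete with a toy model. The numbers here are invented for illustration; the only assumption is the one in the text, that every release must re-run the whole checklist.

```python
# Toy model of manual-QA cost: the checklist grows with product size,
# and every release re-runs the entire checklist.
def qa_hours_per_release(features: int, minutes_per_check: float = 30) -> float:
    return features * minutes_per_check / 60

def qa_hours_per_month(features: int, releases_per_month: int) -> float:
    return qa_hours_per_release(features) * releases_per_month

# Year 1: 200 features, 2 releases/month  -> 200 QA hours per month.
# Year 2: product and release rate both double ->
#         400 features, 4 releases/month -> 800 QA hours per month.
# Doubling both linear factors quadruples the cost.
```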
For organizations that have the highest quality standards, and the budget to do it, full coverage can work. In fact, that’s what happens for organizations such as the U.S. Army, which has to do a massive amount of integration testing of products built by its vendors. Having those products fail in the field would be unacceptable. To achieve full coverage, the Army has a process for certifying these products. The whole process takes a massive amount of manpower and requires a cycle time that would be lethal for most start-ups (the major certifications take approximately two years). And even the Army recognizes that improving this cycle time would have major benefits.
Very few start-ups can afford this overhead, and so they simply accept a reduction in coverage instead. That solves the problem in the short term, but not in the long term—because the extra bugs that get through the QA process wind up slowing the team down over time, imposing extra “firefighting” overhead, too.
I want to directly challenge the belief that continuous deployment leads to lower-quality software. I just don’t believe it. Continuous deployment offers significant advantages over large batch development systems. Some of these benefits are shared by Agile systems that have continuous integration but large batch releases, but others are unique to continuous deployment.
Engineers working in a continuous deployment environment are much more likely to get individually tailored feedback about their work. When they introduce a bug, performance problem, or scalability bottleneck, they are likely to know about it immediately. They’ll be much less likely to hide behind the work of others, as happens with large batch releases—when a release has a bug, it tends to be attributed to the major contributor to that release, even when that attribution is unfair.
Continuous deployment requires living the mantra: “Have every problem only once.” This requires a commitment to realistic prevention and learning from past mistakes. That necessarily means an awful lot of automation. That’s good for QA and for engineers. QA’s job gets a lot more interesting when we use machines for what machines are good for: routine repetitive detailed work, such as finding bug regressions.
To make continuous deployment work, teams have to get good at automated monitoring and reacting to business and customer-centric metrics, not just technical metrics. That’s a simple consequence of the automation principle I just mentioned. Huge classes of bugs “work as designed” but cause catastrophic changes in customer behavior. My favorite: changing the checkout button in an e-commerce flow to appear white on a white background. No automated test is going to catch that, but it still will drive revenue to zero. That class of bug will burn continuous deployment teams only once.
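A business-metric check of the kind described above might look like the following sketch. The thresholds and metric names are invented; the point is that instead of asserting on code behavior, it asserts that customer behavior after a deploy still looks like customer behavior before it.

```python
def checkout_rate(completed: int, started: int) -> float:
    """Fraction of started checkouts that complete."""
    return completed / started if started else 0.0

def should_roll_back(before_rate: float, after_rate: float,
                     max_relative_drop: float = 0.5) -> bool:
    """Flag a deploy if the checkout rate fell by more than the threshold."""
    if before_rate == 0:
        return False
    return (before_rate - after_rate) / before_rate > max_relative_drop

# A white-on-white checkout button passes every unit test, but it drives
# after_rate toward zero, so this check fires and the deploy is reverted.
```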
Most QA teams are organized around finding reproduction paths for bugs that affect customers. This made sense in eras where successful products tended to be used by a small number of customers. These days, even niche products—or even big enterprise products—tend to have a lot of man-hours logged by end users. And that, in turn, means that rare bugs are actually quite exasperating. For example, consider a bug that happens only one time in a million uses. Traditional QA teams are never going to find a reproduction path for that bug. It will never show up in the lab. But for a product with millions of customers, it’s happening (and it’s being reported to customer service) multiple times a day! Continuous deployment teams are much better able to find and fix these bugs.
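A back-of-the-envelope calculation shows why a one-in-a-million bug is invisible in the lab but routine in production. The usage numbers below are illustrative, not from the text.

```python
# Illustrative figures for a one-in-a-million bug at consumer scale.
failure_rate = 1 / 1_000_000       # bug triggers once per million uses
uses_per_user_per_day = 5
users = 2_000_000

incidents_per_day = users * uses_per_user_per_day * failure_rate
# Roughly 10 production incidents per day -- while a QA lab running a few
# thousand checks per release has effectively no chance of ever seeing one.
```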
Continuous deployment tends to drive the batch size of work down to an optimal level, whereas traditional deployment systems tend to drive it up. For more details on this phenomenon, see “Work in Small Batches” (http://www.startuplessonslearned.com/2009/02/work-in-small-batches.html) and the section on the “batch size death spiral” in “The Principles of Product Development Flow” (http://www.startuplessonslearned.com/2009/07/principles-of-product-development-flow.html).
I want to mention one last benefit of continuous deployment: morale. At a recent talk, an audience member asked me about the impact of continuous deployment on morale. This manager was worried that moving his engineers to a more rapid release cycle would stress them out, making them feel like they were always firefighting and releasing, and never had time for “real work.” As luck would have it, one of IMVU’s engineers happened to be in the audience at the time. He provided a better answer than I ever could. He explained that by reducing the overhead of doing a release, each engineer gets to work to his own release schedule. That means that as soon as an engineer is ready to deploy, he can. So, even if it’s midnight, if your feature is ready to go, you can check in, deploy, and start talking to customers about it right away. No extra approvals, meetings, or coordination is required. Just you, your code, and your customers. It’s pretty satisfying.