What is DevOps (yet again)?

Empathy, communication, and collaboration across organizational boundaries.

By Mike Loukides
February 3, 2015
Electrical substation in the Cherepovo neighbourhood of Daugavpils (source: mihalich.barak)

I might try to define DevOps as the movement that doesn’t want to be defined. Or as the movement that wants to evade the inevitable cargo-culting that goes with most technical movements. Or the non-movement that’s resisting becoming a movement. I’ve written enough about “what is DevOps” that I should probably be given an honorary doctorate in DevOps Studies.

Baron Schwartz (among others) thinks it’s high time to have a definition, and that only a definition will save DevOps from an identity crisis. Without a definition, it’s subject to the whims of individual interest groups, and ultimately might become a movement that’s defined by nothing more than the desire to “not be like them.” Dave Zwieback (among others) says that the lack of a definition is more of a blessing than a curse, because it “continues to be an open conversation about making our organizations better.” Both have good points. Is it possible to frame DevOps in a way that preserves the openness of the conversation, while giving it some definition? I think so.

DevOps started as an attempt to think long and hard about the realities of running a modern web site, a problem that has only gotten more difficult over the years. How do we build and maintain critical sites that are increasingly complex, have stringent requirements for performance and uptime, and support thousands or millions of users? How do we avoid the “throw it over the wall” mentality, in which an operations team gets the fallout of the development teams’ bugs? How do we involve developers in maintenance without compromising their ability to release new software?

We’re still learning what the right questions are. We’re figuring out that all systems are complex systems, that understanding what happened when something goes wrong is more important than figuring out who to blame, and that collaboration between different branches of an organization almost always beats isolation into silos. And we’re discovering that the answers have ramifications that extend far beyond web operations.

The journey to 2015

Back in the early days of the web, web sites were hosted on single servers, located in your machine room (or your bedroom closet). Now, that’s only the case when you’re just starting out, and maybe not even then. But as soon as you go to your second server, as soon as you move to a hosting provider, or start up your first AWS instance, you have a nest of problems that the webmasters of the 90s never dreamed of. You’re dealing with complex systems that are only partly under your control and that can fail in very complex, non-deterministic, confusing ways. How do you deal with that?

Software development wasn’t a new art, but web developers soon found out that the methods and best practices that worked for previous generations of software weren’t helpful. We used to release software once a year, if that. Make a golden master, ship it to a CD pressing house, and be done with it. When you have a million CDs sitting in your warehouse, you’re not going to use phrases like Release Early, Release Often. But that’s not how software has worked for the past decade or so. With Gmail, Google pioneered the continuous beta process of frequent software releases. If the entire application runs in the user’s browser, there’s no reason you can’t deliver an improved version with every new HTTP connection. Now, continuous beta isn’t an option, it’s a necessity. Web sites are complex applications; if there’s a bug or a misfeature, there’s no need to wait until next year’s release to fix it. And there’s every reason not to wait: that web site exists for a reason, and if it’s broken, it needs to be fixed. We’ve discovered that we can release new versions many times a day; rather than making the process worse and more chaotic, it makes it simpler. Improving or fixing one feature at a time, in an environment where you can easily roll back to the previous version, really is more effective than doing annual releases with thousands of unrelated changes and bug-fixes.

How do you manage the development process in an environment like this? You can’t use the age-old waterfall methodology to release software several times a day. Specifications are still important, but you aren’t working to a spec in the same way: you don’t define what you want to do in several months of meetings, then spend the next year or so implementing it. This isn’t to say that the waterfall methodology is no longer important. I’m sure there are situations where collecting requirements up front, writing a spec, and implementing to that spec, then testing it, are appropriate. But that process is almost completely irrelevant to the Web world, and fails more often than it succeeds.

In the past decade, we’ve learned that automation is the key to managing a constantly changing application. Your infrastructure has to become code, not just incantations mumbled by your admins. And your developers have to work with frequent releases in mind: constantly and methodically testing, making releases one minor change at a time, rolling back when necessary, and (above all) doing their share of carrying the pager. If you release software once or twice a year, the release process becomes a kind of folklore that lives in the minds of a few release engineers. But if you’re going to be doing it several times a day, you can’t be doing it all by hand. That’s too error-prone, and too labor-intensive.

These principles apply to web development, but we’re increasingly learning that they apply to the network infrastructure itself. Think it’s just a huge bunch of wires connecting a lot of racks? Those wires aren’t going away, but managing what they’re doing takes software. If you’ve done much work on networks, you’re probably familiar with Cisco’s IOS or Juniper’s Junos, or some other network operating system. They’re extremely flexible, but not amenable to automation. The next generation of networking hardware will implement software-defined networking, and our infrastructure really will be code. Another barrier, this time between developers and network operations, will fall.

That won’t be the end of the story. The practices we’ve learned in DevOps don’t just apply to the technical teams, they apply to the business as a whole. And they apply to old-line “enterprises” that may not even see themselves as having a web product. Patrick Debois has written about the need to “optimize the whole, not the individual silos.” We’re learning that engineering and operations staff need to respect each other. That’s great. Want something more radical? What if engineering and sales respected each other? What if product designers respected their customers? What if upper management respected their employees? One problem with the name “DevOps” is that it’s way too narrow. In Debois’ words, we’re after the “whole value chain.”

Artifacts and substance

DevOps tools and practices? There’s been a lot of talk about tools and practices, but we’re after the whole value chain, not just the tools and practices. Most shops in the DevOps not-a-movement write software that lives in the cloud, and use Puppet, Chef or some other tool for automating configuration. We’re increasingly seeing Docker and containerization play a key role. And there may even be practices. Are blameless post-mortems a “practice”? Yes. Is listening attentively to your colleagues a “practice”? That sounds too obvious to be called a “practice,” though I can imagine reading about “the practice of listening” in a bad management book.

But mistaking tools and practices for the substance of DevOps is the best way to destroy all the value it has. The Agile software movement has demonstrated where a focus on tools and practices leads. Has stand-up meetings? Has Jenkins? Has unit tests? Must be agile. That may excite a pointy-haired boss, but focusing on the artifacts rather than the substance is a great way to make sure that you’re no more agile than the next waterfall. Believe me, I’ve seen it and it’s not pretty. That’s one reason we’d prefer to talk about the Distributed Development Stack (DDS): let’s not let the tools confuse the issue. Waterfall shops will figure out how to use Chef and Docker; they’re probably doing it already. After all, they also build complex distributed systems, and face many of the same problems in operating and maintaining their systems. Tools will come and go, but the problem of optimizing the entire value chain will be with us forever.

We can listen to John Allspaw and Dave Zwieback talk all day about conducting post-mortems (believe me, we can, and you should, too). Much of John’s thinking has its origins outside of the Web world, in the disciplines of Human Factors Engineering and Safety Science. And similarly, much of Velocity co-founder Jesse Robbins’ thinking has its origins in emergency services and Jesse’s experience as a volunteer fireman. If we limit DevOps to devs and ops, we’re limiting the cultural change we can bring about in our organizations, not to mention the improvements we can achieve in efficiency, in reliability, in safety, in customer satisfaction.

Empathy is the way forward

Whenever I think about what DevOps means, I always come back to Jeff Sussna’s formulation: Empathy is the essence of DevOps. (You really owe it to yourself to read all of Jeff’s articles.) You can criticize that as being impossibly vague, but I believe it captures what this phenomenon is about. We’re still learning what it means to empathize. It’s not just sitting down and having a good cry together. It’s about understanding what other parts of the system are trying to say: other people, certainly, but not just the people in your development group, or in your development and operations groups. And once we grasp that, we’re way beyond tools.

We’re thinking about what we really need to do to satisfy our customers. We’re thinking about how to interact creatively and productively with marketing, sales, the warehouse, upper management, even (gasp) accounting. What do our customers really want? What do our operations staff need so they can run our software systems effectively? What do the developers need to create systems that can be managed effortlessly? Where is the pain in the organization, and what do we do about it? If we’re just providing solutions, and not understanding the problem that needs to be solved, from the perspective of the person who has the problem, we’re not doing anything. And if we limit ourselves to devs and ops, well — I suppose we’ll accomplish something, but we’re setting our sights way too low. These ideas are far too important and revolutionary.

I’ve seen arguments that empathy gets in the way of making necessary hard-headed business decisions. Give me a break. When I read about people taking four hours to cancel a Comcast contract, or when I look at some of the horrendous ecommerce sites out there, or look at custom software systems that were built without asking users what was wrong with the old system, my reaction isn’t that we need more hard-headed business decisions about short-term profit. Have you ever dealt with your insurance company? Then you’ve probably run into someone whose job wasn’t to serve you, but to frustrate you. We need empathy, creative thinking and participation with the users we’re trying to serve. That’s not easy, and it may be the hardest hard-headed business decision of all. Abusing your users so you can continue to collect their money is taking the easy way out, and is destroying value, not creating it. It’s great for pumping up executive bonuses, but not so great for creating long-term value.

On top of empathy, we need a lot of humility. It’s all too easy to make it sound like IBM can solve all its organizational issues by having a big group hug, all 400,000 of them. We have some ideas that have proven effective in small and even mid-sized groups, but we haven’t solved all the problems. We’re still figuring out how empathy, respect, promises, and even continuous deployment apply to large enterprises. A lot of DevOps is about eliminating silos, fighting silo-ization; I’ve written about that myself. One of my earliest observations about code performance is that the most important optimizations often occur across carefully architected boundaries: between the inside and outside of a loop, between carefully untangled methods, between classes, between modules or packages. The same principle applies to organizations. It’s easy to optimize within a group or department, but it is both harder and more important to optimize interactions that cross the boundaries between silos.

It’s also obvious that, when your organization grows to much over 50 people, you need some kind of structure. You need some way of separating what you’re doing from what everyone else is doing. If you take my observation about optimization to extremes, your code will end up as one big main() function that’s impossible to understand or debug. It’s been a long time since we’ve thought that was a reasonable way to build code. There’s a value to structure, both in organizations and in code, and it’s worth sacrificing some performance to gain the benefits of structure. Only naive programmers obfuscate their code to achieve 0.1% gains.

The real art in performance tuning is understanding which optimizations are worth the loss in clarity. The same is true for organizations. We may want developers to take the pagers from the ops staff occasionally, but we probably don’t want developers making financial projections for a day. We may want developers to spend more time helping HR understand their real hiring needs, and we may want them to empathize with the hard work that goes into negotiating benefits packages, but we probably don’t want them interpreting state and federal employment regulations. The key is to create barriers between groups without making those barriers impermeable: to support separation but not crystallization. But that’s easy to say, and a lot harder to do.

Infrastructure as code, blameless post-mortems, automate all the things, containerize all the things: all these slogans are great as long as we realize that they’re only slogans. At the bottom, I think DevOps is about doing the right thing in any situation: again, easy to say, not so easy to do. But even a hopelessly vague statement like that allows us to see that the slogans are just shortcuts to thinking you’ve solved the problem, not the solutions themselves.

Continuous learning

As John Allspaw says about post-mortems, it’s not about asking “five whys,” or finding out who to blame. John is sadly correct in his conclusion that the end result of a traditional post-mortem is that the problem was caused by “human error.” People don’t set out to cause failure. A post-mortem should be about learning: what happened, how did you notice that it was happening, what were you trying to achieve when it happened, and more. It’s not as simple as saying “we’ve found a root cause, so let’s add some rules to the system so that can’t happen,” any more than it’s about “we’ve put everything in containers, we have attained DevOps.” When you’re dealing with complex systems, simple-minded solutions are almost always wrong solutions. You don’t get anywhere by applying the slogan-du-jour. You do get somewhere by facing up to difficult problems and finding an intelligent approach that takes all aspects of the situation into account. That solution might even be simple, but it won’t be simple-minded.

A better description of DevOps than “doing the right thing in any situation” might be “optimizing the performance of our organizations, regardless of what they do, or their size.” We’re trying to release software that is more reliable, and operate our systems in ways that are more reliable. We want that software to satisfy our users’ needs. We want that software to perform well: to load and run quickly. But more than that, we want the organization to perform well. We want it to be able to respond quickly to changes in market conditions. We want the organization to satisfy its customers, so they’ll become better customers. And we want the organization to get the most out of its members. These goals extend far beyond the engineering team: they’re goals for the company as a whole, and they apply to the whole organization: sales, HR, facilities, finance.

Most of us are used to command-and-control power structures. Senior management makes strategy and delegates; workers implement the strategy. But note the shift that John introduces into post-mortems: it’s not about blame, a failure of someone at a lower level. John’s assumption, again, is that no one sets out to cause failure. Workers are intelligent people trying to do the right thing under circumstances that can suddenly and unexpectedly become surprisingly difficult. The important question is what happened, and how. That requires listening, and true listening requires empathy, which requires rethinking organizational behavior. We need to go beyond command-and-control, top-down organizations in which the only thing someone at a low level of the pyramid can do is fail. Instead, we need to think about what our staff are capable of. What do they think they can do? What can they promise? These ideas are captured in Mark Burgess’ Promise Theory, and they’re the basis for building a learning, listening and adaptive organization.

In our technical world, we talk a lot about complex distributed systems. But those complex distributed systems built from hardware and software are really only extensions of much more complex distributed systems built by humans. To improve the performance of those systems, we don’t just change software; we have to change culture, we have to change expectations, we have to facilitate communications, we have to start listening to each other and learning. Furthermore, as Debois points out, we have to realize that organizations are constantly in flux, much as our code is. Bottlenecks that we’ve put great effort into solving frequently reappear, either as a result of poor local decision making, or as an unforeseen consequence of something apparently unrelated.

The goals of DevOps are thus much larger than facilitating the deployment and operation of web sites. DevOps extends to all aspects of large, complex organizations. It’s about managing organizational behavior, in the broadest sense. We may need to discard the term DevOps as too tied to its historical origins in the web world. I believe that the name is still useful, though it probably hurts adoption in other parts of the organization and in established enterprises that aren’t web-native. Nor does it help us think outside our boxes when applying the lessons learned in web operations outside of the development and operations groups. Empathy, communication, collaboration across organizational boundaries, optimizing the organization as a whole: these all get to the essence of what we’re trying to achieve, whether we call it DevOps or EntOps or something else.
