A few years ago, I wrote that DevOps is the movement that doesn't want to be defined. That's still true, though being cryptic doesn't help executives who need to understand what's going on in their industries, or who want to make their digital operations more productive. You may already have people in your company who are "doing DevOps," or who want to. What are they doing? What do they want?
Let's start with origins. Back in 2007, a Belgian engineer named Patrick Debois became frustrated by the friction between development and operations teams. As a developer and a member of the Agile community, Debois saw an opportunity to use Agile methodologies to manage the infrastructure process, much as developers manage the development process. He initially described this concept as Agile Infrastructure, but later coined the term DevOps, a portmanteau of development and operations.
Then, in 2008, O'Reilly Media ran the first Velocity conference. Velocity was founded on the same insight: that web developers and web operations teams were often in conflict, yet shared the same goals and the same language. It was an effort to gather the tribe into one room, to talk with each other and share insights. Much of the DevOps movement grew out of the early Velocity conferences, and shares the same goal: breaking down the invisible wall that separates developers from IT operations.
Why break down that wall? Because, as websites became larger and more complex, it was painfully obvious that the wall was hindering, rather than helping. DevOps grew up with the largest sites on the web; the ideas behind it came from sites like Amazon, Google, Flickr, Facebook, and YouTube, and from learning how to deal with these sites as they became monsters. It's based on the hard-earned experience of seeing websites grow from one server, to a dozen, to thousands.
As websites became larger and more complex, the business consequences of downtime, or of poor performance, skyrocketed. The business realities of running these giant sites forced developers and operations to unite. Operations wasn't something you could let the Ops group worry about when you declared your project finished. The boundary between development and operations was blurring. To build sites that perform well, scale to millions of users, and are always up, you can't separate development from operations.
You also can’t build websites that perform well and scale if you’re working with development processes that are more appropriate to shrink-wrapped software. Working at “web speed” demands frequent updates: to fix bugs and improve performance, sure, but also to add new features and stay ahead of the competition. When you’re working at web speed, standing still is moving backwards. These sites were quick to develop the processes that became known as continuous deployment (CD), in which software is deployed to the field frequently—often many times a day. CD also required developers to write software that could be put into production directly; they had to understand how operations works, and how their software fits into an automated deployment pipeline. They couldn’t throw software over a wall and let another team spend time getting it working in the field.
Cloud computing appeared at roughly the same time, and made the boundary between development and operations even more fuzzy. You can't develop for the cloud without understanding operations, and you can't operate software in the cloud without understanding development. You never touch the servers; it’s all software.
So that's how DevOps began: by eliminating the barrier between the people who develop the software and the people who operate it. Push the words together and call it a movement. That simple origin myth hides a lot of complexity, but it will suffice.
Automation and operations at scale
DevOps was the inevitable outcome of building and operating the sites that became the web's giants. It's native to the world of complex distributed systems and the cloud. And to understand DevOps, you need to think about how every site—not just the Internet monsters—has changed over the past two decades.
When I started at O'Reilly, Tim O'Reilly pointed to a PC sitting on a chair and said, "That's oreilly.com." He can't do that anymore. Nobody can. Like any successful business, O'Reilly runs across many servers: at co-location facilities, at Amazon and at Google, at CDNs like Akamai and Fastly. Some bits of it are in-house; most of it isn't. Some of it isn't even ours: we rely on many external services. Our reliability and performance depend as much on these services as on the services we own and control. It's a complex, multi-tiered distributed system.
When oreilly.com was a single computer sitting on a chair, someone could log in on a console terminal and install updates, fix things, or reboot if something went wrong. (If memory serves, that console terminal was sitting on another chair, next to the one with the computer.) Now that oreilly.com is thousands of computers distributed world-wide, you can't just log in and do an upgrade. You can't give a few commands and reload the site.
At this scale, automation isn't an option. It's a requirement. The slogan goes infrastructure as code; the reality is that infrastructure is code. Software has eaten the hardware. You need the ability to install and update software across your whole network. It's no longer a matter of pushing an update to the web server or the database server by hand; it's pushing updates to thousands of servers, and doing it correctly every time. It's no longer taking a new computer out of a box and installing an operating system and application software by hand; it's dealing with physical machines by the rack, or with virtual machines and Amazon instances by the gross. It's ensuring that all of these machines are configured correctly and identically. That can only be done by automating the process. And the way we automate is to write software.
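To make "infrastructure is code" concrete, here is a toy sketch of the desired-state idea behind tools like Chef, Puppet, and Ansible: the configuration is data, and an idempotent convergence function computes what each server needs to match it. The function and data structures are invented for illustration; real tools add transport, ordering, and error handling.

```python
# Desired-state configuration in miniature. The desired versions are data;
# converge() is idempotent: running it against an already-correct server
# produces no changes, so it is safe to apply to every machine, repeatedly.

DESIRED = {"nginx": "1.24.0", "openssl": "3.0.12"}  # hypothetical versions

def converge(server_state: dict, desired: dict) -> dict:
    """Return the package changes needed to bring one server to the desired state."""
    return {pkg: ver for pkg, ver in desired.items()
            if server_state.get(pkg) != ver}

fleet = {
    "web-01": {"nginx": "1.24.0", "openssl": "3.0.12"},  # already correct
    "web-02": {"nginx": "1.22.1", "openssl": "3.0.12"},  # needs an upgrade
    "web-03": {},                                        # fresh instance
}

for host, state in fleet.items():
    changes = converge(state, DESIRED)
    print(host, "->", changes or "no changes")
```

The point is that the same code runs against one server or ten thousand, and running it twice does no harm; that is what makes automated configuration trustworthy at scale.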
System administrators used to be heavily involved with shipping and receiving: receiving new hardware, setting it up, configuring it, with each box likely to have a custom configuration. These days, operations staff may rarely see a computer other than their laptop: the machines that do the work are virtual, they live at Amazon or at Microsoft or at Google or at IBM. The job is no longer “racking and stacking”; it’s writing and managing the infrastructure software that keeps everything going. Yes, physical infrastructure still exists, somewhere. But the code that keeps that infrastructure running is what’s important.
In this new world, developers can't just build new features in the lab, and then expect operations to take over and deploy the new software. Developers also need to understand how the systems are configured and how the software is deployed, because it's being installed across thousands of machines, both virtual and physical. They need to work with the Ops team to build and maintain the deployment process, because that process is now software.
The practice of continuous deployment, an extension of continuous delivery, came out of these seminal websites. CD builds on ideas from the Agile movement, one of whose key ideas was frequent software releases: every week or two. CD pushes “frequent releases” a lot farther; releases can take place many times a day. How many times a day? New software developers at Facebook typically push a change to the site their first day on the job. Given the number of developers Facebook hires, that's a lot of updates—but more than that, developers are trained from day one to push changes to production frequently. Updates at that scale can't happen by hand. Not only is it impractical, it's error-prone, even at scales much smaller than Facebook's. Updates must be tested and pushed to servers automatically. Frequently, updates are sent only to a small number of servers until it's clear that the update is stable, or that it actually improves the user's experience.
Each update represents a single small change to the system. Because each update is only a small, easily understood change, there's minimal risk in deploying it. It's just a single bug fix or enhancement, not a year’s worth of fixes rolled up into a major release. If something goes wrong, it’s easy to roll back. If there’s an A/B testing framework in place, it’s easy to deploy the change to a few servers, and see if performance (by whatever metric is appropriate: raw speed, click-through, user satisfaction) improves. The biggest advantage of frequent, small changes, though, is that they’re small: they’re easily understood, they don’t “break the world.”
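The canary pattern described above can be sketched in a few lines: deploy to a small slice of the fleet, watch a health metric, and either continue the rollout or roll back. Everything here is a stand-in for illustration; a real pipeline would plug deployment tooling and a monitoring system into each step.

```python
# A toy canary rollout. `error_rate` is passed in as a function, standing in
# for a real monitoring query; `deploy` stands in for the real deploy step.

def deploy(hosts, version):
    print(f"deploying {version} to {len(hosts)} host(s)")

def rollout(all_hosts, version, error_rate,
            canary_fraction=0.05, max_error_rate=0.02):
    n = max(1, int(len(all_hosts) * canary_fraction))
    canary, rest = all_hosts[:n], all_hosts[n:]
    deploy(canary, version)
    if error_rate(canary) > max_error_rate:
        deploy(canary, "previous")   # the change is small, so rollback is cheap
        return "rolled back"
    deploy(rest, version)            # canary looks healthy: ship everywhere
    return "deployed"

hosts = [f"web-{i:02d}" for i in range(100)]
print(rollout(hosts, "v1.4.2", error_rate=lambda hs: 0.01))
```

Because each release is one small change, the decision logic stays simple: either the canary metric is healthy and the rollout continues, or it isn't and the previous version goes back out.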
Continuous deployment requires, minimally, a revision control system; tools for continuous integration and testing; and tools for automated configuration and deployment. There are open source and commercial products for all of these functions.
It’s possible to push continuous deployment a lot further. Containers such as Docker go a long way towards minimizing the gap between development systems and production systems; developers can work on their laptop in an environment that’s almost identical to production, and can ship software that has been packaged to fit seamlessly into the production environment. Kubernetes helps to manage large numbers of containers in production. Microservice architectures split complex software systems into groups of independent services, each of which can be developed and deployed separately. Serverless architectures, based on services like AWS Lambda, push the concept of microservices down to individual functions.
You don’t need to push to the coolest, most buzzword-compliant development practices and architectures. But you do need the ability to update your software systems reliably and confidently at a moment’s notice. That’s what continuous deployment is all about.
Performance and reliability
When you're talking about thousands of systems delivering your services, you also need to know that they're running properly. Monitoring is critical to modern operations: How do you know all your servers are running? How do you know they're healthy? How do you know that your network connections are all working? Are your users leaving because they’re tired of waiting for the site to load? If you have 1000 servers and 2 are down, will there be 500 down in 10 minutes? Or can you just start a couple of new AWS instances and consider the problem solved?
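At fleet scale, the interesting question isn't "is this server up?" but "what is the fleet-wide trend, and what should we do about it?" Here is a minimal sketch of that triage logic, with invented names and thresholds; real monitoring systems layer alerting, dashboards, and history on top of exactly this kind of decision.

```python
# Fleet-level health triage: a couple of dead servers is routine (replace
# them); a large fraction failing at once means something systemic is wrong.

def fleet_status(checks: dict, alert_fraction=0.05):
    """checks maps hostname -> True (healthy) / False (failing)."""
    down = [h for h, ok in checks.items() if not ok]
    if len(down) > len(checks) * alert_fraction:
        return ("page", down)      # widespread failure: wake someone up
    elif down:
        return ("replace", down)   # isolated failures: start new instances
    return ("ok", [])

# 1000 servers, 10 of them down: routine replacement, not an emergency.
checks = {f"web-{i:03d}": (i % 100 != 0) for i in range(1000)}
state, down = fleet_status(checks)
print(state, len(down))
```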
A modern site has to be concerned with performance, even when all the systems are running. A groundbreaking talk from Velocity 2009 showed that delays as small as 100 milliseconds—too small to be consciously noticeable—caused measurable numbers of users to click away. Time is money, indeed; and developers and Ops staff need to work together to ensure that their sites deliver good, consistent performance. Customers are even less willing to wait online than in brick-and-mortar stores.
Performance and reliability are issues that DevOps deals with constantly. When you have a complex system that's spread across the globe, you can't measure performance and uptime by logging into your servers by hand. You need automation, but automation isn't enough by itself: it's not enough just to prove that your servers are still online. You also need tools that measure performance as your users see it, and that requires state-of-the-art systems for logging and monitoring.
Capacity planning is critical to keeping your customers satisfied. Can your site handle a 100-fold increase in load on Black Friday? What happens if a new product is unexpectedly successful? Performance and reliability aren’t just a matter of making the site run well under normal conditions; they’re about knowing that your systems can handle peak loads without degraded performance. It isn’t enough to say that the site is usually fine and assume your customers will tolerate it when it’s occasionally bad, because they won’t. They’ll go away—just when you want them most. Capacity planning means knowing that you can handle both normal and abnormal loads: that you can bring additional servers online, and get them running quickly. Cloud computing may simplify the problem, but it certainly doesn’t eliminate it.
Even if you don't think your business is a web business, it has almost certainly grown to a scale where you're running dozens of servers at multiple locations. This is a DevOps world.
Failure is inevitable
Perhaps the most thankless task of any operations group is getting things working again after something has gone wrong. Operations groups have to be concerned with performance and resilience. What happens if network outages occur, or a disk drive suddenly fails? If the database server dies, can you survive?
There are many ways to make sites more resilient. Netflix has pioneered the most intriguing: they advocate a practice called chaos engineering, in which software periodically and randomly "breaks things": it shuts down servers, misconfigures networks, and so on. Chaos engineering acknowledges that outages are a part of normal operation, and that practice is the only way to deal with them. A team that’s aware that something could break at any moment, and that is confident it can fix whatever problems arise, will deal with problems promptly and effectively. Chaos engineering also acknowledges that you can't fix problems that you're not aware of, and that in a complex system, it's difficult to predict trouble areas in advance. Breaking things at random is a more effective way to flush out trouble spots than having meetings to discuss what might fail. While chaos engineering sounds terrifying, it's very difficult to argue with Netflix's success in maintaining a stable, high-performance platform.
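The essence of chaos engineering can be shown in a few lines: kill a random instance and verify the service still responds. This is an in-memory toy in the spirit of Netflix's Chaos Monkey, with invented names; the real discipline runs against production systems, with safeguards and observability around every experiment.

```python
import random

# A toy chaos experiment: break one replica at random and check that the
# service (modeled here as "any replica healthy") survives the failure.

def service_up(fleet):
    return any(fleet.values())

def chaos_round(fleet, rng):
    victim = rng.choice(sorted(fleet))
    fleet[victim] = False            # simulate a sudden instance failure
    survived = service_up(fleet)     # did redundancy absorb the failure?
    fleet[victim] = True             # "repair" it, as recovery automation would
    return victim, survived

rng = random.Random(42)
fleet = {f"app-{i}": True for i in range(5)}
for _ in range(3):
    victim, ok = chaos_round(fleet, rng)
    print(f"killed {victim}: service {'survived' if ok else 'DOWN'}")
```

A fleet with real redundancy passes every round; a service that secretly depends on one special instance fails immediately, which is exactly the kind of hidden fragility the practice is designed to expose.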
The culture of DevOps
There's a continual debate within the DevOps community about whether DevOps is fundamentally about culture, or about processes (like CD) and tools. I'm sympathetic to the idea that DevOps is first and foremost about creating a culture in which developers and Ops staff work with each other with mutual respect. But taking either side of this debate by itself invites trouble. Effective DevOps is a good resource for balancing the roles of culture, methodology, and tools.
Many tools are associated with DevOps. I've mentioned some of them (in a very general way). But one of the biggest mistakes management can make is mistaking the tools for the reality. "We use Chef to automate deployment" doesn't mean you're doing DevOps. All too often, we see organizations that "automate deployment" without changing anything about how they work: they're using the tool, but they're still doing big, complex, multi-feature releases. Or they’re using the tool in production, but not in development. Managers need to avoid "cargo culting" at all costs: adopting the terminology and the tools without changing anything significant.
A less common, but equally fatal, mistake is to swallow uncritically the line that DevOps is about culture, not tools. So gather all your devs and Ops staff around a campfire and sing Kum Bah Yah. DevOps is all about automating everything that can reasonably be automated, and that inevitably involves tooling. You'll see tools for testing, for continuous integration, for automated configuration and deployment, for creating containers, for managing your services, and more—even for randomly breaking things. The tools can't be ignored.
Focus entirely on culture, or entirely on tools, and you'll lose. You need both.
Complex systems fail. Any operations group needs to deal with system failure, and to evaluate what happened after a failure. All too often, postmortem analysis takes the form of assigning blame: figuring out who or what caused the event, and in the process, being blind to the complex nature of what happened.
Blameless postmortems are an important cultural practice for dealing with system failure. The appropriate response to failure isn't figuring out who to blame, finding a single root cause, or putting in place procedures that try to prevent what just happened from happening again. A postmortem is about understanding exactly what happened, piecing together the complex chain of events that led up to the failure, so that the system as a whole can be improved. In modern distributed systems, most outages are a result of a "perfect storm": several events, none of which would be a problem in itself, combine to bring the system down. Analyzing the storm can't happen when people are afraid of being blamed. And without understanding what happened, you can't make the system more resilient: more resistant to failure, and easier to manage correctly.
The ideas behind DevOps have been extended into other fields, so you will often see other words mashed up with Ops. Here are a few; you’ll undoubtedly find more.
NetworkOps
The most natural extension of DevOps is NetworkOps. DevOps started with web servers, and with the mantra "configuration as code." But the network operations staff lived in a different world. The world of network operations was dominated by heavy iron and miles of cable that could only be managed by going to a physical console and typing commands in a proprietary configuration language.
Network infrastructure has become much more manageable in the last decade. There's no reason that network operations can't take advantage of the configuration and deployment tools used by the rest of your staff. Store your network configurations in a revision control system, and automate the process of changing configurations, so it's easy to back out changes that don't work. If your application lives in the cloud, you may have to manage a VPN on Amazon, and the only way to do that is through software.
The development of software-defined networking and network function virtualization takes this trend several steps further. Network configuration can be incorporated directly into the application. You don't manage your network through proprietary configuration languages, such as Cisco’s IOS and Juniper Networks’ JunOS; you call libraries from Python or Java. The network is code; the wires and the big expensive hardware are all secondary. And since the network is code, network operations becomes part of the whole DevOps cycle.
DataOps
Structurally, data science isn't all that different from the web. You have programmers and developers who build software that runs on a complicated infrastructure: many different databases, platforms like Hadoop and Spark, libraries and frameworks ranging from basic statistics to AI.
It isn't surprising, then, that a distinction between data developers and data operations has appeared. It’s still common for a developer to create a small-scale prototype, and hand it over to a data engineer to be turned into a product and deployed at scale. However, we’re seeing data operations integrated into data science groups. There's even a new specialty, called "machine learning engineer," that deals with the transition between prototypes and well-engineered products that run at scale. ML engineering includes tasks that are unique to data science, such as monitoring applications to determine when models need to be updated.
As with DevOps, distinct specialties aren't a problem; problems arise when the specialties become silos that prevent communication.
DevOpsSec
DevOpsSec brings security into the DevOps world. Superficially, it seems that security should go hand-in-hand with slow, careful inspection of code, auditing, and careful and painstaking adherence to requirements. You've probably noticed: that isn't working. There's certainly an argument to be made that security is better served by the ability to make and deploy changes rapidly. There’s also an argument to be made that rapid deployment without security is suicidal. And it’s certainly true that automating deployment is less error-prone, and exposes all the details of your configuration, so nothing is left to chance. If you’re installing patches in response to an incident, you need to be confident that the patches are installed correctly, and don’t leave you more vulnerable than you were before.
There's also a cultural argument for DevOpsSec. In many companies, security teams are isolated from development and operations groups; not only do they not work together, at many companies they dislike each other intensely. Fixing the divide between security, development, and operations staff is the first step towards security that works.
NoOps
A few years ago, some people in the operations community pointed to the possibility of putting all the servers in the cloud, and automating all the administrative tasks, so that an operations group wasn't needed. NoOps: everything would run itself. More recently, we've heard the same thing about "serverless": If we're building serverless architectures, there are no servers, and hence no need for operations. Right?
Wrong. You can (and should) do as much as you reasonably can to automate operations. I don't think anyone would disagree with that. But don't think that you're eliminating Ops as a result. Operations never disappears; it changes shape.
What you're really doing when you automate everything and move it to the cloud isn't eliminating operations. You're increasing the size of the system that can be managed by a fixed-size operations staff. You're giving staff the time they need to work on more serious, long-term problems rather than firefighting. Over the years, we've seen the ratio of computers to sysadmins go from one-to-many (more operators and administrators than computers) to thousands-to-one (many computers managed by a small staff). That's a good thing. Nobody in DevOps disagrees with that. But it doesn't make operations go away; there are still going to be traffic spikes, bugs, network failures, other infrastructure failures, security incidents, and so on.
The serverless computing (a.k.a. "Function as a Service") paradigm presents a slightly different set of issues, but the big picture is the same. When you break applications up into individual functions running in the cloud, what you really have is more servers, rather than fewer. The resulting architecture is likely to be more complex, not less. Yes, the cloud provider can do a lot to make your platform scale automatically, deal with failure, and all of that. But making sure the whole system works and is responsive is your job, and it isn't going away. This isn't to say that serverless isn't important; it is. It just doesn't make operations disappear. At its best, serverless enables a fixed-size team to manage an even larger, more complex application.
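To make "individual functions running in the cloud" concrete, here is a minimal AWS-Lambda-style handler. The `(event, context)` signature is the real Lambda convention for Python; the event shape and field names here are invented for illustration. You write only the function; the provider runs, scales, and retires the instances underneath it.

```python
# A minimal Function-as-a-Service handler. Locally it's just a function;
# in production the platform invokes it, possibly on thousands of
# short-lived instances that you never see or manage directly.

def handler(event, context=None):
    # The platform passes a request event; you return a response.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello, {name}"}

print(handler({"name": "devops"}))
```

Notice what disappeared (provisioning, patching, scaling individual machines) and what didn't: someone still has to know whether the whole mesh of functions is healthy, fast, and cheap to run. That's operations.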
SRE: It's not even called Ops
You'll probably see the term site reliability engineering (SRE). SRE is Google's flavor of DevOps. There are some insignificant squabbles about whether DevOps developed from SRE, or the other way around. Ignore that noise.
Compared to DevOps, SRE is more formal. It's certainly appropriate at Google scale. If you're not at Google scale, it's something to aspire to. It shouldn't surprise anyone that Google has some ideas about making extremely large software systems work. And one of the most important changes in DevOps over the last decade is the fact that Google is now talking openly about how they make their systems work. Even if you're nowhere near Google’s scale, take advantage of their work: learn from them and adapt what you can to your situation.
BizOps
One of many rarely stated themes in DevOps is the necessity of living with constant change. Continuous deployment, cloud infrastructure that’s allocated and deployed at a moment’s notice, microservice architectures: these are all great for the technical teams, but frequently leave the business side behind. How do you get alignment between the organization’s business requirements and its DevOps practice? How do you budget and plan in a world where everything is constantly changing? As Leon Fayer writes, the goal of DevOps isn’t to build cool stuff; it’s to support the business. That’s BizOps.
Even more Ops
I've also seen HROps, MarketingOps, FinanceOps, and several other Ops. These ideas tend to borrow a lot from the Lean Software movement. If there’s a single theme, it’s about helping these groups to work in an environment of constant change. How do you market when tomorrow’s product may be different from today’s? Even in traditional business, the reality of planning is that plans are necessary, but plans never survive their confrontation with reality: change is a constant. Can you turn distrust of change into an advantage by stressing agility and responsiveness? If you're interested, I strongly recommend Lean Enterprise and Lean UX.
How do you get an organization "doing DevOps"? "Doing DevOps" isn't that good a phrase, but there are some answers. Don't go out and hire a bunch of "DevOps specialists." Thinking about DevOps as a distinct specialty misses the point: DevOps is about getting existing groups working together, not about creating a new specialty or a new group. Definitely don't go out and pick new DevOps-friendly office furniture. (I won't give you the link, but that's not a joke.) Seriously: mistakes like these send the message that you don't know what you're looking for, and will drive away the people you really want to hire. Though you may end up with better office decor.
If you want to get started with DevOps, start small. It started as a grassroots movement, so let it be a grassroots movement at your company. Pick a project—perhaps something new, perhaps something that's been a problem in the past—and see what happens when you get the development and the operations groups working together. See what happens when you build pipelines for continuous deployment, use modern tools for monitoring, and get your developers and operations staff talking to one another. A top-down “We are now going to do DevOps across the company” message is likely to fail. Instead, start small, and let everyone see the results. The report DevOps in Practice shows how DevOps was introduced at Nordstrom and the Texas state government.
You don't need to start with the group that's most excited about doing DevOps—that might backfire. It's hard to evaluate your progress if everyone is heavily invested in seeing a new practice succeed. By the same token, you should probably avoid groups that are resistant to change, or that see ideas like DevOps as a fate worse than death. They'll only be convinced when they see results in other groups.
It doesn't matter a whole lot where you start, as long as you start. DevOps isn't about heroes, rock stars, ninjas, or unicorns; it's about regular developers and Ops staff getting out of their silos and working with each other. It’s about automating everything that can reasonably be automated, because doing so reduces the time spent fighting fires and increases the time you can spend improving your products and services. Ultimately, that’s what’s important.