Devops is a way of thinking and a way of working. It is a framework for sharing stories and developing empathy, enabling people and teams to practice their crafts in effective and lasting ways. It is part of the cultural weave that shapes how we work and why. Many people think about devops as specific tools like Chef or Docker, but tools alone are not devops. What makes tools “devops” is the manner of their use, not fundamental characteristics of the tools themselves.
In addition to the tools we use to practice our crafts, an equally important part of our culture is our values, norms, and knowledge. Examining how people work, the technologies we use, how technology influences the way we work, and how people influence technology can help us make intentional decisions about the landscape of our organizations, and of our industry.
Devops is not just another software development methodology. Although it is related to and even influenced by software development methodologies like Agile or XP, and its practices can include software development methods, or features like infrastructure automation and continuous delivery, it is much more than just the sum of these parts. While these concepts are related and may be frequently seen in devops environments, focusing solely on them misses the bigger picture—the cultural and interpersonal aspects that give devops its power.
What does a successful devops culture look like? To demonstrate this, we will examine the intersection of people, process, and tools at Etsy, an online global marketplace for handmade and vintage goods. We chose Etsy as an example not only because they are well-known in the industry for their technical and cultural practices, but also because Ryn’s experiences working there allow us a more detailed look at what this culture looks like from the inside.
A new engineer at Etsy starts her first day with a laptop and a development virtual machine (VM) already set up with the appropriate accounts for access and authorization, the most common GitHub repositories cloned, aliases and shortcuts to relevant tools precreated, and a guide with new hire information and links to other company resources on her desktop. Standardization of tools and practices between teams makes it easier for new people to get up to speed regardless of what team they are joining, but every team also has the flexibility to customize as they see fit.
A current employee will pair with the new employee to walk through what testing and development processes she will use in her day-to-day work. She starts by writing code on her local development VM, which is set up with configuration management to be nearly identical to the live production environment. This development VM is set up to be able to run and test code locally, so she can quickly work and make changes without affecting anyone else’s development work during these early stages.
Etsy engineers can achieve a good degree of confidence that a change is working locally by running a suite of local unit and functional tests. From there, they test changes against the try server, a Jenkins cluster nearly identical to the organization’s production continuous integration (CI) cluster but with the added bonus of not requiring any code to be committed to the master branch yet. When try passes, engineers have an even higher degree of confidence that their changes will not break any tests.
Depending on the size and complexity of the change, this new engineer might choose to open a pull request or more informally ask for a code review from one of her colleagues. This is not mandated for every change, and is often left up to individual discretion—in Etsy’s high-trust, blameless environment, people are given the trust and authority to decide whether a code review is necessary. Newer or more junior employees are given the guidance to help them figure out what changes merit a code review and who should be involved. As a new employee, she has a teammate glance over her changes before she starts deploying them.
When local and try tests have passed, the developer then joins what Etsy calls a push queue to deploy her changes to production. This queue system uses Internet Relay Chat (IRC) and an IRC bot to coordinate deploys when multiple developers are pushing changes at once. When it is her turn, she pushes her commits to the master branch of the repository she is working in and uses Deployinator to deploy these changes to QA. This automatically starts builds on the QA server as well as running the full CI test suite.
After the builds and tests have completed successfully, she does a quick manual check of the QA version of the site and its logs to look for any problems that weren’t caught by the automated tests. From there, she uses the same Deployinator process to deploy her code to production and make sure that the tests and logs look good there as well. If something does break that the tests don’t catch, there are plenty of dashboards of graphs to watch, and Nagios monitoring checks to alert people to problems. In addition, many teams have their own Nagios checks that go to their own on-call rotations, encouraging everyone to share the responsibilities of keeping everything running smoothly. And if something doesn’t run so smoothly, people work together to help fix any issues, and blameless postmortems mean that people learn from their mistakes without being chastised for making them.
This process is so streamlined that it takes only around 10 minutes on average from start to finish, and Etsy engineering as a whole deploys around 60 times a day. Documentation is available for anyone interested to peruse, and every engineer pushes code to production on their first day, guided by a current team member to help familiarize them with the process. Even nonengineers are encouraged to participate by way of the First Push Program, pairing with engineers to deploy a small change such as adding their photo to the staff page on the website. In addition to being used for regular software development, the try and Deployinator process works so well that it is used for nearly everything that can be deployed, from the tools that developers use to build virtual machines to Kibana dashboards for searching logs, from Nagios monitoring checks to the Deployinator tool itself.
This story of Etsy today is in stark contrast to how things were several years ago with a less transparent and more error-prone deployment process that took close to four hours. Developers had their own blade servers to work on, rather than virtual machines, but the blade servers weren’t powerful enough to run the automated test suites to completion. Tests that were run in the staging environment took a couple of hours to complete, and even then they were flaky enough to make their results less than useful.
Teams within the engineering organization were siloed. There were a lot of developers throwing code over the metaphorical wall to ops, who were solely responsible for deployments and monitoring and thus tended to be incredibly change-averse. Developers wrote code, executing handcrafted shell scripts to create a new SVN branch which would be deployed using
svn merge—not known for being the easiest merging tool to work with—to merge in all the developers’ changes to this deploy branch. Developers would tell the one operations engineer who had permissions to deploy which branch to use. This would begin the painstaking, multihour deployment process (see Figure 1-1). Because the process was so painful, it happened only once every two or three weeks.
People became fed up with this process. They realized something had to change. It would be hard for the deployment situation to get any worse. The organization was made up of smart and talented people with the motivation and now frustration to solve this problem. They had buy-in from the CEO and CTO down, which was key to having resources available to make change.
The keys to the deployment kingdom were shared from the one operations engineer to two developers, who were given the time and go-ahead to hack on this process as much as they needed. They say when you have a hammer, everything looks like a nail, and when you have a web application developer, everything looks like it needs a web app—and so the first Deployinator was born (Figure 1-2). At first, it was a web wrapper around the existing shell scripts, but over time, more people began to work and improve it. While the underlying mechanisms doing the heavy lifting changed, the overall interface stayed largely the same.
It became evident to everyone that empowering these employees to create tools to improve their jobs made it a lot easier for people to do said jobs. Deploys went from being in people’s way to helping them accomplish the goal of getting features in front of users. Tests went from being flaky wastes of time to helping catch bugs. Logs, graphs, and alerts made it possible for everyone, not just a select few individuals, to see the impact of their work.
The key takeaway from the stories of all these tools isn’t so much the specifics of the tools themselves, but the fact that someone realized that they needed to be built, and was also given the time and resources to build them.
Buy-in from management, freedom to experiment and work on non-customer-facing code, and trust across various teams all allowed these tools to be developed and along with them, a culture of tooling, sharing, and collaboration.
These factors made Etsy the so-called devops unicorn (they prefer to call themselves just a sparkly horse) they are known as today, and maintaining that culture is prioritized at all levels of the company. This example embodies our definition of effective devops—an organization that embraces cultural change to affect how individuals think about work, value all the different roles that individuals have, accelerate business value, and measure the effects of the change. It was these principles of devops that helped take Etsy from a state of frustration and silos to industry-renowned collaboration and tool builders. While the details may differ, we have seen these guiding principles in success stories throughout the industry, and in sharing them we provide a guide for organizations looking to make a similar transformation.
Stephen Toulmin, Human Understanding
Effective DevOps contains case studies and stories from both teams and individuals. When looking at the devops books that already existed, we found that there were fewer direct real-world experiences for people to draw from; so many stories focus on a specific tool or an abstract cultural practice. It is one thing to talk about how something should work in theory, but there is often a great deal of difference between how something should work in theory and how things turn out in practice. We want to share the actual stories of these practices, what did and didn’t end up working, and the thought processes behind the decisions that were made, so that people have as much information as possible as they undertake their own devops journey.
My devops story begins around the same time as the devops movement itself. I made an accidental career change into operations shortly after the idea of devops was first formalized and the first devopsdays conference events had been held. As luck would have it, I found myself working as a one-person operations team at a small startup in the ecommerce space, and discovered I loved operations work. Even though I was a one-person team for so long, the idea of devops immediately made sense to me—it seemed to be a common sense, no-bullshit approach to working more effectively with other parts of an organization outside of your own team. At the time, I was the grumpy system administrator who spent their days hiding in the data center. I was the only person on-call, and was so busy fighting the fires that I had inherited that I had little to no idea what the developers (or anyone else, for that matter) were working on. So the ideas of sharing responsibilities and information and breaking down barriers between teams really spoke to me.
Some organizations are much more open to change and new ideas than others, and in addition to being rather change-resistant, this startup wasn’t keen on listening to anything a very junior system administrator had to say. “You’re not even a real sysadmin,” they said as a way of dismissing my ideas. There also wasn’t enough budget to even buy me a couple of books (I bought Tom Limoncelli’s The Practice of System and Network Administration and Time Management for System Administrators on my own dime, and they were worth every penny), let alone send me to LISA or Velocity, and devopsdays New York wouldn’t be started for another couple of years.
Luckily, I started to find the devops community online, and being able to talk to and learn from people who shared my passions for operations and learning and working together revitalized me. James Turnbull, now CTO of Kickstarter but who was at the time working at Puppet, found me on Twitter, struck up a conversation with me, and sent me a copy of his excellent Pro Puppet book at a time when I was struggling with having inherited 200 snowflake servers with not even a bash script to manage them. That small action introduced me to a growing and thriving community of people, and gave me hope that I could one day be a part of that community.
As Jennifer will note in her story as well, it’s hard to effect change when you’re burnt out, and after a year or so of trying to help change and improve my current company and being shut down for my lack of experience (which was growing by the day) every time, I also made the choice to move on. I kept learning and growing my skills, but still didn’t feel totally aligned with the places I ended up working. I still seemed to be fighting against my coworkers and organizations, rather than working with them.
In January 2013, I went to the first devopsdays New York, soaking up all the talks and listening to as much of the hallway track as I could, even if I didn’t feel like I was experienced enough to add anything to the conversations. I lived vicariously through following #VelocityConf on Twitter. In October of that year, I gave a lightning talk at the second devopsdays New York, and through that met Mike Rembetsy, one of the co-organizers of that conference. He told me I should come work for Etsy, but after years of feeling like an impostor in my own field, I thought he was joking. I’d followed Code as Craft and the Etsy operations team online since I first discovered the operations and devops communities, but I didn’t think I was good enough to join them.
I’ve never been more happy to be proven so wrong. My career in operations has taken me on a journey through several different organizational structures and ways that development, operations, and sometimes even “devops” teams worked together. Having worked at companies as small as a 25-person startup and as large as a decades-old enterprise organization with hundreds of thousands of employees, I’ve seen many ways of developing and delivering software and systems—some more effective than others.
Having spent time being the one person on-call 24/7 for the better part of a year and in other less-than-ideal workplace scenarios, I want to share the techniques and methods that have worked best for both me and my teams over the years to help reduce the number of people who might also find themselves being the single point of failure for part of their organization. A large part of my motivation for writing this book was being able to tell these stories—both ones that I’ve lived personally and ones told to me by others—so that we can share, learn, and grow as a community. Community helped me get to where I am today, and this book is just one way I have of giving back.
In 2007, I was contacted by Yahoo management about a position that was “a little dev” and “a little ops”—a Senior Service Engineer position building out a multitenanted, hosted, distributed, and geographically replicated key/value data store called Sherpa.
As a service engineer at Yahoo, I honed my skills in programming, operations, and project management. I worked alongside the development and quality assurance teams building Sherpa, coordinating efforts with the data center, network, security, and storage teams. In 2009, as the murmurings of devops trickled into Yahoo, I discounted its value, as I already was a devops!
Fast-forward to the summer of 2011, and Jeff Park took over the leadership of my team. He helped grow the team so that we had multiple people in Service Engineering in the United States as well as in India. It wasn’t enough. Jeff had concerns about me as an employee working nonstop and almost single-handedly keeping the service up. He was also concerned about the business and wanted to build resiliency into the support model through redundant staff support. In December, he told me to take a real vacation—to not read my email or take phone calls but to disconnect.
I came back from vacation to a degraded service. So many small issues that I had found over the years impacted the overall event, making it more difficult to debug. I felt like a complete failure, even though my last-minute visualization had been critical to identifying and monitoring the issue.
Jeff took me aside and said that he knew there was a high risk that things would break while I was out and additional risk of issues resulting from the team’s historical complete reliance on me. My heroics were disguising the failures inherent in the system.
He felt that sometimes short-term setbacks are an OK thing, if you later use them as a lesson to right things over the long term. If things broke, it would help prioritize the criticality of further sharing, documenting, and distributing my knowledge and expertise to the business. Ultimately, that would lead to more stability and a better overall outcome for the organization and individuals on the team.
That event united the Sherpa team as we tried to repair the service and understand what happened. We divided into cross-functional teams to address the different components of the problem: failure handlers, communication group, tools, monitoring, and cleanup crew. Key individuals from management were available at all times and prepared to make hard decisions. These hard decisions helped us limit the overall length of the outage.
Bob Sutton, Stanford Management Instructor
A key takeaway from this event for me was the value of failure. We couldn’t be afraid to fail, and we needed to learn from the failure. We had ongoing meetings to resolve the operational issues that had been highlighted in the event. We continued to remediate outages as a cross-discipline team rather than limiting those activities to the Service Engineering team. We promoted discussions between our consumers and providers to better understand the weak points in the system.
Having spent over 10 years building work practices based on operations’ tribal cultures of long hours, isolation, and avoiding system failures, how was I going to evoke the change I needed?
I was ready for devops. For me, the value of devops wasn’t the mantra of “devs do X and ops do Y, or dev versus ops”, but the shared stories, solving problems in a collaborative manner across the industry, and strengthening community. From the open spaces to the collaborative hacking, a new support system was emerging that strengthened the foundations of sustainable workplace practices and cultivated relationships between people.
Collaborating with Ryn on this book has strengthened my understanding of devops. Being able to share the working strategies and techniques from around the world to help improve and create sustainable work practices has been an incredible journey. It doesn’t end with the final words in this book.
We are all gaining so much experience each and every day based on our different, diverse perspectives. Whether you are at the beginning of your career, knee-deep in cultural transformation, or about to change roles and responsibilities, your experiences can inform and educate others. I look forward to hearing and amplifying your stories so that together we grow as a community and collectively learn from our failures and successes.
We have selected a variety of case studies to help illustrate the different ways in which a culture of effective devops can manifest. The goal of these stories is not to provide templates that can be followed exactly; blindly copying another organization or individual discounts all of the circumstances and reasoning that went into the choices they made.
These stories are illustrations or guides. Our hope is that you read these stories and see reflections of your experiences, perhaps as they are today, but maybe as how they could be in the future. We included stories from a variety of sources, both formal case studies and informal personal stories. While some stories are from organizations that are more well known, we deliberately chose to include stories from lesser-known sources as well, to showcase the variety of devops narratives that exist.
When you read these studies, consider not only the choices that were made and their outcomes, but also their circumstances and situations. What similarities can you see between their circumstances and your own, and what are the key differences? If you made the same choice in your own organization, what factors that are unique to your workplace would change the outcome? By reading and understanding these stories, we hope that you will be able to see their underlying themes, and start applying them to your own devops narrative.
Learning shouldn’t stop at these shared stories. Experiment with new processes, tools, techniques, and ideas. Measure your progress, and most importantly, understand your reasons why. Once you start realizing what does and doesn’t work with the things you try, you can begin to do more sophisticated experiments.