Chapter 1. Introduction: The Shift to the Cloud
Thomas Edison, as we all know, is credited not only with inventing the light bulb but also with commoditizing electricity. Many other scientists also contributed, both before and after Edison, but it was Edison’s assistant, Samuel Insull, who built a business model that would commoditize electricity and make it available as a service. Insull came up with the concept of the power grid, which enabled economies of scale that made electricity available to factories and other businesses as a utility.1 Today, we pay for electricity based on consumption: the more you use, the more you pay; the less you use, the less you pay. If this sounds like your cloud experience, read on.
Before electricity was a public utility, companies had to create and manage their own electricity, with waterwheel mills or hydraulic generators located close to their assets that needed power. These were incredibly expensive, closed systems. Only the richest could produce enough power to run product assembly lines, warehouses, offices, and sweatshops. The operating model was very simple back then. Such firms usually employed a VP of Electricity, who managed a staff of skilled electricians and generator operators. This group owned the power, and all other parts of the company were consumers of that power and at the mercy of permissions from the power provider.
The introduction of the power grid changed all this. Now any company of any size could access the same power grid at the same cost, without the overhead of purchasing and managing their own power generators. This was a game changer. Now, power (computing) was available for more purposes and was used for multiple applications, not just for a single purpose. Corporations could now automate assembly lines, appliances, and numerous other electrical devices at a fraction of the cost.
New inventions spawned everywhere, disrupting industries and paving the way for new business models and products. While investors and business owners embraced these innovations, workers were not always as excited—after all, when was the last time you hired a milkman, used an ice delivery service for refrigeration, or saw a lamplighter at dusk? Electricity displaced these workers. Nor were the VPs of Electricity and their domain experts all that excited to see electricity become available simply by plugging a cord into an outlet in the wall. What was going to happen to their jobs?
It is easy for us today to look at the invention of the power grid as a no-brainer. Of course, companies would quickly adopt and embrace power as a utility instead of maintaining their own power departments and equipment. But the shift to the grid did not happen overnight. Large companies had made big investments in legacy technologies; they needed time and money to transition to the new model. Of course, VPs of Electricity fought tooth and nail against the new model—how could they give up control of something so critical to a third party? Over time, it no longer made sense to build and maintain generators. Companies migrated from the old do-it-yourself (DIY) electricity model to the pay-as-you-go electricity model, and their operating models changed to reflect new business processes.2
Even once everyone was on board, migration was a long, hard road. Many companies focused their efforts entirely on technical aspects. But the legacy method of owning and operating electricity came with years of best practices, along with processes that enforced those best practices. The old operating model put the electricity department at the center of the universe: all other departments had to go through it to request power. Companies brought those old processes and that operating model with them into the new era of electricity. Even though power consumers could now access power instantly, they were still forced to go through the electricity department for permission, fill out the same forms, attend the same review meetings, and satisfy the same checklists, because “that’s how we do it here.” So even though power was widely available, it still took a long time for power consumers to get the access they needed. This approach made it hard for companies to realize the benefits and business advantages that the power grid should have brought them.
If any of this sounds like a cloud transformation that you have been a part of or have heard about, there’s good reason for that. A century later, as we move from running on physical infrastructure to consuming infrastructure as a service, many cloud transformations are stalling for similar reasons. Too many companies focus their cloud adoption strategy solely on the technology, with little to no consideration for redesigning their legacy operating models and business processes. They, too, find it difficult to realize the advantages of cloud as a utility.
That’s why I wrote this book: I want your cloud adoption journey to be different. I’ll show you how to calibrate not only your tech but your people and your business processes to take maximum advantage of the cloud and move past outdated ways of doing things, so your cloud transformation can begin creating value for you much faster, and with fewer roadblocks.
Specialization and Silos of Knowledge
The shift from private power generation to the grid has a lot in common with the shift from the mainframe computing of the 1960s and 1970s to today’s cloud computing. For several decades, mainframes were the sole source of computing infrastructure for most companies. With the birth of minicomputers, servers, and personal computers, however, work could be distributed on smaller machines at a fraction of the cost. New skills and domain expertise were required to manage all the new hardware and new operating systems. Networks and storage devices became the norm. Organizations experimented with different, more horizontal operating models to contain the sprawl of technology. They sought to manage the risks of change by adding review boards and gates, centers of excellence, and other processes. The mainframe teams were no longer the center of the universe; no longer did all standards, permissions, and change management go through them. Each technology domain now had its own standards, access controls, and change-control processes.
One result of this change in IT system complexity was that domain knowledge became walled off into “silos” of specialization. Each team was measured on its own goals and objectives, which often conflicted with those of the other groups that consumed or supplied services to that group. Each team thus built processes around inputs (requests for service) and outputs (delivery of service) to its organization, in the hope of having more control over achieving its goals and objectives. For example, the security team would create request forms, review processes, and best practices to which other departments would have to adhere; if there was a problem, the security team could point to its process as proof of its due diligence. The governance team had its own processes, as did the change management team, the project management team, the quality assurance team, the operations team, and so on.
This model served its purpose well when software was built as large, monolithic applications that were deployed on physical infrastructure and planned in quarterly or biannual release cycles. As inefficient as it was for a development team to navigate through all of the mostly manual processes across the numerous silos, release cycles were long enough to allow for these inefficiencies.
Today, however, speed to market is more of a competitive advantage than ever; customers expect new features and fixes much more frequently than before. Companies that stay mired in the ways of the past risk becoming the next Blockbuster Video, their popularity vanishing into obscurity as the world moves on without them. The 2019 State of DevOps Report concluded that top-performing teams that employed modern DevOps best practices deployed 208 times more frequently, had lead times from commit to deploy that were 106 times faster, resolved incidents 2,604 times faster, and had a change failure rate 7 times lower than teams that did not embrace DevOps.
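To make those measures less abstract, here is a minimal sketch of how a team might compute three of them from its own deployment history. The `Deployment` record shape, the function names, and the sample data are all hypothetical; this is not the report's methodology, only an illustration of what each number means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production
    caused_failure: bool    # whether it led to an incident or rollback


def deployments_per_day(deploys, period_days):
    """Deployment frequency: deployments divided by days observed."""
    return len(deploys) / period_days


def median_lead_time(deploys):
    """Lead time: median elapsed time from commit to deploy."""
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deploys):
    """Share of deployments that caused a failure in production."""
    return sum(d.caused_failure for d in deploys) / len(deploys)


# Made-up sample: four deployments over a two-day window.
start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(start, start + timedelta(hours=1), False),
    Deployment(start + timedelta(hours=2), start + timedelta(hours=5), False),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, hours=2), True),
    Deployment(start + timedelta(days=1, hours=3), start + timedelta(days=1, hours=7), False),
]

print(deployments_per_day(deploys, period_days=2))
print(median_lead_time(deploys))
print(change_failure_rate(deploys))
```

Tracking even these rough numbers over time gives a team a baseline to compare against as it changes how it works.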
Cloud computing can enable the agility that so many companies seek, but cloud technology by itself is not enough. To keep up and to create value from that agility, companies must move away from the “VP of Electricity” model of doing business and transform to new ways of working.
Today’s chief information officers (CIOs) have an extremely tough job: they have to balance “keeping the lights on” (that is, keeping the money flowing) with improving agility and quality, and investing in new technologies. Pick up any trade magazine and you will see success stories of large companies adopting emerging technologies such as cloud computing, machine learning, artificial intelligence, blockchain, DevOps, Scaled Agile Framework (SAFe), and site reliability engineering (SRE). Each new trend is designed to solve a specific set of problems, but it takes a combination of tools, trends, methodologies, and best practices to deliver cloud computing at scale.
Even as CIOs embrace cloud computing and adopt many of these new technologies and methodologies, they must work within the policies set forth by their governance, risk, and compliance (GRC) team and their chief information security officer (CISO), neither of which is known for welcoming change in most companies. The GRC and CISO have a strong incentive to make sure the company never ends up on the front page of the Wall Street Journal for a breach or system failure. At the same time, the CIO is asked to deliver more value faster. These are competing priorities, and to fulfill them, many organizations are shifting traditionally domain-specific functions like testing, security, and operations to software engineering teams and even business units.
The challenge this presents is that many engineers are not sufficiently skilled to take on these new roles effectively. It only takes one incident—say, a server with an open port to the web—to cause a CISO to lock everything down, to the point where nobody can get timely work done in the cloud. When domain expertise shifts without the company rethinking existing organizational structures, roles, responsibilities, and processes, the end result is usually undesirable—and sometimes even catastrophic.
Patterns and Antipatterns in Cloud Adoption
To embrace the cloud and create the capabilities to build and run software in it at scale, IT leaders need to step back and redesign their organizations around the cloud. We must rethink the entire software-development value stream, from ideation to ongoing production.
No two cloud transformations are the same, but the patterns for success and the antipatterns of failure are very common. Companies that succeed in the cloud do so after many tough lessons. Nobody gets it right at the beginning. But if you start your transformation expecting some bumps and bruises along the way, you can get off the sidelines and start making progress. Your culture must embrace transparency and continuous learning, and you should expect to adjust and improve constantly.
At tech conferences like AWS re:Invent, Google Cloud Next, or DevOps Enterprise Summit, you’ll hear plenty of success stories. Those who haven’t achieved that level of success can get disheartened because it can seem like all the other companies are getting it right. Don’t be fooled: most success stories represent a product line or business unit within a very large organization, not the entire organization. Other parts of their organization may still be in the very early stages. Keep your chin up. This book will share lessons about what to do and, more importantly, what not to do as you embark on your cloud journey.
What’s more important than getting it right at the beginning? Actually starting. Too many organizations get so caught up in trying to create the perfect low-risk strategy, changing CIOs and consulting partners constantly, that they never actually begin doing the work. They have nothing more than years of strategy documents and PowerPoint decks to show for their efforts, while their competitors keep advancing across the cloud maturity curve.
Organizations that get stuck at this stage tend to see the cloud not as a transformation, but as a technology project. Some companies are so conservative that they put too many restrictions on moving forward with any significant effort in the cloud. This might be more of a failure than moving to the cloud and running into problems with availability and resiliency. At least with the latter, you’re gaining experience and increasing your maturity.
When companies don’t recognize the need to transform themselves and to build, operate, and think about software differently, they take their old business processes, tooling, and operating model with them to the cloud—which almost always results in failure.
I’ve been consulting on cloud adoption since 2013, and I’ve seen just about every customer request you can imagine, from companies at all levels of cloud maturity. To capture this variation, I created the maturity curve in Figure 1-1. What this image shows is that when most organizations start their cloud journey, they focus on the ROI of moving to the cloud. At this point early in their journey, they think of the cloud in the same context as the datacenter: they’re thinking about servers instead of services. The value they can get from this mindset is low in comparison to the value that can be achieved in the cloud. After gaining experience building and running applications in the cloud, they start to move up the stack and leverage platform as a service (PaaS) solutions or fully managed services from the cloud providers, like database as a service. This allows them to achieve better speed to market and more operational efficiencies. As they continue to move up the stack and start embracing cloud native and serverless architecture concepts, they start creating business value at high speed. At this level of maturity, the full promise of cloud can be realized. The problem is, very few get past the ROI analysis and infrastructure as a service (IaaS) mindset to come close to achieving the desired ROI.
When I first started, most of my clients requested either an overall cloud strategy or an analysis of total cost of ownership (TCO) or return on investment (ROI) for a cloud initiative. At the time, convincing their CEOs and boards that cloud computing was the way forward was a hard sell for IT leaders. About 80% of the requests focused on private cloud, while only 20% were for the public cloud, almost exclusively Amazon Web Services (AWS). In November 2013, at its annual re:Invent conference, AWS announced a wide variety of new enterprise-grade security features. Almost immediately, my phone began ringing off the hook with clients looking for advice on public cloud implementations. A year later, those clients’ work requests had completely flipped, with over 80% for public cloud and 20% for private cloud.
As public cloud adoption increased, companies moved to the cloud or built new workloads in the cloud much faster than they had traditionally deployed software. Two common antipatterns emerged.
The Wild West
Developers, business units, and product teams now had access to on-demand infrastructure, and they leveraged it to get their products out the door faster than ever. They had no guidelines or best practices, and development teams took on responsibilities they’d never had before. Rather than developing a systematic approach and implementing it across the organization, though, many companies simply left cloud decisions to individual parts of the organization: a lawless, “Wild West” approach.
Here is a tale of two companies. Alpha Enterprises (as I’ll call it) had five business units (BUs), each with its own development and operations teams. The centralized IT team had always provided infrastructure services to the BUs, which were extremely dissatisfied with IT’s long lead times and lackluster customer service. The BUs looked at cloud computing as an opportunity to divorce themselves from IT and speed up their delivery times. They all had early successes deploying their first application or two in the cloud. But as they added more applications, they were woefully unprepared to support them. Customers started experiencing lower levels of reliability than they were accustomed to.
Then, one day, the dangerous Heartbleed bug was discovered.3 The security and operating-system teams scrambled to patch impacted systems across the organization—but they had no visibility into the exposure of the cloud-based systems the BUs had built. It took several weeks for the security team to access and fully patch the vulnerability in those systems. Months later, security performed an assessment and found two more systems that had never been patched.
BetaCorp, on the other hand, had a central IT team that built and managed all of its approved operating systems. The BUs leveraged a standard build process that pulled the latest approved operating system from the central team’s repository. When the bug was discovered, the central team updated its operating-system images and published the new version. The BUs simply redeployed their applications, which picked up the latest patched version of the operating system, and the vulnerability was eliminated that same day across all of BetaCorp’s cloud applications.
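The mechanism behind BetaCorp's same-day fix can be sketched in a few lines. This is a simplified illustration, not BetaCorp's actual tooling: `ImageRepository`, `OsImage`, `deploy`, and the image and application names are all hypothetical. The only real-world identifier is CVE-2014-0160, the CVE number assigned to Heartbleed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OsImage:
    """An approved operating-system image (hypothetical shape)."""
    name: str
    version: int
    patched_cves: frozenset = frozenset()


@dataclass
class ImageRepository:
    """The central team's catalog of approved images (hypothetical)."""
    images: list = field(default_factory=list)

    def publish(self, image):
        self.images.append(image)

    def latest(self, name):
        candidates = [i for i in self.images if i.name == name]
        if not candidates:
            raise LookupError(f"no approved image named {name!r}")
        return max(candidates, key=lambda i: i.version)


def deploy(app_name, repo, image_name="base-linux"):
    """Standard build step: resolve the newest approved image at deploy time."""
    image = repo.latest(image_name)
    return {
        "app": app_name,
        "image_version": image.version,
        "patched_cves": image.patched_cves,
    }


repo = ImageRepository()
repo.publish(OsImage("base-linux", 1))
before = deploy("storefront", repo)  # runs on the unpatched image

# The central team publishes a patched image; the BU only redeploys.
repo.publish(OsImage("base-linux", 2, frozenset({"CVE-2014-0160"})))
after = deploy("storefront", repo)

print(before["image_version"], after["image_version"])
```

The point of the design is that no business unit pins or maintains its own operating system: the build always asks the central repository for the newest approved version, so a redeploy is enough to pick up the fix.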
Part of the problem at Alpha Enterprises, and companies like it, is that each BU is “reinventing the wheel”: researching, buying, and implementing its favorite third-party tools for logging, monitoring, and security. They each take a different approach to designing and securing the environment. More than a few also implement their own continuous integration/continuous delivery (CI/CD) toolchains with very different processes, resulting in a patchwork of tools, vendors, and workflows throughout the organization.
This has both positive and negative consequences. Companies like Alpha Enterprises deliver value to their customers faster than ever before—but often expose themselves to more security and governance risks than before, as well as deliver less resilient products. This lack of rigor and governance makes production environments unpredictable and unmanageable.
Command and Control
The opposite of the freewheeling “Wild West” antipattern was a military-style, top-down, command-and-control approach. In these companies, groups that were highly motivated to keep things in line—such as management, infrastructure, security, and GRC teams—put the brakes on public cloud access. They built heavily locked-down cloud services and processes that made developing software in the cloud cumbersome. These processes were often decades old, designed during the period when deployments occurred two or three times a year and all infrastructure consisted of physical machines owned by a separate team.
Let’s look at another example. A well-established healthcare company I’ll call Medical Matters acquired an up-and-coming startup, CloudClaims. CloudClaims had created a cloud-based claims processing application that automated the age-old paper claims processes that were still standard in the industry. Instead of taking weeks, CloudClaims provided same-day claims completion. When Medical Matters’ security and risk teams assessed the new technology their company had acquired, they were appalled to find out that the same team that built the code was deploying it into production. They took that responsibility away from the CloudClaims staff and mandated that they follow the standard, proven process that had been in place for two decades at Medical Matters.
Suddenly, the deployment rate dropped from three times a day to once a month. What used to be a fully automated process now had to be broken into steps to allow for architecture and security review meetings, a biweekly change-control board meeting, and email approvals from two levels of executives. The CloudClaims developers challenged these processes, even showing the executives why their process was less risky than the process that they were being forced to use. Medical Matters would not budge. Eventually, key CloudClaims team members left Medical Matters. The product itself started to lose its value, because it could no longer respond to the market demand at the rate it once had.
Medical Matters’ approach destroys one of the key value propositions of the cloud: agility. I have seen companies where it took six months to provision a virtual machine in the cloud—something that should take five minutes—because the command-and-control cops forced cloud developers to go through the same ticketing and approval processes required in the datacenter.
This approach created very little value even as companies spent huge sums on strategy and policy work, building internal platforms that did not meet developers’ needs. Worse yet, this approach created an insurgent “shadow IT,” as it did at Alpha Enterprises: groups or teams began running their own mini-IT organizations to get things done because their needs were not being met through official channels.
These antipatterns have raised awareness of the need to focus on cloud operations and to invent a new cloud operating model. Since 2018, my clients have been clamoring for assistance in modernizing their operations and designing new operating models. Many are a few years into their journey.
At the start of the cloud adoption journey, enterprises focus a lot of attention on cloud infrastructure. They learn a lot in this phase, improving their technical skills for building software and guardrails in the cloud. They often start at the IaaS layer, because years of working with physical infrastructure have made them comfortable dealing with infrastructure. As the enterprise’s cloud experience matures, they begin to realize that the true value of cloud is higher up in the stack. That’s when they look into PaaS and software as a service (SaaS).
At the same time, development shops have been embracing high levels of automation and leveraging concepts like CI/CD. This book will show how concepts like DevOps, cloud-native architecture, infrastructure as code, and cloud computing have changed traditional operations.
The Datacenter Mindset Versus the Cloud Mindset
When you start building in the public cloud, you are basically starting from scratch: no existing cloud datacenter, no guardrails, no financial management tools and processes, no disaster recovery or business continuity plan, just a blank canvas. The conventional wisdom is to just use the tools, processes, and organizational structures you already have, from the datacenter, and apply them to the cloud. That’s usually a recipe for disaster.
When applications are moved, refactored, or built new on the cloud, they are being deployed to a brand-new virtual environment that is radically different from the datacenter environments that people are used to. The processes and policies governing how work gets done in a datacenter have typically evolved over many years. Along with these legacy processes comes a whole host of tools that were never intended to support software that runs in the cloud. If these tools are not cloud native, or at least “cloud friendly,” getting them to work effectively (or at all) in the cloud can involve a painful integration period. This creates friction for getting software out the door. It can also create unnecessary complexity, which can increase costs, reduce performance, and even reduce resiliency. All of this makes it challenging—and sometimes impossible—to automate software build-and-release processes from end to end.
Some of the questions IT teams need to ask when designing a cloud strategy include:
What should we do when incidents, events, or outages arise?
What processes should we follow to deploy software?
What’s the technology stack for the products we’re building and managing?
What processes should we follow to introduce new technology?
Let’s look at a few examples. In the command-and-control antipattern, one common desire is to keep existing on-premises logging solutions in place instead of moving to a cloud-native solution. If you do this, all logs must be sent from the public cloud back to the datacenter through a private channel. You’ll incur data transfer costs and create an unnecessary dependency on the datacenter. What’s more, these legacy logging solutions often have dependencies on other software solutions and processes, which in turn create unnecessary (and sometimes unknown) dependencies between the cloud and the datacenter that can cause cloud outages.
Here is another example. My team conducted an assessment of a client’s tools. We recommended tools that would work well in the cloud and advised them on which existing tools should be replaced by a more cloud-suitable solution. One tool we recommended replacing dealt with monitoring incoming network traffic. The group that managed the tool dug in and refused: they were comfortable with the old tool and didn’t want to have to manage two tools. This created a single point of failure for all of the applications and services running in that company’s public cloud. One day the tool failed—and so did all of that company’s cloud applications.
The lesson here is that clinging too closely to tools that are not well suited for the cloud will hamper your cloud adoption efforts and lead to avoidable errors and outages. Instead of sticking to what’s comfortable, work to reduce the number of datacenter dependencies, and have a plan to mitigate any failures.
As companies rethink their approach to the cloud, a new operating model that brings domain experts closer together can reduce these incidents.
Enterprises that have been building and running datacenters for many years often have a challenge shifting their mindset from procuring, installing, maintaining, and operating physical infrastructure (the “VP of Electricity” mindset) to a cloud mindset, where infrastructure is consumed as a service. Table 1-1 shows some of the mindset changes required to leverage the cloud. To be clear, the items on the right for the cloud-native approach are not things you get on day one of your cloud journey. They represent what you should strive for and work toward adopting over time. But if your team is stuck in the datacenter design mindset, you will lose a lot of the value of the cloud.
| Legacy datacenter approach | Cloud-native approach |
| --- | --- |
| Procure new infrastructure | Pay for consumption |
| Rack and stack infrastructure | Run automated scripts |
| Patch servers | Destroy and redeploy in CI/CD pipeline |
| Service requests for infrastructure | Enable self-service provisioning |
| Scale vertically | Scale horizontally |
| Plan for hardware refresh every 3-5 years | Does not apply |
| Multiple physical disaster recovery sites | Real-time disaster recovery across zones and regions |
| Networking appliances | Networking APIs |
| Multiple approvals and review gates | Push-button deployments |
Use What You Need, Not Just What You Have
Before cloud computing, almost all of the development I was involved in was deployed within datacenters that my company owned. For each piece of the technology stack, a specialist in that technology took responsibility. A team of database administrators (DBAs) installed and managed database software from vendors like Oracle, Microsoft, and Netezza. For middleware, system administrators installed and managed software like IBM’s WebSphere, Oracle’s WebLogic, and Apache Tomcat. The security team took responsibility for various third-party software solutions and appliances. The network team owned physical and software solutions.
Thus, whenever developers wanted to leverage a different solution from what was offered in the standard stack, it took a significant amount of justification. The solution had to be purchased up front, the appropriate hardware procured and implemented, contractual terms agreed upon with the vendor, annual maintenance fees budgeted for, and employees and/or consultants trained or hired to implement and manage the new stack component.
Adopting new stack components in the cloud can be accomplished much more quickly, especially when these stack components are native to the cloud service provider (CSP), as long as you don’t let legacy thinking and processes constrain you. For example:
No long procurement process is necessary if a solution is available from the CSP as a service.
No hardware purchase and implementation is necessary if the service is managed by the CSP.
No additional contract terms should be required if the proper master agreement is set up with the CSP.
There are no annual maintenance fees for each service thanks to the pay-as-you-go model.
The underlying technology is abstracted and managed by the CSP, so new skills are only needed at the software level (for example, learning how to consume the API).
Let’s say that Acme Retail, a fictitious big-box retailer, has standardized on Oracle for all of its online transaction processing (OLTP) database needs and Teradata for its data warehouse and NoSQL needs. A new business requirement comes along that requires a document store database in the next four months.
In the old model, adopting document store databases would require new hardware, software licensing, disk storage, and many other stack components. Acme employees would have to get all of the relevant hardware and software approved, procured, implemented, and secured, at significant effort and expense. In addition, Acme would need to hire or train DBAs to manage the database technology.
Now let’s look at how much simpler this can be in the public cloud. Acme is an AWS shop, and AWS offers a managed service for a document store database. Most of the steps mentioned above are totally eliminated. Acme no longer needs to worry about hardware, software licensing, additional DBAs to manage the database service, or new disk storage devices—in fact, it doesn’t need any procurement services at all. All Acme needs is to learn how to use the API for the document store database, and it can start building its solution.
Let’s say that Acme hires a consulting team to deliver the new feature. The consultants recommend purchasing MongoDB as the preferred document store database to satisfy the requirements to store and query documents. Acme has no prior experience with MongoDB, which means it will have to go through the procurement process. However, within Acme’s current set of processes, there is no way to get approvals, procure all of the hardware and software, train or hire DBAs, and implement the database in just four months. Therefore, Acme decides to leverage its existing Oracle database, a relational database engine, to solve the problem. This is suboptimal because relational databases are not a great solution for storing and retrieving documents. Document store databases were built specifically for that use case. But at least Acme can meet its deadline by leveraging existing database technology.
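To see why the relational workaround is a compromise, consider a minimal sketch of the same record handled both ways. Everything here is hypothetical: the order document, the table layout, and the `load_order` helper are invented for illustration, a plain Python dict stands in for a managed document store, and an in-memory SQLite database stands in for Acme's Oracle instance.

```python
import json
import sqlite3

# Hypothetical order document, in the shape a document store keeps as-is.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Pat Doe", "tier": "gold"},
    "items": [
        {"sku": "TV-55", "qty": 1},
        {"sku": "HDMI-2M", "qty": 2},
    ],
}

# Document-store style: one write and one read by key.
document_store = {}
document_store[order["order_id"]] = json.dumps(order)
fetched = json.loads(document_store["A-1001"])

# Relational style: the same data is split across tables on write...
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (
        order_id TEXT PRIMARY KEY, customer_name TEXT, customer_tier TEXT);
    CREATE TABLE order_items (order_id TEXT, sku TEXT, qty INTEGER);
""")
db.execute(
    "INSERT INTO orders VALUES (?, ?, ?)",
    (order["order_id"], order["customer"]["name"], order["customer"]["tier"]),
)
db.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["order_id"], item["sku"], item["qty"]) for item in order["items"]],
)


def load_order(conn, order_id):
    """...and has to be reassembled from several queries on read."""
    name, tier = conn.execute(
        "SELECT customer_name, customer_tier FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    rows = conn.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return {
        "order_id": order_id,
        "customer": {"name": name, "tier": tier},
        "items": [{"sku": sku, "qty": qty} for sku, qty in rows],
    }


rebuilt = load_order(db, "A-1001")
print(fetched == rebuilt)
```

Both paths return the same order, but the relational path needs a schema, a mapping layer, and a change to both whenever the document gains a new nested field. That extra work is the kind of technical debt Acme takes on by forcing documents into its existing database.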
This decision process repeats itself over and over from project to project: Acme keeps settling for suboptimal solutions due to the constraints of its legacy processes. The technical debt just keeps mounting.
Now let’s see how different this can all be if Acme decides to embrace a database-as-a-service solution in the public cloud.
After doing some testing in a sandbox environment in the cloud, the consultants determine that the document store managed service on Acme’s CSP platform fits Acme’s requirements. They can start building the solution right away because the database is already available in a pay-as-you-go model, complete with autoscaling.
Leveraging stack components as a service can reduce a project’s timeline by months. It allows you to embrace new technologies with a lot less risk. Perhaps most importantly, you no longer have to make technology compromises because of the legacy challenges of adopting new stack components.
Consuming stack components as a service gives architects greater flexibility. It is important for all IT domains to understand this. If they don’t, there is a good chance that they’ll end up forcing legacy constraints on their cloud architects and wind up building suboptimal greenfield solutions in the cloud that create new technical debt.
One of the key messages of this book is that you cannot achieve success in the cloud by focusing only on cloud technology. To succeed at scale in the cloud, enterprises must make changes not only to the technology, but to the organization structures and the legacy processes that are used to deliver and operate software. Embracing DevOps is a key ingredient to successfully transforming the organization as it adopts cloud computing. But what is DevOps, really?
One of the biggest misperceptions about the term DevOps is that it is a set of technologies and tools that developers and operators use to automate “all the things.” DevOps is much more than tools and technologies, and it takes more than just developers and operators to successfully embrace DevOps in any enterprise. Many people will shrug off this debate as nothing more than semantics, but understanding DevOps is critical for any organizational strategy. If you see DevOps as no more than automating CI/CD pipelines, you will likely leave out many important steps required to deliver in the cloud at scale.
There are many ways to define DevOps. Back in 2014 I defined it as “a culture shift or a movement that encourages great communication and collaboration (aka teamwork) to foster building better-quality software more quickly with more reliability.” I went on to add that “DevOps is the progression of the software development lifecycle (SDLC) from Waterfall to Agile to Lean and focuses on removing waste from the SDLC.”
But don’t take my word for it; look at the work of the leading DevOps authors, thought leaders, and evangelists. Gene Kim, coauthor of popular DevOps books such as The Phoenix Project, The DevOps Handbook, and The Unicorn Project (all IT Revolution Press), defines it as:
The set of cultural norms and technology practices that enable the fast flow of planned work into operations while preserving world class reliability, operation and security.
DevOps is not about what you do, but what your outcomes are. So many things that we associate with DevOps fits underneath this very broad umbrella of beliefs and practices . . . of course, communication and culture are part of them.4
Buntel and Stroud, in their book The IT Manager’s Guide to DevOps (XebiaLabs), define DevOps as “a set of cultural philosophies, processes, practices, and tools that radically removes waste from your software production process.”5 Similarly, The DevOps Handbook asks us to “imagine a world where the product owners, Development, QA, IT Operations, and Infosec work together, not only to help each other, but also to ensure that the overall organization succeeds.”
The new ways, methods, and paradigms . . . to develop software, with a focus on Agile and Lean processes that extended downstream from development and prioritized a culture of trust and information flow, with small cross-functional teams creating software.
As we strive to improve speed to market, we must not sacrifice quality along the way. The users of our products and services expect those products and services to work. The more things don’t function as expected, the lower overall customer satisfaction will be. In addition, quality issues lead to unplanned work, which can lead to long hours, high pressure to fix critical issues quickly, lower productivity, and burnout. DevOps aims to ensure high levels of quality throughout the entire SDLC to create better products and services, and happy customers and workers.
Silo structures breed a lack of trust between the silos, which typically have conflicting priorities. For example, the security team focuses on reducing threats, the testing team on finding defects, the operations team on stability, and the development team on speed to market. Each silo builds processes for interacting with the other silos with the goal of improving the likelihood of meeting the year’s objectives. So the security team adds request forms, review meetings, and standards, seeking visibility into potential risks in the software and infrastructure being introduced to the system. The problem is that, often, the security team does not consider the other teams’ goals and objectives. The same holds true for the other silos. Meanwhile, development builds processes to expedite building and deploying software to production. This directly conflicts with the testing team’s goal of catching defects before the code is released to production, and creates challenges for operations, whose goal is to maintain acceptable levels of reliability, because they can’t effectively track the changes and the potential impacts to dependent systems.
Narrow-minded goal setting within silos creates mistrust and organizational conflict between teams. DevOps aims to create more trust throughout the SDLC, so groups can better collaborate and optimize their processes, resulting in higher agility and morale.
- Sharing and collaboration
Sharing and collaboration go hand in hand. When experts in different domains work closely together, they produce better outcomes. A key component of collaboration is to share information: goals, lessons learned, feedback, and code samples. Without good collaboration, projects tend to fall into a waterfall mentality. For example, one development team I worked with finished coding and testing and requested a review from the security team—which rejected their unit of work because it didn’t meet their security requirements. Fixing these issues took several rounds of back-and-forth—and then they had to repeat the process for operations, compliance, and architecture. This led to longer lead times from when a customer requested a feature to when that feature was usable in production. The result was that business users became frustrated with the long lead times and started looking for IT solutions outside of IT.
DevOps aims to foster better collaboration across different technology domains like these, so it can address issues early. It also aims to create a culture of collaboration, where everyone works toward common outcomes.
- Removing waste
Much of the DevOps mindset was adopted from Lean manufacturing processes and from writings like Eliyahu Goldratt’s The Goal (North River Press), which focuses on optimizing the production assembly line by identifying and removing waste and process bottlenecks. The processes for building and deploying software are often riddled with huge amounts of manual intervention, review gates, multiple approvals, and numerous other bottlenecks that reduce agility and often contribute very little to their goals of eliminating risks and improving quality. DevOps aims to drive system thinking throughout the SDLC with the goal of streamlining work and creating a culture of continuous improvement.
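As a rough illustration of what identifying a bottleneck can look like in practice, the sketch below takes the time a single feature spent in each delivery stage and reports which stage accounts for the largest share of total lead time. The function name `find_bottleneck`, the stage names, and the hour figures are all hypothetical; real numbers would come from your own ticketing or pipeline data.

```python
from typing import Dict, Tuple


def find_bottleneck(stage_hours: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its share of the total lead time."""
    total = sum(stage_hours.values())
    if total <= 0:
        raise ValueError("total lead time must be positive")
    # The stage with the most elapsed hours is the first candidate for improvement.
    slowest = max(stage_hours, key=lambda stage: stage_hours[stage])
    return slowest, stage_hours[slowest] / total


# Hypothetical hours one feature spent in each stage (made-up figures).
example = {
    "coding": 16.0,
    "security review": 40.0,
    "change approval": 24.0,
}

stage, share = find_bottleneck(example)
print(f"{stage} accounts for {share:.0%} of lead time")
```

With these made-up figures, the review gate, not the coding work, dominates lead time. That is the pattern the paragraph above describes: the largest gains usually come from streamlining handoffs and approvals rather than from typing code faster.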
If DevOps embraces all of these ideas, why do so many organizations create a new silo called DevOps and focus on writing automation scripts, without collaborating with the product owners and developers? Some companies take their existing operations teams, adopt a few new tools, and call that DevOps. While these steps are in themselves progress, a nonholistic approach to DevOps will not deliver its promise. “DevOps” silos often lead to even more waste, because the focus is usually exclusively on the tools and scripting, not on the real goals of the product teams they support.
To understand DevOps, it is critical that we understand its roots. Its evolution started back in 2008. At the Agile 2008 Toronto conference, as he recalls it, Agile developer Andrew Shafer gave a presentation entitled “Agile Infrastructure.” Belgian infrastructure expert Patrick Debois was the only attendee. The two discussed how to use Agile infrastructure to resolve the bottlenecks and conflicts between development and operations, and their conversation blossomed into a collaboration. They created the Agile Systems Administration Group to try to improve life in IT.
At the 2009 O’Reilly Velocity Conference, John Allspaw and Paul Hammond got the IT community (including Debois) buzzing with their presentation “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”. Back then, deploying multiple times a day was almost unheard of and would have been widely considered reckless and irresponsible. Inspired by the presentation, Debois set up a conference and invited his network on Twitter. He named it DevOpsDays and used the hashtag #DevOps when promoting it on Twitter. Were it not for Twitter’s 140-character limit at the time, we would likely have had a less succinct movement in the industry.
It is easy to see why many people think that DevOps is just about developers and operators working together. However, the DevOps net has since been cast much more broadly. Today DevOps touches all areas of business. Many start their DevOps journey looking only at automating the infrastructure or the application build process, but high-performing companies with more maturity in DevOps are redesigning their entire organizations and processes, both inside and outside of IT.
- Changing the company culture and mindset is a critical success factor.
- Removing waste and bottlenecks from the SDLC helps drive business value.
- Shifting from reactive to proactive operations improves reliability.
- It’s important to start somewhere and then continuously learn and improve.
- DevOps and the cloud require a new operating model and organizational change.
Enterprises that embrace change and continuous learning look very different three or four years into their journey. Table 1-2 shows the typical order in which companies address bottlenecks to improve delivery and business outcomes. Usually, as they make strides removing one bottleneck (for example, inconsistent environments), they then progress to resolve their next big bottleneck (for example, security).
| Order | Bottleneck | Change implemented |
|---|---|---|
| 1 | Nonrepeatable, error-prone build process | Continuous integration (CI) |
| 2 | Slow and inconsistent environment provisioning | Continuous delivery (CD) |
| 3 | Inefficient testing processes and handoffs | Shift testing left/test automation |
| 4 | Resistance from security team; process bottlenecks | Shift security left, DevSecOps |
| 5 | Painful handoff to operations teams, forced to use legacy tools/processes, poor MTTR | Shift ops left, new operating models (platform teams, SRE, etc.) |
| 6 | Slow service-level agreements from tiers 1–3 support | Shift support left, new operating models |
| 7 | Slow and painful approval processes for governance, risk, and compliance | Shift GRC left, stand up cloud GRC body |
These are just some of the problems I see, and the corresponding changes implemented to remove the bottlenecks. Many enterprises have large initiatives to reskill their workforce and even rethink the future of work. The incentives we offer workers must change to achieve the desired outcomes; procurement processes must change as we shift from licensing and maintenance to pay-as-you-go models; in short, every part of the organization is affected in one way or another.
Conclusion: People, Processes, Technology
Today we can build and deploy software faster than ever before. Cloud computing is a big reason why. CSPs are providing developers with a robust service catalog that abstracts the underlying infrastructure, allowing developers to focus more on business requirements and features. When cloud computing first became popular in the mid- to late 2000s, most people used the infrastructure as a service (IaaS) mindset. As developers became more experienced, they started leveraging higher levels of abstraction. Platform as a service (PaaS) abstracts away both the infrastructure and the application stack (the operating system, databases, middleware, and so forth). Software as a service (SaaS) vendors provide full-fledged software solutions; enterprises only need to make configuration changes to meet their requirements and manage user access.
Each of these three cloud service models (IaaS, PaaS, and SaaS) can be a huge accelerator for a business, which no longer has to wait for the IT department to procure and install the underlying infrastructure and application stack, or to build and maintain the entire solution.
In addition, technologies like serverless computing, containers, and fully managed services (such as databases as a service, blockchain as a service, and streaming as a service) are providing capabilities for developers to build systems much faster. I will discuss each of these concepts in more detail in Chapter 2. I’ll also look at immutable infrastructure and microservices, two important concepts that accelerate speed to market in the cloud.
But this transformation is bigger than CI/CD pipelines or Terraform templates. You learned in this chapter that organizational change, culture change, thinking and acting differently, modernizing how work gets done, and leveraging new tools and technologies are at the heart and soul of DevOps.
To scale DevOps across an organization, a new operating model is required. It’s a little ironic: IT departments have introduced so many new technologies, yet those IT departments themselves largely just kept running with the same old processes and structures, despite major changes in the underpinning technology. Even fairly large advances in methods in IT, most notably Agile, only changed processes in parts of silos, rather than looking holistically at the entire IT organization. Adopting these concepts in a silo without addressing their impact on people, processes, and technology across the entire SDLC is a recipe for failure. (Chapter 5 will discuss some of the patterns and antipatterns for operating models.)
Gaining that holistic view requires paying close attention not only to technology, but to the processes that run the tech and the people who carry out those processes. In the next three chapters, I’ll look at each of these in turn.
1 In the book The Big Switch (W. W. Norton), Nick Carr introduces us to the famous analogy of adoption of the power grid. Here I extend Carr’s thesis to show how the challenges of embracing the new ways of working that came with power are analogous to those of embracing cloud computing.
3 Synopsys describes Heartbleed as a bug that “allows anyone on the internet to read the memory of the systems protected by the vulnerable versions of the OpenSSL software. This compromises the secret keys used to identify the service providers and to encrypt the traffic, the names and passwords of the users, and the actual content. This allows attackers to eavesdrop on communications, steal data directly from the services and users, and to impersonate services and users.”