Migrating to cloud-native application architectures
Adoption of cloud-native application architectures is helping many organizations transform their IT into a force for true agility in the marketplace.
Software is eating the world.
Stable industries that have for years been dominated by entrenched leaders are rapidly being disrupted, and they’re being disrupted by businesses with software at their core. Companies like Square, Uber, Netflix, Airbnb, and Tesla continue to possess rapidly growing private market valuations and turn the heads of executives of their industries’ historical leaders. What do these innovative companies have in common?
Moving to the cloud is a natural evolution of focusing on software, and cloud-native application architectures are at the center of how these companies obtained their disruptive character. By cloud, we mean any computing environment in which computing, networking, and storage resources can be provisioned and released elastically in an on-demand, self-service manner. This definition includes both public cloud infrastructure (such as Amazon Web Services, Google Cloud, or Microsoft Azure) and private cloud infrastructure (such as VMware vSphere or OpenStack).
In this chapter we’ll explain how cloud-native application architectures enable these innovative characteristics. Then we’ll examine a few key aspects of cloud-native application architectures.
First we’ll examine the common motivations behind moving to cloud-native application architectures.
It’s become clear that speed wins in the marketplace. Businesses that are able to innovate, experiment, and deliver software-based solutions quickly are outcompeting those that follow more traditional delivery models.
In the enterprise, the time it takes to provision new application environments and deploy new versions of software is typically measured in days, weeks, or months. This lack of speed severely limits the risk that can be taken on by any one release, because the cost of making and fixing a mistake is also measured on that same timescale.
Internet companies are often cited for their practice of deploying hundreds of times per day. Why are frequent deployments important? If you can deploy hundreds of times per day, you can recover from mistakes almost instantly. If you can recover from mistakes almost instantly, you can take on more risk. If you can take on more risk, you can try wild experiments—the results might turn into your next competitive advantage.
The elasticity and self-service nature of cloud-based infrastructure lend themselves naturally to this way of working. Provisioning a new application environment by making a call to a cloud service API is faster than a form-based manual process by several orders of magnitude. Deploying code to that new environment via another API call adds more speed. Adding self-service and hooks to teams’ continuous integration/build server environments adds even more speed. Eventually we can measure the answer to Lean guru Mary Poppendieck’s question, “How long would it take your organization to deploy a change that involves just one single line of code?” in minutes or seconds.
Imagine what your team…what your business…could do if you were able to move that fast!
It’s not enough to go extremely fast. If you get in your car and push the pedal to the floor, eventually you’re going to have a rather expensive (or deadly!) accident. Transportation modes such as aircraft and express bullet trains are built for speed and safety. Cloud-native application architectures balance the need to move rapidly with the needs of stability, availability, and durability. It’s possible and essential to have both.
As we’ve already mentioned, cloud-native application architectures enable us to rapidly recover from mistakes. We’re not talking about mistake prevention, which has been the focus of many expensive hours of process engineering in the enterprise. Big design up front, exhaustive documentation, architectural review boards, and lengthy regression testing cycles all fly in the face of the speed that we’re seeking. Of course, all of these practices were created with good intentions. Unfortunately, none of them have provided consistently measurable improvements in the number of defects that make it into production.
So how do we go fast and safe?
As demand increases, we must scale our capacity to service that demand. In the past we handled more demand by scaling vertically: we bought larger servers. We eventually accomplished our goals, but slowly and at great expense. This led to capacity planning based on peak usage forecasting. We asked “what’s the most computing power this service will ever need?” and then purchased enough hardware to meet that number. Many times we’d get this wrong, and we’d still exhaust our available capacity during events like Black Friday. But more often we’d be saddled with tens or hundreds of servers with mostly idle CPUs, which resulted in poor utilization metrics.
Innovative companies dealt with this problem through two pioneering moves:
As public cloud infrastructure like Amazon Web Services became available, these two moves converged. The virtualization effort was delegated to the cloud provider, and the consumer focused on horizontal scale of its applications across large numbers of cloud server instances. Recently another shift has happened with the move from virtual servers to containers as the unit of application deployment. We’ll discuss containers in Containerization.
This shift to the cloud opened the door for more innovation, as companies no longer required large amounts of startup capital to deploy their software. Ongoing maintenance also required a lower capital investment, and provisioning via API not only improved the speed of initial deployment, but also maximized the speed with which we could respond to changes in demand.
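The difference between peak-based capacity planning and elastic provisioning can be made concrete with a little arithmetic. The following is a rough sketch only: the hourly demand figures and the per-server capacity are invented for illustration and do not come from any real workload.

```python
import math

# Hypothetical hourly demand (requests/second) over one day, with a short peak.
hourly_demand = [200] * 8 + [600] * 8 + [2000] * 2 + [600] * 6
CAPACITY_PER_SERVER = 100  # assumed requests/second that one server can handle


def servers_needed(demand: int) -> int:
    """Smallest number of servers that can carry the given demand."""
    return math.ceil(demand / CAPACITY_PER_SERVER)


# Peak-based planning: buy enough servers for the busiest hour and run them all day.
fixed_fleet = max(servers_needed(d) for d in hourly_demand)
fixed_server_hours = fixed_fleet * len(hourly_demand)

# Elastic provisioning: run only what each hour actually requires.
elastic_server_hours = sum(servers_needed(d) for d in hourly_demand)

utilization = elastic_server_hours / fixed_server_hours
print(f"fixed fleet: {fixed_fleet} servers, {fixed_server_hours} server-hours")
print(f"elastic: {elastic_server_hours} server-hours")
print(f"fixed-fleet utilization: {utilization:.0%}")
```

With these made-up numbers, the fixed fleet consumes 480 server-hours to do work that needs only 140, which is roughly 29% utilization.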
Unfortunately all of these benefits come with a cost. Applications must be architected differently for horizontal rather than vertical scale. The elasticity of the cloud demands ephemerality. Not only must we be able to create new application instances quickly; we must also be able to dispose of them quickly and safely. This need is a question of state management: how does the disposable interact with the persistent? Traditional methods such as clustered sessions and shared filesystems employed in mostly vertical architectures do not scale very well.
Another hallmark of cloud-native application architectures is the externalization of state to in-memory data grids, caches, and persistent object stores, while keeping the application instance itself essentially stateless. Stateless applications can be quickly created and destroyed, as well as attached to and detached from external state managers, enhancing our ability to respond to changes in demand. Of course this also requires the external state managers themselves to be scalable. Most cloud infrastructure providers have recognized this necessity and provide a healthy menu of such services.
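A minimal sketch of this idea follows. The `ExternalStateStore` class is a hypothetical in-memory stand-in for whatever cache or data grid a platform provides, and `CartService` is an invented example service; neither name comes from a real library.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalStateStore:
    """Stand-in for an external cache or data grid (hypothetical; in-memory here)."""
    _data: dict = field(default_factory=dict)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CartService:
    """A stateless application instance: it keeps no session data of its own."""

    def __init__(self, store: ExternalStateStore):
        self.store = store  # attached at startup, detached at shutdown

    def add_item(self, session_id: str, item: str) -> list:
        cart = list(self.store.get(session_id, []))
        cart.append(item)
        self.store.put(session_id, cart)
        return cart


store = ExternalStateStore()
instance_a = CartService(store)
instance_a.add_item("session-1", "book")

# Dispose of instance_a and start a replacement. The cart survives because
# it never lived inside the application instance.
del instance_a
instance_b = CartService(store)
cart = instance_b.add_item("session-1", "pen")
print(cart)  # ['book', 'pen']
```

Because any instance can serve any request, instances can be added or removed freely as demand changes.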
In January 2014, mobile devices accounted for 55% of Internet usage in the United States. Gone are the days of implementing applications targeted at users working on computer terminals tethered to desks. Instead we must assume that our users are walking around with multicore supercomputers in their pockets. This has serious implications for our application architectures, as far more users can interact with our systems anytime and anywhere.
Take the example of viewing a checking account balance. This task used to be accomplished by calling the bank’s call center, taking a trip to an ATM location, or asking a teller at one of the bank’s branch locations. These customer interaction models placed significant limits on the demand that could be placed on the bank’s underlying software systems at any one time.
The move to online banking services caused an uptick in demand, but still didn’t fundamentally change the interaction model. You still had to physically be at a computer terminal to interact with the system, which still limited the demand significantly. Only when we all began, as my colleague Andrew Clay Shafer often says, “walking around with supercomputers in our pockets,” did we start to inflict pain on these systems. Now thousands of customers can interact with the bank’s systems anytime and anywhere. One bank executive has said that on payday, customers will check their balances several times every few minutes. Legacy banking systems simply weren’t architected to meet this kind of demand, while cloud-native application architectures are.
The huge diversity in mobile platforms has also placed demands on application architectures. At any time customers may want to interact with our systems from devices produced by multiple different vendors, running multiple different operating platforms, running multiple versions of the same operating platform, and from devices of different form factors (e.g., phones vs. tablets). This places constraints not only on mobile application developers, but also on the developers of backend services.
Mobile applications often have to interact with multiple legacy systems as well as multiple microservices in a cloud-native application architecture. These services cannot be designed to support the unique needs of each of the diverse mobile platforms used by our customers. Forcing the burden of integration of these diverse services on the mobile developer increases latency and network trips, leading to slow response times and high battery usage, ultimately leading to users deleting your app. Cloud-native application architectures also support the notion of mobile-first development through design patterns such as the API Gateway, which transfers the burden of service aggregation back to the server-side. We’ll discuss the API Gateway pattern in API Gateways/Edge Services.
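As a rough illustration of server-side aggregation, the sketch below fans out to three backend calls concurrently and returns a single response shaped for a mobile client. The three `fetch_*` functions are hypothetical stubs standing in for remote services, and the response fields are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical backend services; real ones would be remote HTTP calls.
def fetch_account(customer_id: str) -> dict:
    return {"customer_id": customer_id, "name": "Pat Example"}


def fetch_balance(customer_id: str) -> dict:
    return {"checking": 1250.00}


def fetch_recent_transactions(customer_id: str) -> list:
    return [{"amount": -42.50, "memo": "groceries"}]


def account_summary(customer_id: str) -> dict:
    """Gateway endpoint: call the backends concurrently and return one
    response, so the mobile client makes a single network round trip."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        account = pool.submit(fetch_account, customer_id)
        balance = pool.submit(fetch_balance, customer_id)
        transactions = pool.submit(fetch_recent_transactions, customer_id)
        return {
            "name": account.result()["name"],
            "checking_balance": balance.result()["checking"],
            "recent": transactions.result(),
        }


summary = account_summary("customer-1")
print(summary)
```

The device sees one request and one response; the fan-out happens on the server, close to the services themselves.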
Now we’ll explore several key characteristics of cloud-native application architectures. We’ll also look at how these characteristics address motivations we’ve already discussed.
The twelve-factor app is a collection of patterns for cloud-native application architectures, originally developed by engineers at Heroku. The patterns describe an application archetype that optimizes for the “why” of cloud-native application architectures. They focus on speed, safety, and scale by emphasizing declarative configuration, stateless/shared-nothing processes that horizontally scale, and an overall loose coupling to the deployment environment. Cloud application platforms like Cloud Foundry, Heroku, and AWS Elastic Beanstalk are optimized for deploying twelve-factor apps.
In the context of twelve-factor, application (or app) refers to a single deployable unit. Organizations will often refer to multiple collaborating deployables as an application. In this context, however, we will refer to these multiple collaborating deployables as a distributed system.
A twelve-factor app can be described in the following ways:
These characteristics lend themselves well to deploying applications quickly, as they make few to no assumptions about the environments to which they’ll be deployed. This lack of assumptions allows the underlying cloud platform to use a simple and consistent mechanism, easily automated, to provision new environments quickly and to deploy these apps to them. In this way, the twelve-factor application patterns enable us to optimize for speed.
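One way an app avoids assumptions about its environment is to read all deployment-specific settings from environment variables. The sketch below shows the idea; the variable names (`PORT`, `DATABASE_URL`, `LOG_LEVEL`) and the example URLs are illustrative assumptions, not names required by any particular platform.

```python
import os


def load_config(environ=os.environ) -> dict:
    """Read deployment-specific settings from the environment.

    The variable names used here are illustrative assumptions.
    """
    return {
        "port": int(environ.get("PORT", "8080")),
        "database_url": environ["DATABASE_URL"],  # required: fail fast if absent
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }


# The same code runs unchanged in every environment; only the variables differ.
staging = load_config({"DATABASE_URL": "postgres://staging-db/app"})
production = load_config({"PORT": "9000", "DATABASE_URL": "postgres://prod-db/app"})
print(staging["port"], production["port"])  # 8080 9000
```

Nothing about the target environment is baked into the artifact, so the platform can deploy the same build anywhere by supplying different variables.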
These characteristics also lend themselves well to the idea of ephemerality, or applications that we can “throw away” with very little cost. The application environment itself is 100% disposable, as any application state, be it in-memory or persistent, is extracted to some backing service. This allows the application to be scaled up and down in a very simple and elastic manner that is easily automated. In most cases, the underlying platform simply copies the existing environment the desired number of times and starts the processes. Scaling down is accomplished by halting the running processes and deleting the environments, with no effort expended backing up or otherwise preserving the state of those environments. In this way, the twelve-factor application patterns enable us to optimize for scale.
Finally, the disposability of the applications enables the underlying platform to automatically recover from failure events very quickly. Furthermore, the treatment of logs as event streams greatly improves visibility into the underlying behavior of the applications at runtime. The enforced parity between environments and the consistency of configuration mechanisms and backing service management enable cloud platforms to provide rich visibility into all aspects of the application’s runtime fabric. In this way, the twelve-factor application patterns enable us to optimize for safety.
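Treating logs as an event stream can be as simple as writing one structured event per line to standard output and leaving collection and routing to the platform. The sketch below assumes a JSON-lines format and a hypothetical `log_event` helper; a buffer stands in for standard output so the result can be inspected.

```python
import io
import json
import sys
import time


def log_event(event: str, stream=sys.stdout, **fields) -> str:
    """Emit one log event as a single JSON line on the given stream.

    The app never opens, rotates, or ships log files itself.
    """
    line = json.dumps({"ts": time.time(), "event": event, **fields})
    stream.write(line + "\n")
    stream.flush()
    return line


# In production the default stream (stdout) would be used; a buffer is
# substituted here only to show what was written.
buffer = io.StringIO()
log_event("order.created", stream=buffer, order_id=42)
record = json.loads(buffer.getvalue())
print(record["event"], record["order_id"])  # order.created 42
```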
Microservices represent the decomposition of monolithic business systems into independently deployable services that do “one thing well.” That one thing usually represents a business capability, or the smallest, “atomic” unit of service that delivers business value.
Microservice architectures enable speed, safety, and scale in several ways:
Teams developing cloud-native application architectures are typically responsible for their deployment and ongoing operations. Successful adopters of cloud-native applications have empowered teams with self-service platforms.
Just as we create business capability teams to build microservices for each bounded context, we also create a capability team responsible for providing a platform on which to deploy and operate these microservices (The Platform Operations Team).
The best of these platforms raise the primary abstraction layer for their consumers. With infrastructure as a service (IaaS) we asked the API to create virtual server instances, networks, and storage, and then applied various forms of configuration management and automation to enable our applications and supporting services to run. Platforms are now emerging that allow us to think in terms of applications and backing services.
Application code is simply “pushed” in the form of pre-built artifacts (perhaps those produced as part of a continuous delivery pipeline) or raw source code to a Git remote. The platform then builds the application artifact, constructs an application environment, deploys the application, and starts the necessary processes. Teams do not have to think about where their code is running or how it got there, as the platform takes care of these types of concerns transparently.
The same model is supported for backing services. Need a database? How about a message queue or a mail server? Simply ask the platform to provision one that fits your needs. Platforms now support a wide range of SQL/NoSQL data stores, message queues, search engines, caches, and other important backing services. These service instances can then be “bound” to your application, with necessary credentials automatically injected into your application’s environment for it to consume. A great deal of messy and error-prone bespoke automation is thereby eliminated.
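As one concrete illustration of injected credentials, Cloud Foundry exposes bound services to an application through a `VCAP_SERVICES` environment variable containing JSON. The sketch below reads credentials from a value of that general shape. The service label `example-db`, the instance name, and the `uri` credential key are assumptions made for the example; real services define their own labels and credential fields.

```python
import json
import os


def bound_credentials(service_label: str, environ=os.environ) -> dict:
    """Return the credentials of the first bound instance of a service.

    Assumes a Cloud Foundry style VCAP_SERVICES variable: a JSON object
    keyed by service label, each entry holding a list of bound instances.
    """
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    instances = services.get(service_label, [])
    if not instances:
        raise LookupError(f"no service bound with label {service_label!r}")
    return instances[0]["credentials"]


# Illustrative environment, as a platform might inject it after binding.
example_env = {
    "VCAP_SERVICES": json.dumps({
        "example-db": [{
            "name": "orders-db",
            "credentials": {"uri": "postgres://user:secret@db.internal:5432/orders"},
        }]
    })
}
creds = bound_credentials("example-db", example_env)
print(creds["uri"])
```

The application code contains no hostnames or passwords; rebinding to a different service instance changes only the environment.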
These platforms also often provide a wide array of additional operational capabilities:
This combination of tools ensures that capability teams are able to develop and operate services according to agile principles, again enabling speed, safety, and scale.
The sole mode of interaction between services in a cloud-native application architecture is via published and versioned APIs. These APIs are typically HTTP REST-style with JSON serialization, but can use other protocols and serialization formats.
Teams are able to deploy new functionality whenever there is a need, without synchronizing with other teams, provided that they do not break any existing API contracts. The primary interaction model for the self-service infrastructure platform is also an API, just as it is with the business services. Rather than submitting tickets to provision, scale, and maintain application infrastructure, those same requests are submitted to an API that automatically services the requests.
Contract compliance can be verified on both sides of a service-to-service interaction via consumer-driven contracts. Service consumers are not allowed to gain access to private implementation details of their dependencies or directly access their dependencies’ data stores. In fact, any given data store may be accessed directly by only one service. This forced decoupling directly supports the cloud-native goal of speed.
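The core of a consumer-driven contract check can be sketched in a few lines: the consumer states which fields it relies on, and the provider verifies each candidate response against that statement before deploying. This is a simplified illustration rather than the format of any particular contract-testing tool; the contract, field names, and sample responses are all hypothetical.

```python
# A consumer publishes the fields (and types) it relies on.
BALANCE_CONTRACT = {"account_id": str, "balance_cents": int, "currency": str}


def contract_violations(contract: dict, response: dict) -> list:
    """List the ways a provider response fails a consumer's contract."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in response:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(response[field_name], expected_type):
            problems.append(f"wrong type for {field_name}")
    return problems


# Adding a field is backward compatible; renaming one is not.
v1 = {"account_id": "a-1", "balance_cents": 125000, "currency": "USD"}
v2_additive = {**v1, "available_cents": 120000}
v2_breaking = {"id": "a-1", "balance_cents": 125000, "currency": "USD"}

print(contract_violations(BALANCE_CONTRACT, v2_additive))  # []
print(contract_violations(BALANCE_CONTRACT, v2_breaking))  # ['missing field: account_id']
```

Running such checks in the provider's build lets a team ship additive changes independently while catching breaking ones before they reach consumers.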
The concept of antifragility was introduced in Nassim Taleb’s book Antifragile (Random House). If fragility is the quality of a system that gets weaker or breaks when subjected to stressors, then what is the opposite of that? Many would respond with the idea of robustness or resilience—things that don’t break or get weaker when subjected to stressors. However, Taleb introduces the opposite of fragility as antifragility, or the quality of a system that gets stronger when subjected to stressors. What systems work that way? Consider the human immune system, which gets stronger when exposed to pathogens and weaker when quarantined. Can we build architectures that way? Adopters of cloud-native architectures have sought to build them. One example is the Netflix Simian Army project, with the famous submodule “Chaos Monkey,” which injects random failures into production components with the goal of identifying and eliminating weaknesses in the architecture. By explicitly seeking out weaknesses in the application architecture, injecting failures, and forcing their remediation, the architecture naturally converges on a greater degree of safety over time.
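The failure-injection idea can be illustrated with a toy simulation. This is not how Chaos Monkey is implemented; it is a sketch in which a hypothetical `InstancePool` loses a random instance on each round and a reconcile step, standing in for the platform, restores the desired count.

```python
import random


class InstancePool:
    """Hypothetical pool of identical, disposable service instances."""

    def __init__(self, desired: int):
        self.desired = desired
        self.running = {f"instance-{i}" for i in range(desired)}
        self._next_id = desired

    def kill_random(self, rng: random.Random) -> str:
        """Failure injection: terminate one running instance at random."""
        victim = rng.choice(sorted(self.running))
        self.running.remove(victim)
        return victim

    def reconcile(self) -> int:
        """Platform recovery: start replacements until the desired count is met."""
        started = 0
        while len(self.running) < self.desired:
            self.running.add(f"instance-{self._next_id}")
            self._next_id += 1
            started += 1
        return started


rng = random.Random(7)  # seeded so the run is repeatable
pool = InstancePool(desired=3)
for _ in range(5):
    victim = pool.kill_random(rng)
    replaced = pool.reconcile()
    print(f"killed {victim}, started {replaced} replacement(s)")
```

If the reconcile step ever failed to restore capacity, the experiment would have exposed a weakness to fix, which is the point of injecting failures deliberately.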
In this chapter we’ve examined the common motivations for moving to cloud-native application architectures in terms of abilities that we want to provide to our business via software:
We’ve also examined the unique characteristics of cloud-native application architectures and how they can help us provide these abilities:
In the next chapter we’ll examine a few of the changes that most enterprises will need to make in order to adopt cloud-native application architectures.