Somewhere around 2002, Jeff Bezos famously issued a mandate that described how software at Amazon had to be written. The tenets were as follows:
All teams will henceforth expose their data and functionality through service interfaces.
Teams must communicate with each other through these interfaces.
There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no backdoors whatsoever. The only communication allowed is via service interface calls over the network.
It doesn’t matter what technology they use.
All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
Anyone who doesn’t do this will be fired.
The above mandate was the precursor to Amazon Web Services (AWS), the original public cloud offering, and the foundation of everything we cover in this book. To understand the directives above and the rationale behind them is to understand the motivation for an enterprise-wide cloud migration. Jeff Bezos understood the importance of refactoring Amazon’s monolith for the cloud, even at a time when “the cloud” did not yet exist! Much of Amazon’s radical success since then has been due to the decision to lease its infrastructure to others and create an extensible company. Other forward-thinking companies such as Netflix run most of their business in Amazon’s cloud; Netflix even speaks regularly at AWS’s re:Invent conference about its journey to AWS. The Netflix situation is all the more intriguing because Netflix competes with the Amazon Video offering! But the cloud does not care; the cloud is neutral. There is so much value in cloud infrastructure like AWS that Netflix determined it was better to have a competitor host its systems than to incur the cost of building its own infrastructure.
Shared databases, shared tables, direct linking: these are typical early attempts at carving up a monolith. Many systems begin the modernization story by breaking apart at the service level, only to remain coupled at the data level. The problem with these approaches is the resulting high degree of coupling: any change to the underlying data model must be rolled out to multiple services, which effectively means you probably spent a fortune transforming a monolithic system into a distributed monolith. To phrase this another way, in a distributed system a change to one component should not require a change to another component. Even if two services are physically separate, they are still coupled if a change to one requires a change in the other; at that point they are effectively a single service and should be merged to reflect that reality.
The tenets in Bezos’ mandate hint that we should think of two services as autonomous collections of behavior and state that are completely independent of each other, even with respect to the technologies they’re implemented in. Each service would be required to have its own storage mechanisms, independent from and unknown to other services. No shared databases, no shared tables, no direct linking. Organizing services in this manner requires a shift in thinking along with a set of specific, now well-proven techniques. If many services are writing to the same table in a database, it may indicate that the table should be its own service. By placing a small service called a shim in front of the shared resource, we effectively expose the resource as a service that can be accessed through a public API. We stop thinking about accessing data from databases and start thinking about providing data through services.
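As a sketch of the shim idea, consider a hypothetical shared `customers` table that several services previously queried directly. A thin shim owns the storage and exposes only explicit operations; the table name, fields, and method names below are all illustrative, and in a real system these methods would be served over the network as a public API:

```python
import sqlite3

class CustomerShim:
    """A minimal shim: the only way other teams reach the customers
    table is through these methods, never through raw SQL or a shared
    database connection."""

    def __init__(self, db_path=":memory:"):
        # The shim owns the connection; callers never see it, so the
        # storage mechanism can change without touching any caller.
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def create_customer(self, name):
        cur = self._db.execute("INSERT INTO customers (name) VALUES (?)", (name,))
        self._db.commit()
        return {"id": cur.lastrowid, "name": name}

    def get_customer(self, customer_id):
        row = self._db.execute(
            "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return None if row is None else {"id": row[0], "name": row[1]}
```

Because consumers depend only on the shim’s API, the underlying data model can evolve behind it without a coordinated rollout across every team that once touched the table.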
Effectively, the core of a modernization project requires architects and developers to focus less on the mechanism of storage, in this case a database, and more on the API. We can abstract away our databases by considering them as services, and by doing so we move in the right direction, thinking about everything in our organization as extensible services rather than implementation details. This is not only a profound technical change, but a cultural one as well. Databases are the antithesis of services and often the epitome of complexity. They often force developers to dig deep into the internals to determine the implicit APIs buried within, but for effective collaboration we need clarity and transparency. Nothing is more clear and transparent than an explicit service API.
According to the 451 Global Digital Infrastructure Alliance, a majority of enterprises surveyed are in two phases of cloud adoption: Initial Implementation (31%) or Broad Implementation (29%).1 A services-first approach to development plays a critical role in application modernization, which is one of three pillars of a successful cloud adoption initiative. The other two pillars are infrastructure refresh and security modernization.
Cloud adoption requires careful consideration at all levels of an enterprise. These are the three pillars of a successful cloud adoption:
Most organizations will adopt a hybrid-cloud topology, with servers both in the cloud and on the premises acting as a single cluster, and some organizations may also adopt a multi-cloud strategy for flexible deployments across multiple platforms such as AWS, Azure, and GCP, using both containers and VMs.
Each legacy application must be evaluated and modernized on a case-by-case basis to ensure it is ready to be deployed to a newly refreshed cloud infrastructure.
The security profile of components at the infrastructure and application layers will change dramatically; security must be a key focus of all cloud adoption efforts.
This book will cover all three pillars, with an emphasis on application modernization and migration. Legacy applications often depend directly on server resources, such as access to a local filesystem, while also requiring manual steps for day-to-day operations, such as accessing individual servers to check log files—a very frustrating experience if you have dozens of servers to check! Some basic refactorings are required for legacy applications to work properly on cloud infrastructure, but minimal refactorings only scratch the surface of what is necessary to make the most of cloud infrastructure.
This book will demonstrate how to treat the cloud as an unlimited pool of resources that brings both scale and resilience to your systems. While the cloud is an enabler for these properties, it doesn’t provide them out of the box; for that we must evolve our applications from legacy to cloud native.
We also need to think carefully about security. Traditional applications are secured at the perimeter, what David Strauss refers to as Death Star security, but once infiltrated these systems are completely vulnerable to attack from within. As we begin to break apart our monoliths we expose a larger attack surface to the outside world, which makes the system as a whole more vulnerable. Security must no longer come as an afterthought.
We will cover proven steps and techniques that will enable us to take full advantage of the power and flexibility of cloud infrastructure. But before we dive into specific techniques, let’s first discuss the properties and characteristics of cloud native systems.
The Cloud Native Computing Foundation (CNCF) is a Linux Foundation project that aims to provide stewardship and foster the evolution of the cloud ecosystem. Some of the most influential and impactful cloud native technologies, such as Kubernetes, Prometheus, and Fluentd, are hosted by the CNCF.
The CNCF defines cloud native systems as having three properties:
Running applications and processes in software containers as an isolated unit of application deployment, and as a mechanism to achieve high levels of resource isolation.
Actively scheduled and actively managed by a central orchestrating process.
Loosely coupled with dependencies explicitly described (e.g., through service endpoints).
Resource isolation is the key to building maintainable, robust applications. We can bundle our applications using technologies such as Docker, which allows us to create isolated units of deployment, while also eliminating inconsistencies when moving from environment to environment. With a single command we can build a container image that contains everything that our application requires, from the exact Linux distribution, to all of the command line tools needed at runtime. This gives us an isolated unit of deployment that we can start up on our local machine in the exact same way as in the cloud.
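For example, a minimal Dockerfile (the base image, packages, and application paths here are all illustrative) pins the exact Linux distribution and bundles everything the application needs:

```dockerfile
# Pin the exact distribution the application will run on.
FROM ubuntu:16.04

# Install only the command-line tools the application needs at runtime.
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Bundle the application itself into the image.
COPY app/ /opt/app/
CMD ["/opt/app/run.sh"]
```

A single `docker build -t my-app .` then produces an image that starts the same way on a laptop as it does in the cloud.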
Once we begin to bundle and deploy our applications using containers, we need to manage those containers at runtime across a variety of cloud-provisioned hardware. The difference between container technologies such as Docker and virtualization technologies such as VMware is that containers abstract away machines completely. Instead, our system is composed of a number of containers that need access to system resources, such as CPU and memory. We don’t explicitly deploy container X to server Y. Rather, we delegate this responsibility to a manager, allowing it to decide where each container should be deployed and executed based on the resources the containers require and the state of our infrastructure. Technologies such as DC/OS from Mesosphere provide the ability to schedule and manage our containers, treating all of the individual resources we provision in the cloud as a single machine.
The difference between a big ball of mud and a maintainable system is well-defined boundaries and interfaces between conceptual components. We often talk about the size of a component, but what’s really important is its complexity. Measuring lines of code is the worst way to quantify the complexity of a piece of software. How many lines of code are complex? 10,000? 42?
Instead of worrying about lines of code, we must aim to reduce the conceptual complexity of our systems by isolating unique components from each other. Isolation helps to enhance the understanding of components by reducing the amount of domain knowledge that a single person (or team) requires in order to be effective within that domain. In essence, a well-designed component should be complex enough to add business value, but simple enough to be completely understood by the team that builds and maintains it.
Microservices are an architectural style of designing and developing components of container-packaged, dynamically managed systems. A service team may build and maintain an individual component of the system, while the architecture team understands and maintains the behavior of the system as a whole.
Whether public, private, or hybrid, the cloud transforms infrastructure from physical servers into near-infinite pools of resources that are allocated to do work.
There are three distinct approaches to cloud infrastructure:
A hypervisor can be installed on a machine, allowing a single server to host many discrete “virtual machines,” each created and used independently
A container management platform can be used to manage infrastructure and automate the deployment and scaling of container packaged applications
A serverless approach foregoes building and running code in an environment and instead provides a platform for the deployment and execution of functions that integrate with public cloud resources (e.g., database, filesystem, etc.)
Installing a hypervisor such as VMware’s ESXi was the traditional approach to creating a cloud. Virtual machines are installed on top of the hypervisor, with each virtual machine (VM) allocated a portion of the computer’s CPU and RAM. Applications are then installed inside an operating system on the virtual machine. This approach allows for better utilization of hardware compared to installing applications directly on the operating system, as the resources are shared amongst many virtual machines.
Traditional public cloud offerings such as Amazon EC2 and Google Compute Engine (GCE) offer virtual machines in this manner. On-premise hardware can also be used, or a blend of the two approaches can be adopted (hybrid-cloud).
A more modern approach to cloud computing is becoming popular with the introduction of tools in the Docker ecosystem. Container management tools enable the use of lightweight VM-like containers that are installed directly on the operating system. This approach is more efficient than running VMs on a hypervisor, as only a single operating system runs on each machine rather than a full operating system, with all of its overhead, running within each VM. It retains most of the benefits of using full VMs, but with better utilization of hardware. It also frees us from some of the configuration management burden and potential licensing costs of running many extra operating systems.
Public container-based cloud offerings are also available such as Amazon EC2 Container Service (ECS) and Google Container Engine (GKE).
The difference between VMs and containers is outlined in Figure 1-1.
Another benefit of using a container management tool instead of a hypervisor is that the infrastructure is abstracted away from the developer. Management of virtual machine configuration is greatly simplified by using containers, as all resources are configured uniformly in the “cluster.” In this scenario, provisioning tools like Ansible can be used to add servers to the container cluster, while configuration management tools like Chef or Puppet handle configuring the servers themselves.
Once an organization adopts cloud infrastructure, there’s a natural gravitation towards empowering teams to manage their own applications and services. The operations team becomes a manager and provider of resources in the cloud, while the development team controls the flow and health of applications and services deployed to those resources. There’s no more powerful motivator for creating resilient systems than when a development team is fully responsible for what they build and deploy.
These approaches promise to turn your infrastructure into a self-service commodity that DevOps personnel can use and manage themselves. For example, DC/OS—“Datacenter Operating System” from Mesosphere—gives a friendly UI to all of the individual tools required to manage your infrastructure as if it were a single machine, so that DevOps personnel can log in, deploy, test, and scale applications without worrying about installing and configuring an underlying OS.
DC/OS is a collection of open source tools that act together to manage datacenter resources as an extensible pool. It comes with tools to manage the lifecycle of container deployments and data services, to aid in service discovery, load balancing, and networking. It also comes with a UI to allow teams to easily configure and deploy their applications.
DC/OS is centered around Apache Mesos, the distributed systems kernel that abstracts away the resources of servers. Mesos effectively transforms a collection of servers into a pool of resources: CPU and RAM.
Mesos on its own can be difficult to configure and use effectively. DC/OS eases this by providing all necessary installation tools, along with supporting software such as Marathon for managing tasks, and a friendly UI to ease the management and installation of software on the Mesos cluster. Mesos also offers abstractions that allow stateful data service deployments. While stateless services can run in an empty “sandbox” every time they are run, stateful data services such as databases require some type of durable storage that persists through runs.
While we cover DC/OS in this guide primarily as a container management tool, DC/OS is quite broad in its capabilities.
Container management platforms manage how resources are allocated to each application instance, as well as how many copies of an application or service are running simultaneously. Similar to how resources are allocated to a virtual machine, a fraction of a server’s CPU and RAM are allocated to a running container. An application is easily “scaled out” with the click of a button, causing Marathon to deploy more containers for that application onto agents.
Additional agents can also be added to the cluster to extend the pool of resources available for containers to use. By default, containers can be deployed to any agent, and generally we shouldn’t need to worry about which server the instances are run on. Constraints can be placed on where applications are allowed to run, whether for policy reasons, such as building security rules into the cluster, or for performance reasons, such as two services needing to run on the same physical host to meet latency requirements.
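As an illustration, a Marathon application definition (all names and values here are hypothetical) declares the container image, the resources each instance needs, the instance count, and optional placement constraints:

```json
{
  "id": "/orders-service",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "mycompany/orders:1.4.2" }
  },
  "cpus": 0.5,
  "mem": 512,
  "instances": 3,
  "constraints": [["hostname", "UNIQUE"]]
}
```

Scaling out is just a change to `instances`; Marathon decides which agents the extra containers land on, and the `UNIQUE` constraint here asks it to place each instance on a different host.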
Much like Marathon, Kubernetes—often abbreviated as k8s—automates the scheduling and deployment of containerized applications into pools of compute resources. Kubernetes has different concepts and terms than those that DC/OS uses, but the end result is very similar when considering container orchestration capabilities.
DC/OS is a more general-purpose tool than Kubernetes, suitable for running traditional services such as data services and legacy applications as well as container packaged services. Kubernetes might be considered an alternative to DC/OS’s container management scheduling capabilities alone—directly comparable to Marathon and Mesos rather than the entirety of DC/OS.
In Kubernetes, a pod is a group of one or more containers described in a definition. That definition expresses the “desired state”: what the running environment should look like. Similar to Marathon, the Kubernetes cluster management services will attempt to schedule the containers onto a pool of workers in the cluster. Workers are roughly equivalent to Mesos agents.
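For example, a simple pod definition (the names, image, and resource values are illustrative) declares the desired state rather than imperatively placing a container on a specific worker:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-service
  labels:
    app: orders
spec:
  containers:
  - name: orders
    image: mycompany/orders:1.4.2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
    ports:
    - containerPort: 8080
```

Submitting this definition (e.g., with `kubectl apply -f pod.yaml`) asks the cluster to converge on that state; the scheduler chooses a suitable worker based on the declared resource requests.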
A kubelet process runs on each worker, monitoring for failure and notifying the cluster management services whenever a deviation from the desired state is detected. This enables the cluster to recover and return to a healthy condition.
For the purposes of this book, we will favor DC/OS’s approach. We believe that DC/OS is a better choice in a wider range of enterprise situations. Mesosphere offers commercial support, which is critical for enterprise projects, while also remaining portable across cloud vendors.
A common topology for enterprise cloud infrastructure is a hybrid-cloud model. In this model, some resources are deployed to a public cloud—such as AWS, GCP, or Azure—and some resources are deployed to a “private cloud” in the enterprise data center. This hybrid cloud can expand and shrink based on the demand of the underlying applications and other resources that are deployed to it. VMs can be provisioned from one or more of the public cloud platforms and added as an elastic extension pool to a company’s own VMs.
Both on-premise servers and provisioned servers in the cloud can be managed uniformly with DC/OS. Servers can be dynamically managed in the container cluster, which makes it easier to migrate from private infrastructure out into the public cloud; simply extend the pool of resources and slowly turn the dial from one to the other.
Hybrid clouds are usually sized so that most of the normal load can be handled by the enterprise’s own data center. The data center can continue to be built in a classical style and managed under traditional processes such as ITIL. The public cloud can be leveraged exclusively during grey sky situations, such as unexpected spikes in demand or failures in the enterprise’s own infrastructure.
The hybrid-cloud model ensures a near-endless pool of global infrastructure resources available to expand into, while making better use of the infrastructure investments already made. A hybrid-cloud infrastructure is best described as elastic; servers can be added to the pool and removed just as easily. Hybrid-cloud initiatives typically go hand-in-hand with multi-cloud initiatives, managed with tools from companies such as RightScale to provide cohesive management of infrastructure across many cloud providers.
Serverless technology enables developers to deploy purely stateless functions to cloud infrastructure, which works by pushing all state into the data tier. Serverless offerings from cloud providers include tools such as AWS Lambda and Google Cloud Functions.
This may be a reasonable architectural decision for smaller systems or organizations exclusively operating on a single cloud provider such as AWS or GCP, but for enterprise systems it’s often impossible to justify the lack of portability across cloud vendors. There are no open standards in the world of serverless computing, so you will be locked into whichever platform you build on. This is a major tradeoff compared to using an application framework on general cloud infrastructure, which preserves the option of switching cloud providers with little friction.
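The unit of deployment in the serverless model is a pure, stateless function. A minimal sketch of a Lambda-style handler in Python (the event shape and field names are illustrative assumptions) shows the idea: everything the function needs arrives in the event, and anything it must remember would be written to the data tier rather than kept in process memory:

```python
def handler(event, context):
    """A stateless function: it holds no state between invocations.
    Any durable state (orders, sessions, etc.) would live in the data
    tier, e.g. a managed database, not in this process."""
    items = event.get("items", [])
    # Compute the order total purely from the incoming event.
    total = sum(item["price"] * item["quantity"] for item in items)
    return {"statusCode": 200, "body": {"order_total": total}}
```

Because the function is pure, the platform is free to start, stop, and scale copies of it at will, which is exactly what makes the model attractive, and exactly why all state must be pushed into the data tier.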
1 451 Global Digital Infrastructure Report, April 2017.