Chapter 4. Getting Cloud-Native Deployments Right
In traditional enterprise deployments, it’s typical to have only a few machines and a single shared database. We assume the server and database IP will never change. We may use a virtual IP address (VIP) to dynamically reroute traffic in the event of a failure, but for the most part keep things simple with hardcoded or config file–based static IPs.
Before a major deployment, everyone sits in a room and discusses the plan before submitting the change request. The Change Management team reviews all of the scheduled changes to ensure the change won’t interfere with other activities in the organization. They’ll approve the deployment after reviewing all of the artifacts in a Change Advisory Board (CAB) meeting.
To redeploy the application, everyone gets on a conference call in the middle of the night and starts the deployment process by changing the httpd configuration to display a maintenance page. Two hours later, after manually making changes to the database schema, and modifying some configuration by hand, everyone congratulates each other on a job well done, hangs up, and goes back to bed. As we fall asleep, we pretend everything is fine; there’s no way we made a few mistakes when we manually edited the environment, right?
“On the bright side, even if the worst happens it’s not like I own the company. I can always find another job.”
But the first time an enterprise team redeploys their systems in the middle of the day with no maintenance window is a special event.
“We’re deploying right now. We’re deploying in the middle of the day!”
Nobody can make a mistake with the configuration at the last minute because every aspect of deployment, from start to finish, is codified, reviewed, tested, and automated. The processes are tested and stable, so we know that they will behave exactly the same way every time. Nobody logs into a server; in many cases, organizations will completely disable SSH access to production environments. This is what it means to have operational maturity.
Once a team reaches a high level of operational maturity, it's very difficult to introduce human error into the deployment equation. A successful deployment is always a click away and takes only a few minutes—no downtime, no maintenance windows. The redeployment job removes a node from the load balancer, redeploys, and adds the node back to the load balancer, waiting for application health to be confirmed before moving on to the next node.
In this chapter we outline best practices for deploying services and discuss how to integrate these new methodologies into the organization's established processes.
Organizational Challenges
Before we get to the deployment best practices themselves, it’s worth highlighting some of the difficulties you’ll likely encounter when refreshing development and operational practices in larger, established organizations.
Most large organizations employ a set of “good practices” described in the Information Technology Infrastructure Library (ITIL). Practices such as change management are described in ITIL, and technology leadership across the globe believes that implementation of these practices is a yardstick to measure maturity of the technical organization.
Unfortunately, change management practices can be at odds with many of the modern deployment practices we’ve seen in the most innovative technology companies. In past years, monthly or even quarterly releases were the norm (and continue to be the norm in many enterprises).
With the explosion in popularity of Agile software development practices, development teams started to move toward sprint-based releases, with a single release at the end of each sprint, whether weekly or biweekly. Teams would ensure that whatever code was checked in could be released at any point in time. Radical innovations in practices such as continuous delivery drop the cost of redeploys, and with good development practices and mature deployment pipelines, risk falls as well. According to a survey by New Relic, many organizations are deploying daily or even multiple times per day, and the trend is toward more frequent deployments over time.
In change management practices, logging into a database and manually adding a row is considered a "change." Interacting with an application that adds a row is not considered a "change," as the application has already been tested and approved for release.
To reconcile continuous deployment with change management, we can frame our activities so that changes to the deployment pipeline are treated as changes, but once the deployment pipeline is in place, deployments themselves are a piece of software functionality. A deployment is not a "change," then—it is a piece of functionality in deployed software. Changes to application code are trickier to fit into change management processes, so you may have to meet your organization in the middle.
Political challenges can also be difficult to overcome. There will be many people within an organization who are invested in the status quo. Clayton M. Christensen goes even further in his seminal book The Innovator's Dilemma (HarperBusiness), describing how innovation within an organization is often treated like an "invasive species that the organization will attack like an immune system response, even if the attack of innovation causes the failure of the organization."
The unknown is scary to most people. Changing the status quo can be threatening and feel "personal," especially for those who were responsible for defining the status quo. It can be difficult to navigate political waters until you prove that the innovations you evangelize are effective. The role of a leader is not simply technical; it also involves evangelism and navigating corporate politics. If you believe in your vision, you must build alliances and work toward small victories, using those initial victories to grow your alliances and further innovation. Only these activities will cause change within a large organization. This is why organizations move so slowly: the most talented technical people within an organization can be the most resistant to politicking. But it's essential to inspire change rather than force change—the latter is not effective.
Deployment Pipeline
If legacy infrastructure and deployment schedules are simple and infrequent enough, manual deployments may not be a significant cost problem. Even with human error taken into consideration, if deployments are infrequent and exist within a maintenance window it may not be logical to build new automated deployment capabilities. We can SSH into servers, run a few scripts, and call it a day. This is possible because we target infrequent—maybe quarterly—deployments, and the topology of our deployments is static.
When we have hundreds of servers in different geographic areas running dozens of different services with servers that change daily, the scale of the problem grows exponentially. It’s no longer reasonable to assume someone will SSH onto each server to deploy code and manually execute scripts, let alone the day-to-day tasks like pulling exception logs from individual servers. Even keeping a server inventory up-to-date is unreasonable at this point. What we need is a reliable mechanism to manage the entire lifecycle of our systems, including infrastructure, code, and deployments. In this scenario, the goal from day one should be to eventually disable SSH access to all servers. In order to do that, you need to consider all reasons why you would log on to a server, and then automate those tasks.
Abstractions exist that help us deal with the complexity of enterprise-scale infrastructure. More traditional VM infrastructure might have “instance groups” that boot machines from “instance templates,” while container orchestration platforms such as DC/OS and Kubernetes completely abstract away infrastructure from our applications and services.
Both paths are viable—it’s up to you and your organization to decide what works the best. A lot of innovation has gone into container deployments and related clustering platforms, so it would be advisable to consider them even if they are unfamiliar in your organization.
Infrastructure Automation
Before you deploy your applications, you should consider how your infrastructure is managed. Building, scaling, and removing servers should not be a human task.
Whether you're leaning toward hypervisors or container management platforms, your best bet is to manage infrastructure the same way you manage code. This might seem counterintuitive in organizations where operations and development live in separate worlds, but the emerging DevOps culture has created an "organizational Venn diagram" where the two disciplines overlap and collaborate.
If you're logging on to servers and making manual changes, then you'll certainly introduce "configuration drift" through human error. If you instead stop updating servers by hand and apply "idempotent updates" through a configuration management tool, it's possible to keep the same servers running for a very long time without drift. Tools like Ansible are incredibly easy to pick up and learn for people of varying skill levels, and the resulting playbooks can be reviewed like code. "Pull requests" are made against infrastructure and configuration management code bases, where all stakeholders have a chance to review and catch any issues before they are promoted to live environments.
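For illustration, a minimal Ansible playbook might look like the following; the package, paths, and service name are hypothetical. Because each task describes a desired state rather than a command to run, applying the playbook a second time leaves the server unchanged—this is what makes the updates idempotent.

- hosts: app_servers
  become: true
  tasks:
    - name: Ensure the Java runtime is installed
      apt:
        name: openjdk-8-jre-headless
        state: present
    - name: Render the application configuration from a reviewed template
      template:
        src: application.conf.j2
        dest: /opt/my-app/application.conf
      notify: restart my-app
  handlers:
    - name: restart my-app
      service:
        name: my-app
        state: restarted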
"Immutable servers" are preferable, where servers are regularly decommissioned and rebuilt instead of updated in place. As discussed, this type of approach ensures beyond any doubt that the running servers accurately reflect the configuration that has been committed.
Tools like DC/OS and Kubernetes make these approaches simpler by ensuring that the container is largely isolated from the underlying environment.
The only things that can vary between two instances of the same packaged application are:
- The configuration of the application inside the container (immutable application configuration)
- The configuration of the container itself (e.g., the configuration expressed in a Dockerfile)
Both are clearly expressed in code when working with containers and container management tools like DC/OS or Kubernetes, which means there should be no potential for configuration drift.
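As a sketch of the two variants for a hypothetical Java service—the base image, paths, and jar name are illustrative—the container configuration is the Dockerfile itself, while the application configuration inside the container stays environment-driven:

FROM openjdk:8-jre-alpine

# Container configuration: what is baked into the image.
COPY target/my-awesome-app.jar /opt/app/app.jar
EXPOSE 8080

# Application configuration is not baked in; values such as HOST or DB_URL
# are read from the environment at runtime, keeping the image portable.
ENTRYPOINT ["java", "-jar", "/opt/app/app.jar"]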
Expressing infrastructure as code should be a top priority for teams in complex environments, especially as additional resources are provisioned to expand the capacity of the cluster.
Configuration in the Environment
It’s tempting to treat configuration as an afterthought. After all, it’s not the hottest area of modern computing, but it’s an important one. A look at post-mortems shows that configuration is a common cause of incidents.
Best practices related to distributed systems are outlined in The Twelve-Factor App. Twelve-Factor Apps pass configuration to the application from the environment. For instance, the IP of the database is not added to the application’s configuration, but instead it is added to the environment and referenced from a variable. Success is measurable. For instance, can the same artifact running in a pre-production environment be copied to the production servers? If the answer is no, explore what decisions have been made that limit this portability, and research if environment variables can mitigate the issue.
For example, in DC/OS and Kubernetes, it's quite easy to create and reference environment variables. The Typesafe Config library can be configured to use a variable from the environment or else fall back to development-mode values (Example 4-1).
Example 4-1. Typesafe Config library example of using environment variables along with a fallback mechanism to use development mode configuration if the environment variables are not present.
hostname = "127.0.0.1"
hostname = ${?HOST}

The second assignment is marked as optional with the ? character. 127.0.0.1 will be used if no HOST variable is present in the environment. This allows a developer to boot the application locally in test mode, while allowing a startup script installed in the environment, or a container management task scheduler such as Marathon, to pass in variables. As long as tooling in the deployment chain considers these qualities, it's possible to use environment variables for configuration.
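On the application side, reading the resolved value requires no special handling of the environment. A minimal sketch using the Typesafe Config API (the class name is illustrative):

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class AppConfig {
    public static void main(String[] args) {
        // Loads application.conf from the classpath and resolves substitutions;
        // ${?HOST} falls back to the HOST environment variable when present.
        Config config = ConfigFactory.load();
        String hostname = config.getString("hostname");
        System.out.println("Connecting to " + hostname);
    }
}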
Artifacts from Continuous Integration
The build artifacts need to be stored somewhere for deployment. The decision to use either containers or traditional virtualization infrastructure will change how the artifacts are stored and deployed.
If you’re using VMs, you’ll likely store the application build artifacts after successful builds in your continuous integration (CI) tool such as Jenkins or Bamboo. After building the application, it’s common to have an automatic deployment into a test environment. From there, it’s easy to take the artifact and “promote” it to further environments—eventually to QA. Assuming the configuration is stored in the environment, you never need to build an artifact specifically for production.
As mentioned, if you’re using traditional VM infrastructure, we recommend destroying and recreating servers with a tool such as Terraform instead of trying to manage the servers over multiple releases with a tool such as Ansible alone. Ansible promises idempotent changes, but there is a risk of “drift” that may have some impact eventually.
If you’re using Kubernetes or DC/OS, then the process is slightly different:
- The end of the build process from CI pushes the container to a private Docker registry.
- The cluster task—for example, Marathon configuration or Pod definition—is modified with the new docker image from the build task.
- The platform will note that the state of the configuration is different than the running task so it will begin a rolling restart.
Promotion of an artifact is done by updating the Marathon or Pod definition in the next environment.
Logs
Logs should be sent to a central location where they can be searched via a log aggregation mechanism such as the ELK stack. The ELK stack is powered by Elasticsearch and fed logs by Logstash, an agent that runs on each machine to collect and transport logs from individual servers. Finally, Kibana rounds out ELK by providing a friendly UI for querying, navigating, and displaying the log information stored in Elasticsearch.
Be Aware of Existing Tools
It’s common for enterprises to have their own log aggregation tools such as Splunk, so it’s important to do full research of what already exists in your organization before advocating for the introduction of new tools.
A topic worth highlighting is the importance of a standardized log format. If extra information, such as correlation (trace) IDs or user IDs, is inserted into log lines across all services in a uniform manner, tools like Logstash can ensure that those entries are correctly indexed. Standardizing the log format across all applications and services allows easier insight into system-wide logs by providing uniform navigability of all log entries.
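As a sketch, assuming SLF4J with a logging backend whose layout (or JSON encoder shipping to Logstash) includes MDC values, a correlation ID and user ID can be attached to every log line produced while handling a request; the class and field names are illustrative.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    public void handle(String correlationId, String userId) {
        // Values placed in the mapped diagnostic context are added to every
        // log line by the layout, so Logstash can index them uniformly.
        MDC.put("correlationId", correlationId);
        MDC.put("userId", userId);
        try {
            log.info("processing order");
        } finally {
            MDC.clear();
        }
    }
}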
Autoscaling
Rules can be put in place to automatically add extra resources to our infrastructure when needed. Traffic patterns in production systems are often unpredictable, so the ability to add resources on the fly becomes critical to achieve our goal of implementing scalable, elastic systems.
For example, some of the record-setting Powerball lotteries in the United States have caused unexpectedly large quantities of traffic to lottery websites as everyone rushes to see if they hit the jackpot! To ensure that services stay available—at a reasonable cost—they need to be scaled out and back in as traffic patterns change. Otherwise, teams will need to provision more hardware than required at the highest possible peak of traffic, which at most times will be an enormous waste of money as resources go underutilized.
In the case of VMs, images can be used to bootstrap an application so that provisioning resources doesn’t require human intervention. You’ll need to ensure that all build processes and autoscaling processes trigger the exact same “infrastructure as code” and “configuration as code” mechanisms to reliably configure the servers.
Tools such as DC/OS and Kubernetes can be configured to allocate more resources to a service when CPU utilization or requests per second thresholds are exceeded. You’ll need to consider how to expand the cluster itself if it reaches a critical point, by adding more servers to the public cloud (assuming a hybrid-cloud topology).
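In Kubernetes, for example, this is typically done with a HorizontalPodAutoscaler; a minimal sketch (the deployment name and thresholds are illustrative) might look like the following.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-awesome-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-awesome-app
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70%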
As has been mentioned, your applications need to start quickly to ensure that autoscaling happens before application health is impacted by rapid spikes in traffic.
Scaling Down
Removing nodes, it turns out, is a lot harder than adding them. We want to get as close to zero failures as possible during scale-down events and redeployments, although realistically we must be tolerant of a few dropped requests.
The native way to deal with shutdown signals is to send the process a SIGTERM to notify the application that it should clean up and shut down. For applications running in containers, Docker will issue a SIGTERM to your application when the container starts to shut down, wait a few seconds, and then send a SIGKILL if the application has not shut itself down. By building your applications around SIGTERM you ensure that they can run anywhere. Some tools use HTTP endpoints to control shutdown, but it's better to be infrastructure agnostic and prefer SIGTERM as the signal for graceful shutdown.
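In a JVM application, a minimal way to react to SIGTERM is a shutdown hook; what is done inside the hook—draining in-flight requests, closing connection pools, deregistering from a service registry—is application specific.

public class Main {
    public static void main(String[] args) {
        // The JVM runs shutdown hooks when it receives SIGTERM, giving the
        // application a chance to finish in-flight work before the process
        // exits (and before Docker escalates to SIGKILL).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("SIGTERM received, cleaning up...");
            // e.g., stop accepting new requests, close pools, deregister
        }));

        // ... start the HTTP server or actor system here ...
    }
}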
Akka Cluster improved its graceful shutdown significantly in Akka 2.5, so developers don’t necessarily need to implement specific shutdown logic. Actors move between running nodes in the cluster during shutdown, so there may be a very small window when requests in flight won’t hit the referenced actor, but an Akka cluster will migrate and recover from graceful shutdown quickly.
For external-facing applications you want to “drain” nodes before shutting down by removing the application from the load balancer gracefully, and then waiting a short period of time for any requests in-flight to finish. By implementing node draining logic you will be able to scale down or redeploy in the middle of the day without dropping a single user request!
For DC/OS in particular, if you're using HTTP/REST, we recommend looking at the layer 7 Marathon-LB and its Zero Downtime Deployment script (zdd.py), triggered from your CI tool, as the redeployment mechanism of choice.
Service Discovery
A key difference in cloud computing compared to more traditional deployments is that the location of services may not be known and cannot be easily managed manually. Heritage enterprise applications typically have a configuration file for each environment where services and databases are described, along with locations such as IP addresses.
Managing IPs in configuration files is problematic in cloud deployments: if servers restart, or if you scale up or down, IPs can change or be unknown to already-running applications, so a more reliable service discovery mechanism is needed. There are a few approaches that can be used, depending on the infrastructure strategy selected.
Service Registry
Services can register themselves in a datastore such as Zookeeper or Consul. These key-value stores were built for this type of problem. They can take care of managing a session to clean up any nodes that crash ungracefully, as well as notifying all watchers of any changes in the state of running applications.
As an example, both Kafka and DC/OS Mesos masters use Zookeeper for discovering and coordinating running nodes. These solutions are reliable and battle-tested, but they do require writing some code in the client for registration and discovery. Client libraries like Apache Curator (originally from Netflix) have many common recipes built in to make this relatively simple and to eliminate errors.
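As a sketch of the registration side using Apache Curator's service discovery recipe—the connection string, service name, and address are illustrative:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public class Registration {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory
                .newClient("zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Describe this instance; the znode Curator creates for it goes away
        // automatically if the process dies without deregistering.
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("my-awesome-app")
                .address("10.0.0.12")
                .port(8080)
                .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")
                .build();
        discovery.start();
        discovery.registerService(instance);
    }
}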
Do Not Share Zookeeper Clusters
Although Kafka or DC/OS may come with Zookeeper out of the box, in order to isolate incidents it’s recommended that you run a separate Zookeeper cluster for service discovery.
DNS or Load Balancers
DNS or load balancers (LBs) can be used to aid in service discovery. In certain cases you need to know the exact location of a server, so this will not fit all use cases. However, TCP traffic can be routed through a layer 4 LB, so this approach is generic enough to work with most applications.
Kubernetes and DC/OS have their own service discovery abstractions that give a name to a service to allow its members to be discovered using a blend of DNS (to find the load balancer) and layer 4 or layer 7 load-balancing mechanisms.
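For example, a minimal Kubernetes Service gives a stable DNS name to whatever pods match its label selector; the names and ports below are illustrative.

apiVersion: v1
kind: Service
metadata:
  name: my-awesome-app
spec:
  selector:
    app: my-awesome-app
  ports:
    - port: 80          # the port other services use: http://my-awesome-app
      targetPort: 8080  # the port the container actually listens on

Other services in the same namespace can then reach the application at http://my-awesome-app without knowing anything about individual pod IPs.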
Cluster/Gossip
Some tools such as Cassandra or Akka require only that a “seed node” be discoverable and then the cluster will “gossip” about its other members to keep information about the dynamic state of the cluster up-to-date.
If running on traditional infrastructure or on a hypervisor it can be easier to maintain a few seed nodes for use in discovering the service. In Kubernetes or DC/OS it may be easier to use another service discovery mechanism to find another running node first.
Tools exist to aid in bootstrapping clusters in different environments. For example, in DC/OS there is a Cassandra package that uses Mesos-DNS to ease discovery of running Cassandra nodes. For Akka Cluster, an open source library called ConstructR can aid in seed node discovery. Lightbend also has a commercial tool, ConductR, that can aid in managing your running cluster on container orchestration platforms such as DC/OS or Kubernetes.
Cloud-Ready Active-Passive
A similar problem to service discovery is leader election. Often a single instance of an application will be responsible for a task—batch processing in an enterprise for example—where having multiple running instances would cause data corruption or other race conditions.
Rather than deploying a single instance, the application can be modified so that one instance becomes the "leader" and the other instances sit on standby, ready to take over if the leader fails.
Tools like Zookeeper and Consul can be used to implement leader election in order to ensure that only a single node considers itself the leader of a given task. In the case of failure, the other nodes will be notified and will elect a new leader to take over the work. This is a useful pattern in porting existing systems that need to be the only instance doing any processing (singleton instances).
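A minimal sketch of this pattern using Curator's LeaderLatch recipe—the connection string and paths are illustrative:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BatchLeader {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory
                .newClient("zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every instance starts a latch on the same path; Zookeeper grants
        // leadership to exactly one of them at a time.
        LeaderLatch latch = new LeaderLatch(client, "/leaders/batch-processor");
        latch.start();

        latch.await();          // blocks until this instance is elected leader
        runBatchProcessing();   // only the leader ever reaches this point
    }

    private static void runBatchProcessing() { /* the singleton workload */ }
}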
Alternatively, Akka Cluster has a mechanism called "Cluster Singleton" that offloads the leader election concern to Akka. This ensures that only one of the "singleton actors" is running in the cluster at any given time. This has the same effect as implementing your own leader election logic, but relies on gossip instead of a coordination service.
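A sketch of the equivalent using Akka's classic Cluster Singleton API; BatchProcessor is a hypothetical actor holding the work that must only run once.

import akka.actor.AbstractActor;
import akka.actor.ActorSystem;
import akka.actor.PoisonPill;
import akka.actor.Props;
import akka.cluster.singleton.ClusterSingletonManager;
import akka.cluster.singleton.ClusterSingletonManagerSettings;

public class SingletonSetup {

    // A hypothetical actor that performs the work that must only run once.
    static class BatchProcessor extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .matchAny(msg -> { /* process the batch */ })
                    .build();
        }
    }

    public static void start(ActorSystem system) {
        // Akka guarantees at most one BatchProcessor runs across the cluster;
        // if its node leaves, the singleton restarts on the oldest remaining node.
        system.actorOf(
            ClusterSingletonManager.props(
                Props.create(BatchProcessor.class),
                PoisonPill.getInstance(),
                ClusterSingletonManagerSettings.create(system)),
            "batch-processor");
    }
}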
While this approach can be useful for porting existing logic, if building services from scratch—especially services that are publicly consumed—it’s often better to avoid leadership election approaches and instead use other methods that allow work sharing to ensure better availability and scalability.
Failing Fast
Inside an actor system, Akka will respond to unexpected failure by killing and restarting an actor and dropping the message that caused the failure. For many unexpected application-local errors, this is a reasonable approach as we cannot assume that either the message or the state of the actor are recoverable. Similarly, in distributed systems, there are situations in which intentionally crashing the application is the safest approach.
It might seem logical to try to gracefully handle failure inside an application but often taking a pessimistic approach is safer and more correct. As an example, if a lock is held in a remote service such as Zookeeper, a “disconnect event” for that service would mean that the lock is in an unknown state. The event could be a result of a long GC pause for example—so we can’t make any guarantees about how much time has passed or what other applications are doing with the expired lock.
While the chance of a pathological situation may be low, given enough servers running for long enough those pathological scenarios become more likely to be encountered. As we can’t make any assumptions about what other applications have done after the lock expiry, the safest response to this type of situation is to throw an exception and shut down the node. If a system is built in a resilient and fault-tolerant manner, then the shutdown of the application will not have a serious impact. It should be acceptable for a few requests to be dropped without causing a catastrophic failure or unrecoverable inconsistencies.
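As a sketch with Curator, a connection state listener can implement this fail-fast behavior; whether to treat SUSPENDED as fatal, rather than waiting for LOST, is a judgment call based on your consistency requirements.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;

public class FailFast {
    public static void register(CuratorFramework client) {
        client.getConnectionStateListenable().addListener((c, newState) -> {
            // Once the session is suspended or lost we can no longer reason about
            // locks or leadership held through Zookeeper: exit and let the
            // orchestrator restart the process in a known-good state.
            if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                System.exit(1);
            }
        });
    }
}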
When applications crash, it's important that they are restarted automatically. Kubernetes and DC/OS will handle these events by noting that the process has died and attempting to restart it immediately. If deploying directly onto an OS, a process supervisor should be used to keep the process running; in Linux, systemd is often used to start and monitor containers or processes, handling any failures automatically. An application crash should never require manual intervention to restart.
Split Brains and Islands
One of the biggest risks in running stateful clustered services is the risk of “split brain” scenarios. If a portion of nodes in a cluster becomes unavailable for a long period of time, there is no way for the rest of the services running to know if the other nodes are still running or not. Those other nodes may have had a netsplit, or they may have had a hard crash—there’s no way to know for sure.
It's possible that they will eventually become accessible again. The problem is that, if two sides of a still-running cluster are both working in isolation due to a netsplit, they must come to some conclusion about who should continue running and who should shut down. The cluster on one side of the partition must be elected as the surviving portion, and the other side must shut down. The worst-case scenario is that both portions of the cluster think they are the cluster that should be running, which can lead to major issues, such as two singleton actors (one on each side of the partition) continuing to work, creating duplicated entities and corrupting each other's data.
If you’re building your own clustered architecture using Akka, Akka’s Split Brain Resolver (SBR) will take care of these scenarios for you by ensuring that a strategy is in place, such as keep majority or keep oldest, depending on whether you wish to keep the largest of a split cluster or the oldest of a split cluster. The strategy itself is configurable. Regardless of what strategy you choose, with SBR it’s possible that your entire cluster will die, but your solution should handle this by restarting the cluster so the impact will be minimized.
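As a sketch in HOCON, enabling SBR and choosing a strategy looks roughly like the following. This uses the open source resolver that ships with newer Akka versions (2.6+); in the Akka 2.5 era, SBR was a commercial Lightbend module with a different downing-provider class.

akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  split-brain-resolver {
    active-strategy = keep-majority   # or keep-oldest, static-quorum, ...
    stable-after = 20s                # wait for the cluster to stabilize before acting
  }
}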
As much of a risk as split brain is the possibility of creating islands, where two clusters are started and are unaware of each other. If this occurs with Akka Cluster, it’s almost always as a result of misconfiguration. The ConstructR library, or Lightbend’s ConductR, can mitigate these issues by having a coordination service ensure that only one cluster can be created.
When using another clustered and stateful technology, such as another framework or datastore, you should carefully evaluate if split brain or island scenarios are possible, and understand how the tool resolves such scenarios.
Split brain and island scenarios can cause data corruption, so it’s important to mitigate any such scenarios early and carefully consider approaches for prevention.
Putting It All Together with DC/OS
We've looked at some of the concerns and approaches related to enterprise deployments, and covered how applications behave at runtime. We will conclude with a quick view of what an end-to-end example might look like, through to delivery, with DC/OS as the target platform.
First, code is stored in a repository. It’s common for smaller organizations to use GitHub in the public cloud, but generally enterprise organizations keep code within the safety of the company network, so a local installation of GitLab or GitHub Enterprise might be used instead.
Whenever code is checked in, it will trigger a CI tool—such as GitLab's built-in continuous integration functionality, or a separate tool such as Jenkins—to check out and test the latest code automatically. After the CI tool compiles the code, a Docker image is created and stored in a container registry such as Docker Hub or Amazon's ECR (Elastic Container Registry).
Assuming that the test and build succeed, the CI tool updates DC/OS's app definition (expressed in a Marathon configuration file) in the test environment, pointing Marathon to the location of the newest Docker image. This triggers Marathon to download the image and deploy (or redeploy) it.
The configuration in Example 4-2 contains the Docker image location, port, and network information (including the ports that the container should expose). This example configuration uses a fixed host port, but often you’ll be using random ports to allow multiple instances to be deployed to the same host.
Example 4-2. Example Marathon configuration.
{ "id": "my-awesome-app", "container": { "type": "DOCKER", "docker": { "image": "location/of/my/image", "network": "BRIDGE", "portMappings": [ { "hostPort": 80, "containerPort": 80, "protocol": "tcp"} ] } }, "instances": 4, "cpus": 0.1, "mem": 64, "upgradeStrategy": { "minimumHealthCapacity": 1, "maximumOverCapacity": 0.3 } }
This is a basic configuration, but it contains everything needed for Marathon to deploy an application into the cluster. The configuration describes the horizontal and vertical scale of the deployment—the number of instances and the resources provided to each of those instances. You'll note there is no information about where the applications are going to run: there is no provisioning of servers, no logging on to servers, no Linux versions, no OS configuration, and no dependencies to install. This overlap between ops and development, and the enablement of teams that these solutions provide, highlights the value of using container orchestration frameworks.
Ideally, Marathon should be configured to not drop below the current number of nodes in the deployment. It will start a redeploy by adding a small number of nodes and waiting for them to give a readiness signal via a health check endpoint. Likewise, you don't want Marathon to start too many services all at once. These behaviors are described in the upgradeStrategy section in Example 4-2, by configuring the minimumHealthCapacity and maximumOverCapacity values. If minimumHealthCapacity is set to 1, Marathon will ensure the number of servers never falls below 100% of the number of instances configured during redeploys. maximumOverCapacity dictates how many nodes should start up at a time; setting the value to 0.3, we instruct Marathon to not replace more than 30% of the instances at a time. Once the new batch of nodes is up and running and returns a 200 status code from the configured health check endpoints, Marathon will instruct Docker to shut down the old instances, which causes a SIGTERM to be issued to the application, followed after a configurable time by a SIGKILL if the process has not shut down before exceeding the configured threshold. Marathon will continue to replace nodes like this until all of the instances are running the new version.
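The readiness signal itself comes from a health check defined in the same app definition. A sketch of a typical HTTP health check block (the path and timing values are illustrative):

"healthChecks": [
  {
    "protocol": "HTTP",
    "path": "/health",
    "portIndex": 0,
    "gracePeriodSeconds": 30,
    "intervalSeconds": 10,
    "timeoutSeconds": 5,
    "maxConsecutiveFailures": 3
  }
]

Marathon only considers an instance healthy—and continues the rolling replacement—once this endpoint returns a 200 status code.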
Similar to the deployment just described, once the individuals on the team have validated the deployment, they might click a "promotion" button or merge the changes into the master branch, which will cause the application to be deployed to production.
You’ll note that infrastructure is not described here at all. We’re not using Chef, Puppet, or Ansible to maintain the environment, or Terraform to build new servers—we don’t need to do so because the containers contain everything they need to run the application. There is no separate configuration or separate production build because the environment contains the variables the application needs to run. Service discovery also exists in the environment to allow applications to find the services they depend on.
Herein lies the real benefit of using a container orchestration framework instead of a traditional hypervisor with VMs—while the tools have more specialized abstractions, once set up, they require significantly less manual work to maintain and use. Developers can much more easily create integrations into the environment for the work they may be doing.