Chapter 1. What Is a “Cloud Native” Application?

The most dangerous phrase in the language is, “We’ve always done it this way.”1

Grace Hopper, Computerworld (January 1976)

If you’re reading this book, then you’ve no doubt at least heard the term cloud native before. More likely, you’ve probably seen some of the many, many articles written by vendors bubbling over with breathless adoration and dollar signs in their eyes. If this is the bulk of your experience with the term so far, then you can be forgiven for thinking the term to be ambiguous and buzzwordy, just another of a series of markety expressions that might have started as something useful but have since been taken over by people trying to sell you something. See also: Agile, DevOps.

For similar reasons, a web search for “cloud native definition” might lead you to think that all an application needs to be cloud native is to be written in the “right” language2 or framework, or to use the “right” technology. Certainly, your choice of language can make your life significantly easier or harder, but it’s neither necessary nor sufficient for making an application cloud native.

Is cloud native, then, just a matter of where an application runs? The term cloud native certainly suggests that. All you’d need to do is pour your kludgy3 old application into a container and run it in Kubernetes, and you’re cloud native now, right? Nope. All you’ve done is make your application harder to deploy and harder to manage.4 A kludgy application in Kubernetes is still kludgy.

So, what is a cloud native application? In this chapter, we’ll answer exactly that. First, we’ll examine the history of computing service paradigms up to (and especially) the present, and discuss how the relentless pressure to scale drove (and continues to drive) the development and adoption of technologies that provide high levels of dependability at often vast scales. Finally, we’ll identify the specific attributes associated with such an application.

The Story So Far

The story of networked applications is the story of the pressure to scale.

The late 1950s saw the introduction of the mainframe computer. At the time, every program and piece of data was stored in a single giant machine that users could access by means of dumb terminals with no computational ability of their own. All the logic and all the data lived together as one big happy monolith. It was a simpler time.

Everything changed in the 1980s with the arrival of inexpensive network-connected PCs. Unlike dumb terminals, PCs were able to do some computation of their own, making it possible to offload some of an application’s logic onto them. This new multitiered architecture—which separated presentation logic, business logic, and data (Figure 1-1)—made it possible, for the first time, for the components of a networked application to be modified or replaced independently of the others.

Figure 1-1. A traditional three-tiered architecture, with clearly defined presentation, business logic, and data components

In the 1990s, the popularization of the World Wide Web and the subsequent “dot-com” gold rush introduced the world to software as a service (SaaS). Entire industries were built on the SaaS model, driving the development of more complex and resource-hungry applications, which were in turn harder to develop, maintain, and deploy. Suddenly the classic multitiered architecture wasn’t enough anymore. In response, business logic started to get decomposed into subcomponents that could be developed, maintained, and deployed independently, ushering in the age of microservices.

In 2006, Amazon launched Amazon Web Services (AWS), which included the Elastic Compute Cloud (EC2) service. Although AWS wasn’t the first infrastructure as a service (IaaS) offering, it revolutionized the on-demand availability of data storage and computing resources, bringing cloud computing—and the ability to quickly scale—to the masses, catalyzing a massive migration of resources into “the cloud.”

Unfortunately, organizations soon learned that life at scale isn’t easy. Bad things happen, and when you’re working with hundreds or thousands of resources (or more!), bad things happen a lot. Traffic will wildly spike up or down, essential hardware will fail, upstream dependencies will become suddenly and inexplicably inaccessible. Even if nothing goes wrong for a while, you still have to deploy and manage all of these resources. At this scale, it’s impossible (or at least wildly impractical) for humans to keep up with all of these issues manually.

What Is Cloud Native?

Fundamentally, a truly cloud native application incorporates everything we’ve learned about running networked applications at scale over the past 60 years. It is scalable in the face of wildly changing load, resilient in the face of environmental uncertainty, and manageable in the face of ever-changing requirements. In other words, a cloud native application is built for life in a cruel, uncertain universe.

But how do we define the term cloud native? Fortunately for all of us,5 we don’t have to. The Cloud Native Computing Foundation—a subfoundation of the renowned Linux Foundation, and something of an acknowledged authority on the subject—has already done it for us:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds….

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.6

Cloud Native Computing Foundation, CNCF Cloud Native Definition v1.0

By this definition, cloud native applications are more than just applications that happen to live in a cloud. They’re also scalable, loosely coupled, resilient, manageable, and observable. Taken together, these “cloud native attributes” can be said to constitute the foundation of what it means for a system to be cloud native.

As it turns out, each of those words has a pretty specific meaning of its own, so let’s take a look.

Scalability

In the context of cloud computing, scalability can be defined as the ability of a system to continue to behave as expected in the face of significant upward or downward changes in demand. A system can be considered to be scalable if it doesn’t need to be refactored to perform its intended function during or after a steep increase in demand.

Because unscalable services can seem to function perfectly well under initial conditions, scalability isn’t always a primary consideration during service design. While this might be fine in the short term, services that aren’t capable of growing much beyond their original expectations also have a limited lifetime value. What’s more, it’s often fiendishly difficult to refactor a service for scalability, so building with it in mind can save both time and money in the long run.

There are two different ways that a service can be scaled, each with its own associated pros and cons:

Vertical scaling

A system can be vertically scaled (or scaled up) by upsizing (or downsizing) the hardware resources that are already allocated to it. For example, by adding memory or CPU to a database that’s running on a dedicated computing instance. Vertical scaling has the benefit of being technically relatively straightforward, but any given instance can only be upsized so much.

Horizontal scaling

A system can be horizontally scaled (or scaled out) by adding (or removing) service instances. For example, this can be done by increasing the number of service nodes behind a load balancer, or the number of containers in Kubernetes or another container orchestration system. This strategy has a number of advantages, including redundancy and freedom from the limits of available instance sizes. However, more replicas mean greater design and management complexity, and not all services can be horizontally scaled.

Given that there are two ways of scaling a service—up or out—does that mean that any service whose hardware can be upscaled (and is capable of taking advantage of increased hardware resources) is “scalable”? If you want to split hairs, then sure, to a point. But how scalable is it? Vertical scaling is inherently limited by the size of available computing resources, so a service that can only be scaled up isn’t very scalable at all. If you want to be able to scale by ten times, or a hundred, or a thousand, your service really has to be horizontally scalable.

So what’s the difference between a service that’s horizontally scalable and one that’s not? It all boils down to one thing: state. A service that doesn’t maintain any application state—or which has been very carefully designed to distribute its state between service replicas—will be relatively straightforward to scale out. For any other application, it will be hard. It’s that simple.
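To make the role of state concrete, here’s a minimal Go sketch (the handler, variable names, and port are invented for this example, not taken from the book’s code). Each replica of this service keeps a hit counter in its own memory, so two instances behind a load balancer would report different, diverging counts; a horizontally scalable version would keep that counter in an external store shared by all replicas.

```go
// A minimal sketch of why in-process state frustrates horizontal scaling.
// All names here are illustrative.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
)

var (
	mu    sync.Mutex
	count int // application state trapped inside a single replica
)

func hitHandler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	count++
	n := count
	mu.Unlock()

	// A horizontally scalable version would read and increment this value
	// in an external store (a database, Redis, etc.) shared by all replicas.
	fmt.Fprintf(w, "hits on this replica: %d\n", n)
}

func main() {
	http.HandleFunc("/hit", hitHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```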

The concepts of scalability, state, and redundancy will be discussed in much more depth in Chapter 7.

Loose Coupling

Loose coupling is a system property and design strategy in which a system’s components have minimal knowledge of any other components. Two components can be said to be loosely coupled when changes to one generally don’t require changes to the other.

For example, web servers and web browsers can be considered to be loosely coupled: servers can be updated or even completely replaced without affecting our browsers at all. This is possible because web servers and browsers agree to communicate using a set of standard protocols.7 In other words, they provide a service contract. Imagine the chaos if all the world’s web browsers had to be updated each time NGINX or httpd had a new version!8

It could be said that “loose coupling” is just a restatement of the whole point of microservice architectures: to partition components so that changes in one don’t necessarily affect another. This might even be true. However, this principle is often neglected, and it bears repeating. The benefits of loose coupling—and the consequences when it’s neglected—cannot be overstated. It’s very easy to create a “worst of all worlds” system that pairs the management and complexity overhead of having multiple services with the dependencies and entanglements of a monolithic system: the dreaded distributed monolith.

Unfortunately, there’s no magic technology or protocol that can keep your services from being tightly coupled. Any data exchange format can be misused. There are, however, several that help and that, when applied alongside practices like declarative APIs and careful versioning, can be used to create services that are both loosely coupled and modifiable.
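To illustrate the contract idea in Go terms, here’s a hedged sketch (all of the type and function names are invented for this example): the calling code depends only on a narrow interface, so the implementation behind it can be swapped without any changes to the consumer. The same principle applies across process boundaries, where the “contract” is a versioned API rather than a Go type.

```go
// A small sketch of loose coupling through a contract: greet knows only the
// KeyValueStore interface, never a concrete implementation. All names here
// are illustrative.
package main

import "fmt"

// KeyValueStore is the contract the rest of the program codes against.
type KeyValueStore interface {
	Get(key string) (string, error)
	Put(key, value string) error
}

// memoryStore is one interchangeable implementation of the contract.
type memoryStore struct{ data map[string]string }

func newMemoryStore() *memoryStore {
	return &memoryStore{data: make(map[string]string)}
}

func (m *memoryStore) Get(key string) (string, error) {
	v, ok := m.data[key]
	if !ok {
		return "", fmt.Errorf("no such key: %q", key)
	}
	return v, nil
}

func (m *memoryStore) Put(key, value string) error {
	m.data[key] = value
	return nil
}

// greet depends only on the interface; swapping memoryStore for a database-
// backed implementation requires no changes here.
func greet(store KeyValueStore, name string) (string, error) {
	if err := store.Put("name", name); err != nil {
		return "", err
	}
	v, err := store.Get("name")
	if err != nil {
		return "", err
	}
	return "Hello, " + v, nil
}

func main() {
	msg, _ := greet(newMemoryStore(), "Grace")
	fmt.Println(msg)
}
```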

These technologies and practices will be discussed and demonstrated in detail in Chapter 8.

Resilience

Resilience (roughly synonymous with fault tolerance) is a measure of how well a system withstands and recovers from errors and faults. A system can be considered resilient if it can continue operating correctly—possibly at a reduced level—rather than failing completely when some part of the system fails.

When we discuss resilience (and the other “cloud native attributes” as well, but especially resilience), we use the word “system” quite a lot. Depending on how it’s used, a system can refer to anything from a complex web of interconnected services (such as an entire distributed application), to a collection of closely related components (such as the replicas of a single function or service instance), to a single process running on a single machine. Every system is composed of several subsystems, which in turn are composed of sub-subsystems, which are themselves composed of sub-sub-subsystems. It’s turtles all the way down.

In the language of systems engineering, any system can contain defects, or faults, which we lovingly refer to as bugs in the software world. As we all know too well, under certain conditions, any fault can give rise to an error, which is the name we give to any discrepancy between a system’s intended behavior and its actual behavior. Errors have the potential to cause a system to fail to perform its required function: a failure. It doesn’t stop there, though: a failure in a subsystem or component becomes a fault in the larger system; any fault that isn’t properly contained has the potential to cascade upwards until it causes a total system failure.9

In an ideal world, every system would be carefully designed to prevent faults from ever occurring, but this is an unrealistic goal. You can’t prevent every possible fault, and it’s wasteful and unproductive to try. However, by assuming that all of a system’s components are certain to fail—which they are—and designing them to respond to potential faults and limit the effects of failures, you can produce a system that’s functionally healthy even when some of its components are not.

There are many ways of designing a system for resiliency. Deploying redundant components is perhaps the most common approach, but that also assumes that a fault won’t affect all components of the same type. Circuit breakers and retry logic can be included to prevent failures from propagating between components. Faulty components can even be reaped—or can intentionally fail—to benefit the larger system.
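As a small illustration of one of these techniques, here’s a hedged Go sketch of retry logic with exponential backoff (the function names and timings are invented for this example): a transient fault in an upstream dependency is absorbed by the caller instead of propagating as a failure.

```go
// A sketch of one resilience pattern: retrying a flaky call with exponential
// backoff. All names and values here are illustrative.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry calls fn up to attempts times, doubling the delay after each failure.
func retry(attempts int, initialDelay time.Duration, fn func() error) error {
	delay := initialDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff between attempts
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	// flakyCall stands in for a request to an unreliable upstream dependency.
	flakyCall := func() error {
		if rand.Intn(3) != 0 {
			return errors.New("temporary upstream error")
		}
		return nil
	}

	if err := retry(5, 100*time.Millisecond, flakyCall); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("succeeded")
}
```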

We’ll discuss all of these approaches (and more) in much more depth in Chapter 9.

Manageability

A system’s manageability is the ease (or lack thereof) with which its behavior can be modified to keep it secure, running smoothly, and compliant with changing requirements. A system can be considered manageable if it’s possible to sufficiently alter its behavior without having to alter its code.

As a system property, manageability gets a lot less attention than flashier attributes like scalability or observability. It’s every bit as critical, though, particularly in complex, distributed systems.

For example, imagine a hypothetical system that includes a service and a database, and that the service refers to the database by a URL. What if you needed to update that service to refer to another database? If the URL were hardcoded, you might have to update the code and redeploy, which, depending on the system, might be awkward for its own reasons. Of course, you could update the DNS record to point to the new location, but what if you needed to redeploy a development version of the service, with its own development database?

A manageable system might, for example, represent this value as an easily modified environment variable; if the service that uses it is deployed in Kubernetes, adjustments to its behavior might be a matter of updating a value in a ConfigMap. A more complex system might even provide a declarative API that a developer can use to tell the system what behavior she expects. There’s no single right answer.10
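Here’s a hedged sketch of that environment-variable approach (the DB_URL variable name and the default value are invented for this example): the database location is read from the environment at startup, so pointing a deployment at a different database becomes a configuration change rather than a code change. In Kubernetes, the same variable could be populated from a ConfigMap.

```go
// A sketch of configuration via environment variable: the database URL is
// supplied by the environment, with a development-friendly fallback.
package main

import (
	"fmt"
	"os"
)

func databaseURL() string {
	if url := os.Getenv("DB_URL"); url != "" {
		return url
	}
	// Fall back to a local development database when nothing is configured.
	return "postgres://localhost:5432/dev"
}

func main() {
	fmt.Println("connecting to", databaseURL())
}
```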

Manageability isn’t limited to configuration changes. It encompasses all possible dimensions of a system’s behavior, be it the ability to activate feature flags, or rotate credentials or TLS certificates, or even (and perhaps especially) deploy or upgrade (or downgrade) system components.

Manageable systems are designed for adaptability, and can be readily adjusted to accommodate changing functional, environmental, or security requirements. Unmanageable systems, on the other hand, tend to be far more brittle, frequently requiring ad hoc—often manual—changes. The overhead involved in managing such systems places fundamental limits on their scalability, availability, and reliability.

The concept of manageability—and some preferred practices for implementing it in Go—will be discussed in much more depth in Chapter 10.

Observability

The observability of a system is a measure of how well its internal states can be inferred from knowledge of its external outputs. A system can be considered observable when it’s possible to quickly and consistently ask novel questions about it with minimal prior knowledge, and without having to reinstrument or build new code.

On its face, this might sound simple enough: just sprinkle in some logging and slap up a couple of dashboards, and your system is observable, right? Almost certainly not. Not with modern, complex systems in which almost any problem is the manifestation of a web of multiple things going wrong simultaneously. The Age of the LAMP Stack is over; things are harder now.

This isn’t to say that metrics, logging, and tracing aren’t important. On the contrary: they represent the building blocks of observability. But their mere existence is not enough: data is not information. They need to be used the right way. They need to be rich. Together, they need to be able to answer questions that you’ve never even thought to ask before.
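As one small example of those building blocks, here’s a hedged sketch of structured logging using the standard library’s log/slog package (available since Go 1.21; the event and field names are invented for this example). Emitting machine-parseable events with rich fields, rather than free-form text, is what later lets you slice the data by attributes you didn’t anticipate needing.

```go
// A sketch of structured logging: events are emitted as JSON with named
// fields, so they can be filtered and aggregated later, not just grepped.
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("request handled",
		"method", "GET",
		"path", "/orders/42",
		"status", 200,
		"duration", 37*time.Millisecond,
	)
}
```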

The ability to detect and debug problems is a fundamental requirement for the maintenance and evolution of a robust system. But in a distributed system it’s often hard enough just figuring out where a problem is. Complex systems are just too…complex. The number of possible failure states for any given system is proportional to the product of the number of possible partial and complete failure states of each of its components, and it’s impossible to predict all of them. The traditional approach of focusing attention on the things we expect to fail simply isn’t enough.

Emerging practices in observability can be seen as the evolution of monitoring. Years of experience with designing, building, and maintaining complex systems have taught us that traditional methods of instrumentation—including but not limited to dashboards, unstructured logs, or alerting on various “known unknowns”—just aren’t up to the challenges presented by modern distributed systems.

Observability is a complex and subtle subject, but, fundamentally, it comes down to this: instrument your systems richly enough and under real enough scenarios so that, in the future, you can answer questions that you haven’t thought to ask yet.

The concept of observability—and some suggestions for implementing it—will be discussed in much more depth in Chapter 11.

Why Is Cloud Native a Thing?

The move towards “cloud native” is an example of architectural and technical adaptation, driven by environmental pressure and selection. It’s evolution—survival of the fittest. Bear with me here; I’m a biologist by training.

Eons ago, in the Dawn of Time,12 applications would be built and deployed (generally by hand) to one or a small number of servers, where they were carefully maintained and nurtured. If they got sick, they were lovingly nursed back to health. If a service went down, you could often fix it with a restart. Observability was shelling into a server to run top and review logs. It was a simpler time.

In 1997, only 11% of people in industrialized countries, and 2% worldwide, were regular internet users. The subsequent years saw exponential growth in internet access and adoption, however, and by 2017 that number had exploded to 81% in industrialized countries and 48% worldwide13—and continues to grow.

All of those users—and their money—applied stress to services, generating significant incentive to scale. What’s more, as user sophistication and dependency on web services grew, so did expectations that their favorite web applications would be both feature-rich and always available.

The result was, and is, a significant evolutionary pressure towards scale, complexity, and dependability. These three attributes don’t play well together, though, and the traditional approaches simply couldn’t, and can’t, keep up. New techniques and practices had to be invented.

Fortunately, the introduction of public clouds and IaaS made it relatively straightforward to scale infrastructure out. Shortcomings with dependability could often be compensated for with sheer numbers. But that introduced new problems. How do you maintain a hundred servers? A thousand? Ten thousand? How do you install your application onto them, or upgrade it? How do you debug it when it misbehaves? How do you even know it’s healthy? Problems that are merely annoying at small scale tend to become very hard at large scale.

Cloud native is a thing because scale is the cause of (and solution to) all our problems. It’s not magic. It’s not special. All fancy language aside, cloud native techniques and technologies exist for no other reason than to make it possible to leverage the benefits of a “cloud” (quantity) while compensating for its downsides (lack of dependability).

Summary

In this chapter, we talked a fair amount about the history of computing, and how what we now call “cloud native” isn’t a new phenomenon so much as the inevitable outcome of a virtuous cycle of technological demand driving innovation driving more demand.

Ultimately, though, all of those fancy words distill down to a single point: today’s applications have to dependably serve a lot of people. The techniques and technologies that we call “cloud native” represent the best current practices for building a service that’s scalable, adaptable, and resilient enough to do that.

But what does all of this have to do with Go? As it turns out, cloud native infrastructure requires cloud native tools. In Chapter 2, we’ll start to talk about what that means, exactly.

1 Surden, Esther. “Privacy Laws May Usher in Defensive DP: Hopper.” Computerworld, 26 Jan. 1976, p. 9.

2 Which is Go. Don’t get me wrong—this is still a Go book after all.

3 A “kludge” is “an awkward or inelegant solution.” It’s a fascinating word with a fascinating history.

4 Have you ever wondered why so many Kubernetes migrations fail?

5 Especially for me. I get to write this cool book.

6 Cloud Native Computing Foundation. “CNCF Cloud Native Definition v1.0,” GitHub, 7 Dec. 2020. https://oreil.ly/KJuTr.

7 Those of us who remember the Browser Wars of the 1990s will recall that this wasn’t always strictly true.

8 Or if every website required a different browser. That would stink, wouldn’t it?

9 If you’re interested in a complete academic treatment, I highly recommend Reliability and Availability Engineering by Kishor S. Trivedi and Andrea Bobbio.

10 There are some wrong ones though.

11 Plus, they both start with M. Super confusing.

12 That time was the 1990s.

13 International Telecommunication Union (ITU). “Internet users per 100 inhabitants 1997 to 2007” and “Internet users per 100 inhabitants 2005 to 2017.” ICT Data and Statistics (IDS).
