Chapter 1. Service Mesh 101

Linkerd is the first service mesh—in fact, it’s the project that coined the term “service mesh.” It was created in 2015 by Buoyant, Inc., as we’ll discuss more in Chapter 2, and for all that time it’s been focused on making it easier to produce and operate truly excellent cloud native software.

But what, exactly, is a service mesh? We can start with the definition from the CNCF Glossary:

In a microservices world, apps are broken down into multiple smaller services that communicate over a network. Just like your wifi network, computer networks are intrinsically unreliable, hackable, and often slow. Service meshes address this new set of challenges by managing traffic (i.e., communication) between services and adding reliability, observability, and security features uniformly across all services.

The cloud native world is all about computing at a huge range of scales, from tiny clusters running on your laptop for development up through the kind of massive infrastructure that Google and Amazon wrangle. This works best when applications use a microservices architecture, but microservices are inherently more fragile than a monolith.

Fundamentally, service meshes are about hiding that fragility from the application developer—and, indeed, from the application itself. They do this by taking several features that are critical when creating robust applications and moving them from the application into the infrastructure. This allows application developers to focus on what makes their applications unique, rather than having to spend all their time worrying about how to provide the critical functions that should be the same across all applications.

In this chapter, we’ll take a high-level look at what service meshes do, how they work, and why they’re important. In the process, we’ll provide the background you need for our more detailed discussions about Linkerd in the rest of the book.

Basic Mesh Functionality

The critical functions provided by service meshes fall into three broad categories: security, reliability, and observability. As we examine these three categories, we’ll be comparing the way they play out in a typical monolith and in a microservices application.

Of course, “monolith” can mean several different things. Figure 1-1 shows a diagram of the “typical” monolithic application that we’ll be considering.

Figure 1-1. A monolithic application

The monolith is a single process within the operating system, which means that it gets to take advantage of all the protection mechanisms offered by the operating system; other processes can’t see anything inside the monolith, and they definitely can’t modify anything inside it. Communications between different parts of the monolith are typically function calls within the monolith’s single memory space, so again there’s no opportunity for any other process to see or alter these communications. It’s true that one area of the monolith can alter the memory in use by other parts—in fact, this is a huge source of bugs!—but these are generally just errors, rather than attacks.

Multiple Processes Versus Multiple Machines

“But wait!” we hear you cry. “Any operating system worthy of the name can provide protections that do span more than one process! What about memory-mapped files or System V shared memory segments? What about the loopback interface and Unix domain sockets (to stretch the point a bit)?”

You’re right: these mechanisms can allow multiple processes to cooperate and share information while still being protected by the operating system. However, they must be explicitly coded into the application, and they only function on a single machine. Part of the power of cloud native orchestration systems like Kubernetes is that they’re allowed to schedule Pods on any machine in your cluster, and you won’t know which machine ahead of time. This is tremendously flexible, but it also means that mechanisms that assume everything is on a single machine simply won’t work in the cloud native world.

In contrast, Figure 1-2 shows the corresponding microservices application.

Figure 1-2. A microservices application

With microservices, things are different. Each microservice is a separate process, and microservices communicate only over the network—but the protection mechanisms provided by the operating system function only inside a process. These mechanisms aren’t enough in a world where any information shared between microservices has to travel over the network.

This reliance on communications over the unreliable, insecure network raises a lot of concerns when developing microservices applications.

Security

Let’s start with the fact that the network is inherently insecure. This gives rise to a number of possible issues, some of which are shown in Figure 1-3.

Figure 1-3. Communication is a risky business

Some of the most significant security issues are eavesdropping, tampering, identity theft, and overreach:

Eavesdropping

Evildoers may be able to intercept communications between two microservices, reading data not intended for them. Depending on what exactly an evildoer learns, this could be a minor annoyance or a major disaster.

The typical protection against eavesdropping is encryption, which scrambles the data so that only the intended recipient can understand it.

Tampering

An evildoer might also be able to modify the data in transit over the network. At its crudest, a tampering attack simply corrupts the data in transit; at its most subtle, it modifies the data to the attacker’s advantage.

It’s extremely important to understand that encryption alone will not protect against tampering! The proper protection is a cryptographic integrity check, such as a message authentication code; all well-designed cryptosystems include these integrity checks as part of their protocols.
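
To make that distinction concrete, here’s a tiny Python sketch of an integrity check using a keyed hash (HMAC). It’s purely illustrative (this isn’t how Linkerd implements anything, and real key management is far more involved), but it shows how a recipient can detect tampering whether or not the message is also encrypted:

    import hashlib
    import hmac

    # Illustrative only: in practice the key comes from a real key exchange (e.g., TLS).
    SHARED_KEY = b"example-key-for-illustration"

    def sign(message: bytes) -> bytes:
        # Compute a MAC over the message; without the key, a valid tag can't be forged.
        return hmac.new(SHARED_KEY, message, hashlib.sha256).digest()

    def verify(message: bytes, tag: bytes) -> bool:
        # Recompute the MAC and compare in constant time.
        return hmac.compare_digest(sign(message), tag)

    message = b"amount=19.99"
    tag = sign(message)

    print(verify(message, tag))            # True: message arrived intact
    print(verify(b"amount=9999.99", tag))  # False: tampering is detected

In a well-designed protocol like the TLS that Linkerd uses for its mTLS, this kind of integrity check is built into the protocol itself, so you get it without writing any of this yourself.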

Identity theft

When you hand off credit card details to your payment microservice, how do you know for certain that you’re really talking to your payment microservice? If an evildoer can successfully pretend to be one of your microservices, that opens the door to all manner of troublesome possibilities.

Strong authentication is critical to protect against this type of attack. It’s the only way to be sure that the microservice you’re talking to is really the one you think it is.

Overreach

On the flip side of identity theft, an evildoer may be able to take advantage of a microservice that’s allowed to do things it simply shouldn’t be able to do. Imagine, for example, an evildoer finding that the payment microservice is perfectly happy to accept requests from the microservice that should merely be listing things for sale.

Careful attention to authorization is the key here. In a perfect world, every microservice will be able to do exactly what it needs, and no more (the principle of least privilege).

Reliability

Reliability in the monolith world typically refers to how well the monolith functions: when the different parts of the monolith communicate via function calls, you don’t have to worry about a call getting lost or about one of your functions suddenly becoming unresponsive! But, as shown in Figure 1-4, unreliable communications are actually the norm with microservices.

Figure 1-4. Unreliable communications are the norm

There are quite a few ways microservices can be unreliable, including:

Request failure

Sometimes requests made over the network fail. There may be any number of possible reasons, ranging from a crashed microservice to a network overload or partition. Either the application or the infrastructure needs to do something to deal with the request that failed.

In the simplest case, the mesh can simply manage retries for the application: if the call fails because the called service dies or times out, just resend the request. This won’t always work, of course: not all requests are safe to retry, and not every failure is transient. But in many cases, simple retry logic can be used to great effect.
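
To make the idea concrete, here’s a minimal retry sketch in Python. It’s only an illustration of the pattern, not Linkerd’s implementation (in a mesh, the proxy handles this and you configure it as policy rather than writing code); the URL and retry budget here are invented for the example:

    import time
    import urllib.request

    def get_with_retries(url: str, attempts: int = 3, backoff_s: float = 0.1) -> bytes:
        # Retry an idempotent GET a few times before giving up.
        for attempt in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=2.0) as response:
                    return response.read()
            except OSError:  # connection errors and timeouts
                if attempt == attempts:
                    raise  # out of retries: surface the failure to the caller
                time.sleep(backoff_s * attempt)  # brief, growing pause before retrying

    # Hypothetical in-cluster address, purely for illustration:
    # body = get_with_retries("http://payments.example.svc.cluster.local/status")

The caveat from the text applies here too: retries like this are only safe for requests that can be repeated without side effects.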

Service failure

A special case of request failures comes up when it isn’t just a single instance of a microservice that crashes, but all instances. Maybe a bad version was deployed, or maybe an entire cluster crashed. In these cases the mesh can help by failing over to a backup cluster or to a known-good implementation of the service.

Again, this can’t always happen without application help (failover of stateful services can be quite complex, for example). But microservices are often designed to manage without state, in which case mesh failover can be a huge help.

Service overload

Another special case: sometimes the failure happens because too many requests are piling onto the same service. In these cases, circuit breaking can help avoid a cascade failure: if the mesh fails some requests quickly, before dependent services get involved and cause further trouble, it can limit the damage. This is a fairly drastic technique, but this kind of enforced load shedding can dramatically increase the overall reliability of the application.
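
Here’s an equally simplified sketch of the circuit breaker idea in Python, with invented thresholds; again, in a mesh this logic lives in the proxy and is configured rather than coded. After enough consecutive failures the breaker “opens” and fails requests immediately, then allows a trial request once a cool-down has passed:

    import time

    class CircuitBreaker:
        # Toy circuit breaker: fail fast after repeated errors, try again after a cool-down.
        def __init__(self, max_failures: int = 5, reset_after_s: float = 10.0):
            self.max_failures = max_failures
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed (requests flow)

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after_s:
                    raise RuntimeError("circuit open: failing fast to shed load")
                self.opened_at = None  # cool-down elapsed: allow a trial request
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # too many failures: open the circuit
                raise
            self.failures = 0  # any success closes the circuit again
            return result

Failing fast like this feels counterintuitive, but it gives the overloaded service room to recover instead of burying it under retries.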

Observability

It’s difficult to see what’s going on in any computing application: even a slow machine, these days, operates on time scales a billion times faster than the one we humans live by! Within a monolith, observability is often handled by internal logging or dashboards that collect global metrics from many different areas of the monolith. This is much less feasible with a microservices architecture, as we see in Figure 1-5—and even if it were feasible, it wouldn’t tell the whole story.

Figure 1-5. It’s hard to work in the dark

In the microservices world, “observability” tends to focus more on the call graph and the golden metrics:

The call graph

When looking at a microservices application, the first thing you usually need to know is which services are calling which other services. This is the call graph, shown in Figure 1-6, and one of the most important things a service mesh can do is provide metrics about how much traffic is going over each edge of the graph, how much is succeeding, how much is failing, and so on.

Figure 1-6. The call graph of an application

The call graph is a critical starting point because the problems a user sees from outside the cluster may actually be caused by a single service buried deep in the graph. Visibility into the whole graph is essential for tracking those problems down (the sketch after Figure 1-7 illustrates the idea).

It’s also worth noting that, in specific situations, particular paths through the graph will be relevant, as shown in Figure 1-7. For example, different requests from the user may use different paths in the graph, exercising different aspects of the workloads.

Figure 1-7. Different paths through the call graph
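
As a concrete (and entirely invented) illustration of why per-edge metrics matter, here is a tiny Python sketch; the service names and numbers are made up. The user only ever talks to the frontend, but the edge with the real problem is two hops deeper in the graph:

    # Hypothetical success rates for each edge of a small call graph (numbers invented).
    edge_success_rate = {
        ("ingress", "frontend"): 0.97,
        ("frontend", "catalog"): 0.99,
        ("frontend", "cart"): 0.98,
        ("cart", "payments"): 0.62,  # the culprit, invisible from outside the cluster
    }

    # Per-edge metrics let you ask exactly this kind of question.
    (caller, callee), rate = min(edge_success_rate.items(), key=lambda item: item[1])
    print(f"Worst edge: {caller} -> {callee} ({rate:.0%} success)")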

The golden metrics

There are a great many metrics that we could collect for every microservice. Over time, three of them have repeatedly proven especially useful in a wide variety of situations, so much so that we now refer to them as the “golden metrics” (as shown in Figure 1-8):

Latency

How long are requests taking to complete? This is typically reported as an amount of time for a certain percentage of requests to complete. For example, P95 latency indicates the time in which 95% of requests complete, so you can interpret “5 ms P95” to mean that 95% of requests complete in 5 ms or less.

Traffic

How many requests is a given service handling? This is typically reported as requests per second, or RPS.

Success rate

How many requests are succeeding? (This can also be reported as its inverse, the error rate.) This is typically reported as a percentage of total requests, with “success rate” often abbreviated as SR.
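
To make the definitions concrete, here is a small Python sketch that computes all three from a handful of hypothetical request records; the numbers and the 60-second window are invented:

    import math

    # Hypothetical requests observed over a 60-second window: (latency in ms, succeeded?).
    WINDOW_S = 60.0
    requests = [(3.2, True), (4.8, True), (12.1, False), (2.9, True), (6.4, True)]

    def p95_latency_ms(latencies: list) -> float:
        # Nearest-rank percentile: the value at or below which 95% of observations fall.
        ordered = sorted(latencies)
        return ordered[math.ceil(0.95 * len(ordered)) - 1]

    latencies = [latency for latency, _ in requests]
    traffic_rps = len(requests) / WINDOW_S
    success_rate = sum(1 for _, ok in requests if ok) / len(requests)

    print(f"P95 latency:  {p95_latency_ms(latencies):.1f} ms")
    print(f"Traffic:      {traffic_rps:.2f} RPS")
    print(f"Success rate: {success_rate:.0%}")

In practice you never compute these by hand, of course: the mesh records them for every meshed workload, which is exactly the point.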

Figure 1-8. The three golden metrics

The Original “Golden Signals”

These were originally described in Google’s “Monitoring Distributed Systems” post as the four “golden signals”: latency, request rate, error rate, and saturation. We prefer “golden metrics” because metrics are things you can directly measure; you derive signals (like “saturation”) from metrics.

We’ll discuss these in much greater detail in Chapter 10, but it’s worth noting at this point that these metrics have proven so useful that many meshes devote considerable effort to recording them—and that the service mesh is an ideal place to track them.

How Do Meshes Actually Work?

Finally, let’s take a quick look at how service meshes actually function.

At a high level, all meshes are fundamentally doing the same job: they insert themselves into the operating system’s network stack, take over the low-level networking that the application is using, and mediate everything the application does on the network. This is the only practical way for the mesh to provide all of its functionality without requiring changes to the application itself.

Most meshes—including Linkerd—use the sidecar model of injecting a proxy container next to every application container (see Figure 1-9).1 When the Pod starts, the mesh sets up the Pod’s network rules so that all traffic into and out of the application container goes through the proxy. This allows the proxy to control everything necessary for the functionality of the mesh.

Figure 1-9. Linkerd and the sidecar model
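
To give a feel for what “mediating traffic” means, here is a deliberately tiny TCP forwarding proxy in Python. It is nothing like Linkerd’s actual proxy (which is purpose-built in Rust and does vastly more), and the addresses and ports are invented, but it shows the core idea: a process sitting on the data path can observe, count, and in principle modify every byte without the application changing at all:

    import asyncio

    UPSTREAM = ("127.0.0.1", 8080)  # hypothetical application port, for illustration only

    async def pump(reader, writer, counters, direction):
        # Copy bytes between the two sides, counting as we go. This is the point where a
        # real mesh proxy would add TLS, retries, metrics, and policy checks.
        while data := await reader.read(4096):
            counters[direction] += len(data)
            writer.write(data)
            await writer.drain()
        writer.close()

    async def handle(client_reader, client_writer):
        counters = {"in": 0, "out": 0}
        upstream_reader, upstream_writer = await asyncio.open_connection(*UPSTREAM)
        await asyncio.gather(
            pump(client_reader, upstream_writer, counters, "in"),
            pump(upstream_reader, client_writer, counters, "out"),
        )
        print(f"connection closed: {counters['in']} bytes in, {counters['out']} bytes out")

    async def main():
        # Listen where clients expect the application to be, and forward everything upstream.
        server = await asyncio.start_server(handle, "0.0.0.0", 9090)
        async with server:
            await server.serve_forever()

    # asyncio.run(main())

Linkerd’s real proxy can do this transparently because the Pod’s network rules send the application’s traffic through it, so neither the client nor the server needs to know the proxy is there.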

There are other models, but the sidecar model has tremendous advantages in terms of operational simplicity and security:

  • From the perspective of basically everything else in the system, the sidecar acts like it is part of the application. In particular, this means that all the things that the operating system does to guarantee the safety of the application just work for the sidecar, too. This is a very, very important characteristic: limiting the sidecar to exactly one security context sharply limits the attack surface of the sidecar and makes it much easier to reason about whether the things the sidecar is doing are safe.

  • In much the same way, managing the sidecar is exactly the same as managing any other application or service. For example, kubectl rollout restart will just work to restart an application Pod and its sidecar as a unit.

There are disadvantages too, of course. The biggest is that every application Pod needs a sidecar container—even if your application has thousands of Pods. Another common concern is around latency: the sidecar, by definition, requires some time to process network traffic. Again, we’ll talk more about this later, but it’s worth noting up front that Linkerd goes to a lot of trouble to minimize the sidecar’s impact, and in practice Linkerd is very fast and very lightweight.

So Why Do We Need This?

Put bluntly, the functionality provided by the mesh is not optional. You’re never going to hear the engineering team say “oh, we don’t need security” or “oh, reliability isn’t important” (though you might have to convince people of the need for observability—hopefully this book will help!).

In other words, the choice isn’t between having these three features or not: it’s between having them provided by the mesh or needing to provide them in the application.

Providing them in the application is costly. Your developers could write them by hand, but this means a lot of fiddly application code replicated in every microservice, which is very easy to get wrong (especially since the temptation will always be to have senior developers focus on the crown jewels of logic specific to your business, rather than the dreary, less visible, but equally critical work of getting retries right). You may also run into incompatibilities between parts of the application, especially as the application grows.

Alternatively, you could find libraries that implement the functionality for you, which definitely saves development time. On the other hand, you still end up with each and every one of your developers needing to learn how to use those libraries, you’re limited to languages and runtimes for which you can find the libraries, and incompatibilities are still a serious issue (suppose one microservice upgrades the library before another one does).

Over time, it’s become pretty clear to us that pushing all this functionality into the mesh, where the application developers don’t even necessarily need to know that it exists, is the smart way to provide it—and we think that Linkerd is the best of the meshes out there. If we haven’t convinced you, too, by the end of the book, please reach out and let us know where we fell short!

Summary

In summary, service meshes are platform-level infrastructure that provide security, reliability, and observability uniformly across an entire application, without requiring changes to the application itself. Linkerd was the first service mesh, and we think it’s still the one with the best balance of power, speed, and operational simplicity.

1 The name comes from the analogy of bolting a sidecar onto a motorcycle.
