Chapter 1. Service Mesh 101
To get started on your service mesh journey, you need to know three things: what a service mesh is, how it works, and why you should use it (and when you should not).
There is no universally accepted definition for a service mesh, but I define it as follows:
A service mesh is an infrastructure layer that enables you to control the network communication of your workloads from a single control plane.
We can break that definition down into parts to better understand it:
By infrastructure layer, I mean that a service mesh is not part of your services; it is deployed and operated independently. Because it affects every service without being aware of any service-specific business logic, it is considered infrastructure or middleware.
Figure 1-1 shows a typical software stack. Services and applications run on top of infrastructure. The service mesh sits in the first infrastructure layer, alongside storage, metrics, and other higher-level infrastructure requirements. Below that are VMs, Kubernetes, or whichever compute provider or orchestrator everything runs on. At the bottom is the actual hardware (bare metal).
By control the network communication of your workloads, I mean that a service mesh controls the traffic entering and leaving a microservice, database, or anything else that does network communication. For example, a service mesh might disallow incoming traffic based on a rule (such as it’s missing a required header), or it might encrypt outgoing traffic. A service mesh has complete control over all traffic entering and leaving the services.
Finally, by from a single control plane, I mean a single location from which service mesh operators can interact with the service mesh. Suppose operators want to change the configuration for multiple services. In that case, they don’t need to reconfigure a dozen subsystems or modify the services themselves; instead, they configure the service mesh once, and it handles propagating out all changes.
Hopefully, this definition gives you some idea of what a service mesh is, but I often find that I need to understand how something actually works before I fully grasp what it is.
How a Service Mesh Works
A service mesh is made up of sidecar proxies and the control plane.
Sidecar Proxies
A proxy is an application that traffic is routed through on the way to its destination. Popular proxies you may have heard of are NGINX, HAProxy, and Envoy. In most service meshes, all service traffic (inbound and outbound) is routed through a local proxy dedicated to each service instance.1
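If you haven’t worked with a proxy directly, a small sketch may help. The following Go program is a toy for illustration, not a real mesh sidecar: it listens on port 8080 and forwards every request to a hypothetical service on port 9090. Both addresses are assumptions, not defaults of any mesh:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Hypothetical address of the service this proxy sits in front of.
        destination, err := url.Parse("http://127.0.0.1:9090")
        if err != nil {
            log.Fatal(err)
        }
        // Every request sent to :8080 passes through this process on its
        // way to the destination, so the proxy can observe, block, or
        // modify all traffic.
        proxy := httputil.NewSingleHostReverseProxy(destination)
        log.Fatal(http.ListenAndServe(":8080", proxy))
    }

Because every request passes through the proxy process, it is the natural place to enforce rules on traffic, which is exactly the job a sidecar proxy performs.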
Figure 1-2 shows what a service mesh looks like with two service instances: frontend and backend. When frontend calls backend, frontend’s local proxy captures the outgoing request. frontend’s proxy then forwards the request to the backend service. When the request reaches the backend service, again, it is captured by backend’s local proxy and inspected. If the request is allowed, backend’s proxy forwards it to the actual backend service.
Each instance of a service must be deployed with its own local proxy.2 The pattern of deploying a helper application—in this case a proxy—alongside the main service is known as the sidecar pattern, and so local proxies are referred to as sidecar proxies.
Sidecar proxies are a vital component of the service mesh because they enable control of service traffic without modifying or redeploying the underlying services. Since sidecar proxies run as separate processes from the services, they can be reconfigured without affecting the services. For example, the backend service’s sidecar proxy from Figure 1-2 could be reconfigured to refuse traffic from the frontend service without changing code or redeploying the backend service itself.
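As a rough illustration of such a rule, here is a sketch in Go of a sidecar that refuses traffic from frontend. The X-Service-Name header and the addresses are invented for this example; real meshes identify callers cryptographically rather than by a plain header:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync/atomic"
    )

    // deniedSource holds the name of the caller to refuse. Swapping this
    // value is the "reconfiguration": no restart of the proxy or the
    // backend service is needed.
    var deniedSource atomic.Value

    func withPolicy(proxy http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            denied, _ := deniedSource.Load().(string)
            if denied != "" && r.Header.Get("X-Service-Name") == denied {
                http.Error(w, "denied by mesh policy", http.StatusForbidden)
                return
            }
            proxy.ServeHTTP(w, r)
        })
    }

    func main() {
        // Hypothetical address of the local backend instance.
        upstream, err := url.Parse("http://127.0.0.1:9090")
        if err != nil {
            log.Fatal(err)
        }
        deniedSource.Store("frontend") // a rule pushed from outside
        proxy := httputil.NewSingleHostReverseProxy(upstream)
        log.Fatal(http.ListenAndServe(":8080", withPolicy(proxy)))
    }

Storing a new value in deniedSource is all the reconfiguration amounts to; the backend service itself is untouched.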
What handles reconfiguring proxies? The control plane.
Control Plane
The control plane’s job is to manage and configure the sidecar proxies. As you can see in Figure 1-3, the control plane is a separate service that must be deployed on its own; it is not deployed as a sidecar. The control plane is where most of the complex logic of the service mesh lives: it must watch for services starting and stopping, sign and distribute certificates, reconfigure proxies, etc. The sidecar proxies themselves are relatively simple: they receive configuration from the control plane detailing which actions to perform on traffic, and they perform those actions.
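The following Go sketch models that division of labor. The ProxyConfig type and its fields are invented for illustration; a real mesh distributes far richer configuration over protocols such as Envoy’s xDS:

    package main

    import "fmt"

    // ProxyConfig stands in for the instructions a control plane
    // distributes; the fields are invented for illustration.
    type ProxyConfig struct {
        Upstream    string   // where the sidecar should forward traffic
        DeniedPeers []string // callers the sidecar should reject
        EmitMetrics bool     // whether the sidecar should report metrics
    }

    func main() {
        updates := make(chan ProxyConfig)

        // The control plane does the complex work: watching the mesh and
        // computing new configurations as conditions change.
        go func() {
            updates <- ProxyConfig{Upstream: "127.0.0.1:9090"}
            updates <- ProxyConfig{Upstream: "127.0.0.1:9090", EmitMetrics: true}
            close(updates)
        }()

        // The sidecar proxy stays simple: it applies whatever arrives.
        for cfg := range updates {
            fmt.Printf("proxy reconfigured: %+v\n", cfg)
        }
    }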
If we go back to my definition of a service mesh—“an infrastructure layer that enables you to control the network communication of your workloads from a single control plane”—you can now see how the proxies and control plane fit in.
The control plane is the single location service mesh operators interact with. In turn, it configures the proxies that control network communication. Together, the control plane and proxies make up the infrastructure layer.
Concrete Example
Let’s go through a concrete example to show how a service mesh works in practice. Figure 1-4 shows the architecture for this example.
When frontend calls backend, frontend’s sidecar proxy captures the request. In Figure 1-5, the service mesh has configured frontend’s proxy to pass traffic through to the backend service without modification. The sidecar proxy running alongside backend captures the incoming traffic and forwards the request to the actual backend service instance. The backend service instance processes the request and sends a response that returns along the same path.
Now imagine that you’ve got a new requirement to get metrics on how many requests per second frontend is making to backend. You could make changes to the code of frontend and backend to emit these metrics, but with the service mesh in place, there’s a simpler way, as shown in Figure 1-6. First, you configure the control plane with the URL of your metrics database (step 1). Immediately, the control plane reconfigures both sidecar proxies and instructs them to emit metrics (step 2). Now when frontend calls backend (step 3), each proxy emits metrics to the metrics database (step 4), and you can see the requests per second in your metrics dashboard.
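Here is a hedged sketch of what step 2 could look like inside a sidecar once metrics are enabled: the proxy tallies requests itself and reports requests per second. The metrics URL is hypothetical, and the fmt.Printf stands in for a real metrics client:

    package main

    import (
        "fmt"
        "net/http"
        "sync/atomic"
        "time"
    )

    var requestCount int64

    // countRequests wraps the proxy's handler so every request is tallied.
    func countRequests(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            atomic.AddInt64(&requestCount, 1)
            next.ServeHTTP(w, r)
        })
    }

    // reportMetrics sends requests-per-second to the configured metrics
    // database once per second; printing stands in for a real client.
    func reportMetrics(metricsURL string) {
        for range time.Tick(time.Second) {
            n := atomic.SwapInt64(&requestCount, 0)
            fmt.Printf("report to %s: requests_per_second=%d\n", metricsURL, n)
        }
    }

    func main() {
        go reportMetrics("http://metrics.example.com/write") // hypothetical URL
        backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from backend")
        })
        http.ListenAndServe(":8080", countRequests(backend))
    }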
Notice that you didn’t have to change the code of either service, nor did you need to redeploy anything. With a single configuration change, you immediately got metrics for the frontend and backend services.
This concrete example should help you understand how a service mesh works in practice, but it is a simplified picture. In a typical service mesh deployment, the control plane manages hundreds of services and workloads, so the architecture looks more like Figure 1-7. It looks like a mesh, hence the name!
With a larger mesh, you can see how being able to control networking across all of these services from a single location without redeploying any services or changing code is incredibly powerful. That brings us to why you would use a service mesh.
Why Use a Service Mesh
A service mesh provides features in four areas: security, observability, reliability, and traffic control. The fundamental value proposition of a service mesh is the ability to provide these features across every service and workload without modifying service code.
In the following sections, I will expand upon these areas, but it’s important to note that the features a service mesh provides can also be implemented in service code! If these features can be implemented in service code, why deploy a service mesh at all? The answer is that past a certain scale, recoding every service costs more in engineering time than running a service mesh does. This will be addressed more fully in “When to Use a Service Mesh”.
Security
One of the primary reasons companies deploy service meshes is to secure their networks. Typically this means encrypting traffic between all workloads and implementing authentication and authorization.
Solving this problem can be very difficult in a microservices architecture without a service mesh. Requiring every request to be encrypted means provisioning Transport Layer Security (TLS) certificates to every service in a secure way and managing your own certificate signing infrastructure. Authenticating and authorizing every request means updating and maintaining authentication code in every service.
A service mesh makes this work much easier because it can issue certificates and configure sidecar proxies to encrypt traffic and perform authorization—all without any changes to the underlying services (see Figure 1-8).
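To make the proxies’ role concrete, the following Go sketch shows the server side of mutual TLS: present a certificate, and refuse any client whose certificate wasn’t signed by the mesh’s certificate authority. The file names are hypothetical; in a real mesh, the control plane issues and rotates these certificates automatically:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // The mesh's certificate authority; the path is hypothetical.
        caPEM, err := os.ReadFile("mesh-ca.pem")
        if err != nil {
            log.Fatal(err)
        }
        caPool := x509.NewCertPool()
        caPool.AppendCertsFromPEM(caPEM)

        server := &http.Server{
            Addr: ":8443",
            TLSConfig: &tls.Config{
                ClientCAs: caPool,
                // Refuse any caller that cannot present a certificate
                // signed by the mesh CA: this is the core of mutual TLS.
                ClientAuth: tls.RequireAndVerifyClientCert,
            },
        }
        // These certificate files would be issued by the control plane.
        log.Fatal(server.ListenAndServeTLS("backend-cert.pem", "backend-key.pem"))
    }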
Observability
Observability is the ability to understand what’s happening to your services while they’re running. Observability data is essential for understanding microservices architectures and diagnosing failure, but it can be challenging to configure all your services to emit metrics and other data in a unified way.
Capturing observability data is the perfect job for a service mesh because all requests flow through its proxies. The service mesh can configure its proxies to emit metrics across all your services in a consistent format without modifying or redeploying the underlying services.
Reliability
In distributed systems, there’s often something failing. Building reliable distributed systems means reducing failure where possible and handling failure gracefully when it inevitably happens.
Reducing failure might mean implementing health checking so that traffic is only sent to healthy services. Handling failure might mean retrying requests that failed (see Figure 1-9) or implementing a timeout so that a service doesn’t wait forever for a response.
Implementing these techniques in code is time-consuming, error-prone, and difficult to do in a consistent way across all your services. With a service mesh, the proxies can perform these techniques for any of your services—all you need to do is interact with the control plane. You can also adjust the settings in real time as service loads change.
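As an illustration, here is a sketch in Go of retries with a per-attempt timeout, the kind of logic a sidecar applies on a service’s behalf. The attempt count, timeout, and URL are illustrative values, not the defaults of any particular mesh:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // callWithRetries gives each attempt its own timeout and retries on
    // failure, so a transient error never reaches the calling service.
    func callWithRetries(url string, attempts int, timeout time.Duration) (*http.Response, error) {
        client := &http.Client{Timeout: timeout}
        var lastErr error
        for i := 0; i < attempts; i++ {
            resp, err := client.Get(url)
            if err == nil && resp.StatusCode < 500 {
                return resp, nil
            }
            if err == nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("upstream returned %s", resp.Status)
            } else {
                lastErr = err
            }
            time.Sleep(100 * time.Millisecond) // brief pause between attempts
        }
        return nil, fmt.Errorf("all %d attempts failed: %v", attempts, lastErr)
    }

    func main() {
        resp, err := callWithRetries("http://127.0.0.1:9090/", 3, 2*time.Second)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }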
Traffic Control
Traffic control is about controlling where traffic between services is routed. Traffic control solves many problems:
- Implementing deployment strategies such as canary deployments, in which a small amount of “canary” traffic is routed to the new version of a service to see if it’s working before fully rolling out the new version (a routing sketch follows this list)
- Monolith-to-microservices migrations, in which services are split off from the monolith and traffic previously routed to the monolith is seamlessly rerouted to the new microservices
- Multi-cluster failover, in which traffic is routed to services in other healthy clusters if the local cluster is down
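To illustrate the first of these, here is a sketch of weighted canary routing: the proxy picks an upstream per request so that roughly 5% of traffic reaches the new version. The upstream names and the percentage are assumptions for this example:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // chooseUpstream sends canaryPercent of requests to the new version
    // and the rest to the stable version.
    func chooseUpstream(canaryPercent int) string {
        if rand.Intn(100) < canaryPercent {
            return "backend-v2:9090" // canary
        }
        return "backend-v1:9090" // stable
    }

    func main() {
        counts := map[string]int{}
        for i := 0; i < 1000; i++ {
            counts[chooseUpstream(5)]++
        }
        fmt.Println(counts) // roughly 95% v1, 5% v2
    }

In a real mesh, an operator sets this weight through the control plane and raises it gradually as confidence in the new version grows.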
Features in Combination
Now you should understand the types of features a service mesh provides around security, observability, reliability, and traffic control. Alone, these features are helpful, but they are even more powerful in combination.
For example, observability data provided by the service mesh can be combined with reliability and traffic control features. If the mesh detects that a service instance is returning errors, it can redirect traffic to healthy instances or a whole different cluster. Or mesh security features can be combined with the observability features to detect when a service is attempting to make requests it’s not authorized to make—potentially indicating a security breach. When you deploy Consul yourself, you’ll see many use cases where you can combine the service mesh features.
If your organization needs these features, you must decide whether it’s worth the additional complexity of a service mesh, or if you should implement them in service code. The key to answering that question is examining your scale.
When to Use a Service Mesh
There is no doubt that deploying a service mesh adds complexity. You now have sidecar proxies and the service mesh control plane to manage. In addition, you will need more compute resources (CPU and memory) to run the proxies and control plane, and all traffic now takes an extra hop through the local sidecar proxies, which adds latency. Implementing service mesh features in code would save resources and reduce infrastructure complexity (although it would add code complexity). For a service mesh to be worth it, it must provide a lot of value to your organization.
A simple formula for when to use a service mesh: you (a) need to solve networking problems in the areas outlined previously (security, observability, reliability, and traffic control), and (b) are at a scale, or will soon be at a scale, where it’s too costly to solve those problems in service code.
For example, say your organization is moving to what’s known as a zero trust security architecture where all internal traffic is encrypted, authenticated, and authorized. If you’re only running two microservices, you can easily recode those services. However, if you’re running 400 microservices, then it’s unlikely that you’ll be able to recode all those services in a reasonable amount of time. In this case, a service mesh makes a lot of sense.
In addition, at a certain scale, there will be services and workloads that you want control over but whose code you can’t actually edit. For example, maybe you’re deploying packaged open source software, or perhaps you’re using a cloud-managed database. Ideally, you would have the same control over those workloads that you have over your other services.
In the end, the exact scale at which it makes sense to use a service mesh will depend on your specific organization and the problems you’re trying to solve. I hope that this book will help you understand the problems a service mesh solves and help you gauge whether it makes sense in your situation.
Summary
In this chapter, you learned what a service mesh is, how it works, and why you’d use one.
I introduced my definition of a service mesh:
An infrastructure layer that enables you to control the network communication of your workloads from a single control plane.
And I discussed how the two components of a service mesh, the proxies and the control plane, enable the control of network communication. You walked through a concrete example of a working service mesh, and I discussed the four categories of service mesh features: security, observability, reliability, and traffic control.
Finally, I addressed when you should use a service mesh: when you need these features and you’re at a scale where it’s too costly to implement them in service code.
So far, everything discussed in this chapter has been applicable to most service meshes and not specific to Consul. The next chapter is devoted to Consul in particular. You’ll learn about how it works, its architecture, the protocols it uses, and what makes it unique.
1 Some meshes use other technology such as iptables or eBPF to control traffic rather than a separate proxy process.
2 If it’s impossible to deploy a local proxy—for example, with a managed service such as Amazon Relational Database Service—you can use a terminating gateway as covered in Chapter 10.