Chapter 1. Service Mesh 101
To get started on your service mesh journey, you need to know three things: what a service mesh is, how it works, and why you should use it (and when you should not).
There is no universally accepted definition for a service mesh, but I define it as follows:
A service mesh is an infrastructure layer that enables you to control the network communication of your workloads from a single control plane.
We can break that definition down into parts to better understand it:
By infrastructure layer, I mean that a service mesh is not part of your services; it is deployed and operated independently. Because it affects every service without being aware of any service-specific business logic, it is considered infrastructure or middleware.
Figure 1-1 shows a typical software stack. Services and applications run on top of infrastructure. The service mesh sits in the first infrastructure layer, alongside storage, metrics, and other higher-level infrastructure requirements. Below that are VMs, Kubernetes, or whichever compute provider or orchestrator everything runs on. At the bottom is the actual hardware (bare metal).
By control the network communication of your workloads, I mean that a service mesh controls the traffic entering and leaving a microservice, database, or anything else that does network communication. For example, a service mesh might disallow incoming traffic based on a rule (such as it’s missing a required header), or it might encrypt outgoing traffic. A service mesh has complete control over all traffic entering and leaving the services.
Finally, by from a single control plane, I mean a single location from which service mesh operators can interact with the service mesh. Suppose operators want to change the configuration for multiple services. In that case, they don’t need to reconfigure a dozen subsystems or modify the services themselves; instead, they configure the service mesh once, and it handles propagating out all changes.
Hopefully, this definition gives you some idea of what a service mesh is, but I often find that I need to understand how something actually works before I fully grasp what it is.
How a Service Mesh Works
A service mesh is made up of sidecar proxies and the control plane.
Sidecar Proxies
A proxy is an application that traffic is routed through on the way to its destination. Popular proxies you may have heard of are NGINX, HAProxy, and Envoy. In most service meshes, all service traffic (inbound and outbound) is routed through a local proxy dedicated to each service instance.1
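If you haven’t worked with a proxy directly, a small sketch may help. The following Go program is a toy for illustration, not a real mesh sidecar: it listens on port 8080 and forwards every request to a hypothetical service on port 9090. Both addresses are assumptions, not defaults of any mesh:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Hypothetical address of the service this proxy sits in front of.
        destination, err := url.Parse("http://127.0.0.1:9090")
        if err != nil {
            log.Fatal(err)
        }
        // Every request sent to :8080 passes through this process on its
        // way to the destination, so the proxy can observe, block, or
        // modify all traffic.
        proxy := httputil.NewSingleHostReverseProxy(destination)
        log.Fatal(http.ListenAndServe(":8080", proxy))
    }

Because every request passes through the proxy process, it is the natural place to enforce rules on traffic, which is exactly the job a sidecar proxy performs.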
Figure 1-2 shows what a service mesh looks like with two service instances: frontend and backend. When frontend calls backend, frontend’s local proxy captures the outgoing request. frontend’s proxy then forwards the request to the backend service. When the request reaches the backend service, again, it is captured by backend’s local proxy and inspected. If the request is allowed, backend’s proxy forwards it to the actual backend service.
Each instance of a service must be deployed with its own local proxy.2 The pattern of deploying a helper application—in this case a proxy—alongside the main service is known as the sidecar pattern, and so local proxies are referred to as sidecar proxies.
Sidecar proxies are a vital component of the service mesh because they enable control of service traffic without modifying or redeploying the underlying services. Since sidecar proxies run as separate processes from the services, they can be reconfigured without affecting the services. For example, the backend service’s sidecar proxy from Figure 1-2 could be reconfigured to refuse traffic from the frontend service without changing code or redeploying the backend service itself.
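As a rough illustration of such a rule, here is a sketch in Go of a sidecar that refuses traffic from frontend. The X-Service-Name header and the addresses are invented for this example; real meshes identify callers cryptographically rather than by a plain header:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync/atomic"
    )

    // deniedSource holds the name of the caller to refuse. Swapping this
    // value is the "reconfiguration": no restart of the proxy or the
    // backend service is needed.
    var deniedSource atomic.Value

    func withPolicy(proxy http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            denied, _ := deniedSource.Load().(string)
            if denied != "" && r.Header.Get("X-Service-Name") == denied {
                http.Error(w, "denied by mesh policy", http.StatusForbidden)
                return
            }
            proxy.ServeHTTP(w, r)
        })
    }

    func main() {
        // Hypothetical address of the local backend instance.
        upstream, err := url.Parse("http://127.0.0.1:9090")
        if err != nil {
            log.Fatal(err)
        }
        deniedSource.Store("frontend") // a rule pushed from outside
        proxy := httputil.NewSingleHostReverseProxy(upstream)
        log.Fatal(http.ListenAndServe(":8080", withPolicy(proxy)))
    }

Storing a new value in deniedSource is all the reconfiguration amounts to; the backend service itself is untouched.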
What handles reconfiguring proxies? The control plane.
Control Plane
The control plane’s job is to manage and configure the sidecar proxies. As you can see in Figure 1-3, the control plane is a separate service that must be deployed on its own; it is not deployed as a sidecar. The control plane is where most of the complex logic of the service mesh lives: it must watch for services starting and stopping, sign and distribute certificates, reconfigure proxies, etc. The sidecar proxies themselves are relatively simple: they receive configuration from the control plane detailing which actions to perform on traffic, and they perform those actions.
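The following Go sketch models that division of labor. The ProxyConfig type and its fields are invented for illustration; a real mesh distributes far richer configuration over protocols such as Envoy’s xDS:

    package main

    import "fmt"

    // ProxyConfig stands in for the instructions a control plane
    // distributes; the fields are invented for illustration.
    type ProxyConfig struct {
        Upstream    string   // where the sidecar should forward traffic
        DeniedPeers []string // callers the sidecar should reject
        EmitMetrics bool     // whether the sidecar should report metrics
    }

    func main() {
        updates := make(chan ProxyConfig)

        // The control plane does the complex work: watching the mesh and
        // computing new configurations as conditions change.
        go func() {
            updates <- ProxyConfig{Upstream: "127.0.0.1:9090"}
            updates <- ProxyConfig{Upstream: "127.0.0.1:9090", EmitMetrics: true}
            close(updates)
        }()

        // The sidecar proxy stays simple: it applies whatever arrives.
        for cfg := range updates {
            fmt.Printf("proxy reconfigured: %+v\n", cfg)
        }
    }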
If we go back to my definition of a service mesh—“an infrastructure layer that enables you to control the network communication of your workloads from a single control plane”—you can now see how the proxies and control plane fit in.
The control plane is the single location service mesh operators interact with. In turn, it configures the proxies that control network communication. Together, the control plane and proxies make up the infrastructure layer.
Concrete Example
Let’s go through a concrete example to show how a service mesh works in practice. Figure 1-4 shows the architecture for this example.
When frontend calls backend, frontend’s sidecar proxy captures the request. In Figure 1-5, the service mesh has configured frontend’s proxy to pass traffic through to the backend service without modification. The sidecar proxy running alongside backend captures the incoming traffic and forwards the request to the actual backend service instance. The backend service instance processes the request and sends a response that returns along the same path.
Now imagine that you’ve got a new requirement to get metrics on how many requests per second frontend is making to backend. You could make changes to the code of frontend and backend to emit these metrics, but with the service mesh in place, there’s a simpler way, as shown in Figure 1-6. First, you configure the control plane with the URL of your metrics database (step 1). Immediately, the control plane reconfigures both sidecar proxies and instructs them to emit metrics (step 2). Now when frontend calls backend (step 3), each proxy emits metrics to the metrics database (step 4), and you can see the requests per second in your metrics dashboard.
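Here is a hedged sketch of what step 2 could look like inside a sidecar once metrics are enabled: the proxy tallies requests itself and reports requests per second. The metrics URL is hypothetical, and the fmt.Printf stands in for a real metrics client:

    package main

    import (
        "fmt"
        "net/http"
        "sync/atomic"
        "time"
    )

    var requestCount int64

    // countRequests wraps the proxy's handler so every request is tallied.
    func countRequests(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            atomic.AddInt64(&requestCount, 1)
            next.ServeHTTP(w, r)
        })
    }

    // reportMetrics sends requests-per-second to the configured metrics
    // database once per second; printing stands in for a real client.
    func reportMetrics(metricsURL string) {
        for range time.Tick(time.Second) {
            n := atomic.SwapInt64(&requestCount, 0)
            fmt.Printf("report to %s: requests_per_second=%d\n", metricsURL, n)
        }
    }

    func main() {
        go reportMetrics("http://metrics.example.com/write") // hypothetical URL
        backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from backend")
        })
        http.ListenAndServe(":8080", countRequests(backend))
    }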
Notice that you didn’t have to change the code of either service, nor did you need to redeploy anything. With a single configuration change, you immediately got metrics for the frontend and backend services.
This concrete example should help you understand how a service mesh works in practice, but it is a simplified picture. In a typical service mesh deployment, the control plane manages hundreds of services and workloads, so the architecture looks more like Figure 1-7. It looks like a mesh, hence the name!
With a larger mesh, you can see how being able to control networking across all of these services from a single location without redeploying any services or changing code is incredibly powerful. That brings us to why you would use a service mesh.
Why Use a Service Mesh
A service mesh provides features in four areas: security, observability, reliability, and traffic control. The fundamental value proposition of a service mesh is the ability to provide these features across every service and workload without modifying service code.
In the following sections, I will expand upon these areas, but it’s important to note that the features a service mesh provides can also be implemented in service code! If these features can be implemented in service code, why deploy a service mesh at all? The answer is that past a certain scale, recoding every service costs more in engineering time than running a service mesh does. This will be addressed more fully in “When to Use a Service Mesh”.
Security
One of the primary reasons companies deploy service meshes is to secure their networks. Typically this means encrypting traffic between all workloads and implementing authentication and authorization.
Solving this problem can be very difficult in a microservices architecture without a service mesh. Requiring every request to be encrypted means provisioning Transport Layer Security (TLS) certificates to every service in a secure way and managing your own certificate signing infrastructure. Authenticating and authorizing every request means updating and maintaining authentication code in every service.
A service mesh makes this work much easier because it can issue certificates and configure sidecar proxies to encrypt traffic and perform authorization—all without any changes to the underlying services (see Figure 1-8).
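To make the proxies’ role concrete, the following Go sketch shows the server side of mutual TLS: present a certificate, and refuse any client whose certificate wasn’t signed by the mesh’s certificate authority. The file names are hypothetical; in a real mesh, the control plane issues and rotates these certificates automatically:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // The mesh's certificate authority; the path is hypothetical.
        caPEM, err := os.ReadFile("mesh-ca.pem")
        if err != nil {
            log.Fatal(err)
        }
        caPool := x509.NewCertPool()
        caPool.AppendCertsFromPEM(caPEM)

        server := &http.Server{
            Addr: ":8443",
            TLSConfig: &tls.Config{
                ClientCAs: caPool,
                // Refuse any caller that cannot present a certificate
                // signed by the mesh CA: this is the core of mutual TLS.
                ClientAuth: tls.RequireAndVerifyClientCert,
            },
        }
        // These certificate files would be issued by the control plane.
        log.Fatal(server.ListenAndServeTLS("backend-cert.pem", "backend-key.pem"))
    }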
Observability
Observability is the ability to understand what’s happening to your services while they’re running. Observability data is essential for understanding microservices architectures and diagnosing failure, but it can be challenging to configure all your services to emit metrics and other data in a unified way.
Capturing observability data is the perfect job for a service mesh because all requests flow through its proxies. The service mesh can configure its proxies to emit metrics across all your services in a consistent format without modifying or redeploying the underlying services.
Reliability
In distributed systems, there’s often something failing. Building reliable distributed systems means reducing failure where possible and handling failure gracefully when it inevitably happens.
Reducing failure might mean implementing health checking so that traffic is only sent to healthy services. Handling failure might mean retrying requests that failed (see Figure 1-9) or implementing a timeout so that a service doesn’t wait forever for a response.
Implementing these techniques in code is time-consuming, error-prone, and difficult to do in a consistent way across all your services. With a service mesh, the proxies can perform these techniques for any of your services—all you need to do is interact with the control plane. You can also adjust the settings in real time as service loads change.
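As an illustration, here is a sketch in Go of retries with a per-attempt timeout, the kind of logic a sidecar applies on a service’s behalf. The attempt count, timeout, and URL are illustrative values, not the defaults of any particular mesh:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // callWithRetries gives each attempt its own timeout and retries on
    // failure, so a transient error never reaches the calling service.
    func callWithRetries(url string, attempts int, timeout time.Duration) (*http.Response, error) {
        client := &http.Client{Timeout: timeout}
        var lastErr error
        for i := 0; i < attempts; i++ {
            resp, err := client.Get(url)
            if err == nil && resp.StatusCode < 500 {
                return resp, nil
            }
            if err == nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("upstream returned %s", resp.Status)
            } else {
                lastErr = err
            }
            time.Sleep(100 * time.Millisecond) // brief pause between attempts
        }
        return nil, fmt.Errorf("all %d attempts failed: %v", attempts, lastErr)
    }

    func main() {
        resp, err := callWithRetries("http://127.0.0.1:9090/", 3, 2*time.Second)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }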
Traffic Control
Traffic control is about controlling where traffic between services is routed. Traffic control solves many problems:
- Implementing deployment strategies such as canary deployments, in which a small amount of “canary” traffic is routed to the new version of a service to see if it’s working before fully rolling out the new version (a routing sketch follows this list)
- Monolith-to-microservices migrations, in which services are split off from the monolith and traffic previously routed to the monolith is seamlessly rerouted to the new microservices
- Multi-cluster failover, in which traffic is routed to services in other healthy clusters if the local cluster is down
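To illustrate the first of these, here is a sketch of weighted canary routing: the proxy picks an upstream per request so that roughly 5% of traffic reaches the new version. The upstream names and the percentage are assumptions for this example:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // chooseUpstream sends canaryPercent of requests to the new version
    // and the rest to the stable version.
    func chooseUpstream(canaryPercent int) string {
        if rand.Intn(100) < canaryPercent {
            return "backend-v2:9090" // canary
        }
        return "backend-v1:9090" // stable
    }

    func main() {
        counts := map[string]int{}
        for i := 0; i < 1000; i++ {
            counts[chooseUpstream(5)]++
        }
        fmt.Println(counts) // roughly 95% v1, 5% v2
    }

In a real mesh, an operator sets this weight through the control plane and raises it gradually as confidence in the new version grows.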
Features in Combination
Now you should understand the types of features a service mesh provides around security, observability, reliability, and traffic control. Alone, these features are helpful, but they are even more powerful in combination.
For example, observability data provided by the service mesh can be combined with reliability and traffic control features. If the mesh detects that a service instance is returning errors, it can redirect traffic to healthy instances or a whole different cluster. Or mesh security features can be combined with the observability features to detect when a service is attempting to make requests it’s not authorized to make—potentially indicating a security breach. When you deploy Consul yourself, you’ll see many use cases where you can combine the service mesh features.
If your organization needs these features, you must decide whether it’s worth the additional complexity of a service mesh, or if you should implement them in service code. The key to answering that question is examining your scale.
When to Use a Service Mesh
There is no doubt that deploying a service mesh adds complexity. You now have sidecar proxies and the service mesh control plane to manage. In addition, you will need more compute resources (CPU and memory) to run the proxies and control plane, and all traffic now takes an extra hop through the local sidecar proxies, which adds latency. Implementing service mesh features in code would save resources and reduce infrastructure complexity (although it would add code complexity). For a service mesh to be worth it, it must provide a lot of value to your organization.
A simple formula for when to use a service mesh: you (a) need to solve networking problems in the areas outlined previously (security, observability, reliability, and traffic control), and (b) are at a scale, or will soon be at a scale, where it’s too costly to solve those problems in service code.
For example, say your organization is moving to what’s known as a zero trust security architecture where all internal traffic is encrypted, authenticated, and authorized. If you’re only running two microservices, you can easily recode those services. However, if you’re running 400 microservices, then it’s unlikely that you’ll be able to recode all those services in a reasonable amount of time. In this case, a service mesh makes a lot of sense.
In addition, at a certain scale, there will be services and workloads that you want control over but whose code you can’t actually edit. For example, maybe you’re deploying packaged open source software, or perhaps you’re using a cloud-managed database. Ideally, you would have the same control over those workloads that you have over your other services.
In the end, the exact scale at which it makes sense to use a service mesh will depend on your specific organization and the problems you’re trying to solve. I hope that this book will help you understand the problems a service mesh solves and help you gauge whether it makes sense in your situation.
Summary
In this chapter, you learned what a service mesh is, how it works, and why you’d use one.
I introduced my definition of a service mesh:
An infrastructure layer that enables you to control the network communication of your workloads from a single control plane.
And I discussed how the two components of a service mesh, the proxies and the control plane, enable the control of network communication. You walked through a concrete example of a working service mesh, and I discussed the four categories of service mesh features: security, observability, reliability, and traffic control.
Finally, I addressed when you should use a service mesh: when you need these features and you’re at a scale where it’s too costly to implement them in service code.
So far, everything discussed in this chapter has been applicable to most service meshes and not specific to Consul. The next chapter is devoted to Consul in particular. You’ll learn about how it works, its architecture, the protocols it uses, and what makes it unique.
1 Some meshes use other technology such as iptables or eBPF to control traffic rather than a separate proxy process.
2 If it’s impossible to deploy a local proxy—for example, with a managed service such as Amazon Relational Database Service—you can use a terminating gateway as covered in Chapter 10.