Chapter 1. Service Mesh Fundamentals
Why is operating microservices difficult? What is a service mesh, and why do I need one?
Many emergent technologies build on or reincarnate prior thinking and approaches to computing and networking paradigms. Why is this phenomenon necessary? In the case of service meshes, we’ll blame the microservices and containers movement—the cloud-native approach to designing scalable, independently delivered services. Microservices have exploded what were once internal application communications into a mesh of service-to-service remote procedure calls (RPCs) transported over networks. Bearing many benefits, microservices provide democratization of language and technology choice across independent service teams—teams that create new features quickly as they iteratively and continuously deliver software (typically as a service).
Operating Many Services
And, sure, the first few microservices are relatively easy to deliver and operate—at least compared to the difficulties organizations face the day they arrive at many microservices. Whether that “many” is 10 or 100, the onset of a major headache is inevitable. Different medicines are dispensed to alleviate microservices headaches; use of client libraries is one notable example. Language- and framework-specific client libraries, whether preexisting or created, are used to address distributed systems challenges in microservices environments. It’s in these environments that many teams first consider their path to a service mesh. The sheer volume of services that must be managed on an individual, distributed basis (versus centrally, as with monoliths) and the challenges of ensuring reliability, observability, and security of these services cannot be overcome with outmoded paradigms; hence, the need to reincarnate prior thinking and approaches. New tools and techniques must be adopted.
Given the distributed (and often ephemeral) nature of microservices—and how central the network is to their functioning—it behooves us to reflect on the fallacies that networks are reliable, are without latency, have infinite bandwidth, and that communication is guaranteed. When you consider how critical the ability to control and secure service communication is to distributed systems that rely on network calls with each and every transaction, each and every time an application is invoked, you begin to understand that you are undertooled, and why running more than a few microservices on a network topology that is in constant flux is so difficult. In the age of microservices, a new layer of tooling for the caretaking of services is needed—a service mesh is needed.
What Is a Service Mesh?
Service meshes provide policy-based networking for microservices, describing desired behavior of the network in the face of constantly changing conditions and network topology. At their core, service meshes provide a developer-driven, services-first network: a network that is primarily concerned with relieving application developers of the need to build network concerns (e.g., resiliency) into their application code, and a network that empowers operators with the ability to declaratively define network behavior, node identity, and traffic flow through policy.
Value derived from the layer of tooling that service meshes provide is most evident in the land of microservices. The more services, the more value derived from the mesh. In subsequent chapters, I show how service meshes provide value outside of the use of microservices and containers and help modernize existing services (running on virtual or bare metal servers) as well.
Architecture and Components
Although there are a few variants, service mesh architectures commonly comprise two planes: a control plane and a data plane. The concept of these two planes immediately resonates with network engineers, given the analogous way in which physical networks (and their equipment) are designed and managed. Network engineers have long been trained on this division of concerns by plane, as shown in Figure 1-1.
The physical networking data plane (also known as the forwarding plane) contains application traffic generated by hosts, clients, servers, and applications that use the network as transport. Thus, data-plane traffic should never have source or destination IP addresses that belong to network elements such as routers and switches; rather, it should be sourced from and destined to end devices such as PCs and servers. Routers and switches use hardware chips—application-specific integrated circuits (ASICs)—to optimally forward data-plane traffic as quickly as possible.
Let’s contrast physical networking planes and network topologies with those of service meshes.
Physical network planes
The physical networking control plane operates as the logical entity associated with router processes and functions used to create and maintain necessary intelligence about the state of the network (topology) and a router’s interfaces. The control plane includes network protocols, such as routing, signaling, and link-state protocols that are used to build and maintain the operational state of the network and provide IP connectivity between IP hosts.
The physical networking management plane is the logical entity that describes the traffic used to access, manage, and monitor all of the network elements. The management plane supports all required provisioning, maintenance, and monitoring functions for the network. Although network traffic in the control plane is handled in-band with all other data-plane traffic, management-plane traffic is capable of being carried via a separate out-of-band (OOB) management network to provide separate reachability in the event that the primary in-band IP path is not available (and create a security boundary).
Physical networking control and data planes are tightly coupled and generally vendor provided as a proprietary integration of hardware and firmware. Software-defined networking (SDN) has done much to insert standards and decouple the two. We’ll see that the control and data planes of service meshes are not necessarily tightly coupled.
Physical network topologies
Common physical networking topologies include star, spoke-and-hub, tree (also called hierarchical), and mesh. As depicted in Figure 1-2, nodes in mesh networks connect directly and nonhierarchically such that each node is connected to an arbitrary number (usually as many as possible or as needed dynamically) of neighbor nodes so that there is at least one path from a given node to any other node to efficiently route data.
When I designed mesh networks as an engineer at Cisco, I did so to create fully interconnected, wireless networks. Wireless is the canonical use case for mesh networks for which the networking medium is readily susceptible to line-of-sight, weather-induced, or other disruption, and, therefore, for which reliability is of paramount concern. Mesh networks generally self-configure, enabling dynamic distribution of workloads. This ability is particularly key to both mitigate risk of failure (improve resiliency) and to react to continuously changing topologies. It’s readily apparent why this network topology is the design of choice for service mesh architectures.
Service mesh network planes
Again, service mesh architectures typically employ data and control planes (see Figure 1-3). Service meshes typically consolidate the analogous physical network control and management planes into the control plane, leaving some observability aspects of the management plane as integration points to external monitoring tools. As in physical networking, service mesh data planes handle the actual inspection, transiting, and routing of network traffic, whereas the control plane sits out-of-band providing a central point of management and backend/underlying infrastructure integration. Depending upon which architecture you use, both planes might or might not be deployed.
A service mesh data plane (otherwise known as the proxying layer) intercepts every packet in the request and is responsible for health checking, routing, load balancing, authentication, authorization, and generation of observable signals. Service proxies are transparently inserted, and as applications make service-to-service calls, applications are unaware of the data plane’s existence. Data planes are responsible for intracluster communication as well as inbound (ingress) and outbound (egress) cluster network traffic. Whether traffic is entering the mesh (ingressing) or leaving the mesh (egressing), application service traffic is directed first to the service proxy for handling. In Istio’s case, traffic is transparently intercepted using iptables rules and redirected to the service proxy.
A service mesh control plane is called for when the number of proxies becomes unwieldy or when a single point of visibility and control is required. Control planes provide policy and configuration for services in the mesh, taking a set of isolated, stateless proxies and turning them into a service mesh. Control planes do not directly touch any network packets in the mesh. They operate out-of-band. Control planes typically have a command-line interface (CLI) and user interface with which to interact, each of which provides access to a centralized API for holistically controlling proxy behavior. You can automate changes to the control plane configuration through its APIs (e.g., by a continuous integration/continuous deployment pipeline), where, in practice, configuration is most often version controlled and updated.
Proxies are generally considered stateless, which is a thought-provoking characterization. The control plane generally informs proxies of the presence of services, mesh topology updates, traffic and authorization policy, and so on; proxies cache the state of the mesh but aren’t regarded as the source of truth for it.
Reflecting on Linkerd (pronounced “linker-dee”) and Istio as two popular open source service meshes, we find examples of how the data and control planes are packaged and deployed. In terms of packaging, Linkerd contains both its proxying component (linkerd) and its control plane (namerd) packaged together simply as “Linkerd,” whereas Istio brings a collection of control-plane components (such as Citadel) to pair by default with Envoy (a data plane), packaged together as “Istio.” Envoy is often labeled a service mesh, inappropriately so, because it takes packaging with a control plane (we cover a few projects that have done so) to form a service mesh. Popular as it is, Envoy is more often found deployed standalone as an API or ingress gateway.
In terms of control-plane deployment, using Kubernetes as the example infrastructure, control planes are typically deployed in a separate “system” namespace. In terms of data-plane deployment, some service meshes, like Conduit, have proxies that are created as part of the project and are not designed to be configured by hand, but instead have their behavior driven entirely by the control plane. Other service meshes, like Istio, choose not to develop their own proxy; instead, they ingest and use independent proxies (separate projects), which facilitates choice of proxy and its deployment outside of the mesh (standalone).
Why Do I Need One?
At this point, you might be thinking, “I have a container orchestrator. Why do I need another infrastructure layer?” With microservices and containers mainstreaming, container orchestrators provide much of what the cluster (nodes and containers) needs. Necessarily, the core focus of container orchestrators is scheduling, discovery, and health, primarily at an infrastructure level (Layer 4 and below, if you will). Consequently, microservices are left with unmet, service-level needs. A service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable, often relying on a container orchestrator or integration with another service discovery system for operation. Service meshes often deploy as a separate layer atop container orchestrators but do not require one, in that control and data-plane components may be deployed independent of containerized infrastructure. As you’ll see in Chapter 3, a node agent (including a service proxy) is often deployed as the data-plane component in non-container environments.
As noted, in microservices deployments, the network is directly and critically involved in every transaction, every invocation of business logic, and every request made to the application. Network reliability and latency are at the forefront of concerns for modern, cloud-native applications. A given cloud-native application might be composed of hundreds of microservices, each of which might have many instances and each of those ephemeral instances rescheduled as and when necessary by a container orchestrator.
Understanding the network’s criticality, what would you want out of a network that connects your microservices? You want your network to be as intelligent and resilient as possible. You want your network to route traffic away from failures to increase the aggregate reliability of your cluster. You want your network to avoid unwanted overhead like high-latency routes or servers with cold caches. You want your network to ensure that the traffic flowing between services is secure against trivial attack. You want your network to provide insight by highlighting unexpected dependencies and root causes of service communication failure. You want your network to let you impose policies at the granularity of service behaviors, not just at the connection level. And, you don’t want to write all this logic into your application.
You want Layer 5 management. You want a services-first network. You want a service mesh.
Value of a Service Mesh
Service meshes provide visibility, resiliency, traffic, and security control of distributed application services. Much value is promised here, particularly to the extent that it is gained without the need to change your application code (or much of it).
Many organizations are initially attracted to the uniform observability that service meshes provide. No complex system is ever fully healthy. Service-level telemetry illuminates where your system is behaving sickly, shedding light on difficult-to-answer questions like why your requests are slow to respond. Identifying when a specific service is down is relatively easy, but identifying where it’s slow and why is another matter.
From the application’s vantage point, service meshes largely provide black-box monitoring (observing a system from the outside) of service-to-service communication, leaving white-box monitoring (observing a system from the inside—reporting measurements from inside-out) of an application as the responsibility of the microservice. Proxies that comprise the data plane are well-positioned (transparently, in-band) to generate metrics, logs, and traces, providing uniform and thorough observability throughout the mesh as a whole, as seen in Figure 1-4.
You are probably accustomed to having individual monitoring solutions for distributed tracing, logging, security, access control, and so on. Service meshes centralize and assist in solving these observability challenges by providing the following:
- Logs are used to baseline visibility for access requests to your entire fleet of services. Figure 1-5 illustrates how telemetry transmitted through service mesh logs includes source and destination, request protocol, endpoint (URL), associated response code, and response time and size.
- Metrics are used to remove dependency and reliance on the development process to instrument code to emit metrics. When metrics are ubiquitous across your cluster, they unlock new insights; consistent metrics enable automation for things like autoscaling, for example. Example telemetry emitted by service mesh metrics includes global request volume, global success rate, and individual service responses by version, source, and time, as shown in Figure 1-6.
- Without tracing, slow services (versus services that simply fail) are the most difficult to debug. Traces are used to visualize dependencies, request volumes, and failure rates. Imagine manually enumerating all of your service dependencies and tracking them in a spreadsheet. With automatically generated span identifiers, service meshes make integrating tracing functionality almost effortless. Individual services in the mesh still need to forward context headers, but that’s it. In contrast, many application performance management (APM) solutions require manual instrumentation to get traces out of your services. Later, you’ll see that in the sidecar proxy deployment model, sidecars are ideally positioned to trace the flow of requests across services.
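To make the context-forwarding requirement concrete, here is a minimal sketch of a service copying B3-style tracing headers (the Zipkin-compatible convention many meshes use) from an incoming request to its outgoing calls. The header names follow the B3 convention; the function name is hypothetical.

```python
# B3-style trace headers that a service must copy from incoming to
# outgoing requests so the mesh's proxies can stitch spans together.
TRACE_HEADERS = ["x-request-id", "x-b3-traceid", "x-b3-spanid",
                 "x-b3-parentspanid", "x-b3-sampled"]

def forward_trace_context(incoming_headers):
    """Return only the tracing headers to attach to downstream calls."""
    return {k: v for k, v in incoming_headers.items()
            if k.lower() in TRACE_HEADERS}

incoming = {"X-B3-TraceId": "463ac35c9f6413ad", "Authorization": "secret"}
print(forward_trace_context(incoming))
```

Everything else—span creation, timing, and reporting—is handled by the proxies; the application’s only job is this copy-through.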
Service meshes provide granular, declarative control over network traffic to determine where a request is routed to perform canary release, for example. Resiliency features typically include circuit breaking, latency-aware load balancing, eventually consistent service discovery, retries, timeouts, and deadlines.
Timeouts provide cancellation of service requests when a request doesn’t return to the client within a predefined time. Timeouts limit the amount of time spent on any individual request, commonly enforced at a point in time after which a response is considered invalid or too long for a client (user) to wait. Deadlines are an advanced service mesh feature in that they facilitate feature-level timeouts (spanning a collection of requests) rather than independent service timeouts, helping to avoid retry storms. Deadlines deduct the time left to handle a request at each step, propagating elapsed time with each downstream service call as the request travels through the mesh. Timeouts and deadlines, illustrated in Figure 1-7, can be considered enforcers of your Service-Level Objectives (SLOs).
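The deduction of remaining time at each hop can be sketched in a few lines. This is an illustrative model, not any particular mesh’s API: each “downstream call” receives the time budget left, mimicking a propagated deadline header.

```python
import time

class DeadlineExceeded(Exception):
    """Raised when the overall request deadline has elapsed."""

def call_with_deadline(steps, deadline_seconds):
    """Run a chain of downstream calls under one shared deadline.

    Each step receives the time remaining, the way a mesh propagates
    a deadline with each downstream hop; once the budget is spent,
    no further hops are attempted.
    """
    deadline = time.monotonic() + deadline_seconds
    results = []
    for step in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise DeadlineExceeded("no time budget left for this hop")
        results.append(step(remaining))
    return results
```

Contrast this with independent per-service timeouts, where three services each allowed 2 seconds can keep a client waiting 6 seconds; a shared 2-second deadline caps the whole feature.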
When a service times out or returns unsuccessfully, you might choose to retry the request. Simple retries bear the risk of making things worse by retrying the same call to a service that is already underwater (retry three times = 300% more service load). Retry budgets (aka maximum retries), however, provide the benefit of multiple tries but with a limit, so as not to overload an already load-challenged service. Some service meshes take the elimination of client contention further by introducing jitter and an exponential back-off algorithm into the calculation of the timing of the next retry attempt.
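A sketch of that timing calculation, assuming the common “full jitter” variant (delay drawn uniformly between zero and an exponentially growing, capped ceiling); the function name and defaults are illustrative:

```python
import random

def backoff_delays(base=0.1, cap=2.0, retries=4):
    """Exponential back-off with full jitter.

    Each retry's delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], spreading clients out so that
    failed calls don't all retry at the same instant (a retry storm).
    """
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

print(backoff_delays())  # four randomized, growing-on-average delays
```

Without the jitter, every client that failed at the same moment would retry at the same moment, re-creating the spike that caused the failures.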
Instead of retrying and adding more load to the service, you might elect to fail fast and disconnect the service, disallowing calls to it. Circuit breaking provides configurable failure thresholds to ensure safe maximums and facilitate graceful failure, commonly for slow-responding services. Using a service mesh as a separate layer to implement circuit breaking avoids undue overhead on applications (services) at a time when they are already oversubscribed.
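The mechanics are worth seeing in miniature. Below is a deliberately tiny breaker (class and parameter names are illustrative, not any mesh’s configuration schema): after a threshold of consecutive failures the circuit opens and calls fail fast, until a reset interval passes and one trial call is allowed through.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: open after `max_failures`
    consecutive failures, fail fast until `reset_after` seconds pass,
    then allow a single trial (half-open) call."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

The point of the pattern is the fast failure: callers get an immediate error instead of queuing behind a struggling service, which gives that service room to recover.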
Rate limiting (throttling) is used to ensure stability of a service so that when one client causes a spike in requests, the service continues to run smoothly for other clients. Rate limits are usually measured over a period of time, but you can use different algorithms (fixed or sliding window, sliding log, etc.). Rate limits are typically operationally focused on ensuring that your services aren’t oversubscribed. When a limit is reached, well-implemented services commonly adhere to IETF RFC 6585, sending 429 Too Many Requests as the response code, including headers, such as the following, describing the request limit, number of requests remaining, and amount of time remaining until the request counter is reset:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1372016266
Rate limiting protects your services from overuse by limiting how often a client (most often mapped to a user access token) can call your service(s), and provides operational resiliency (e.g., service A can handle only 500 requests per second).
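A fixed-window limiter, the simplest of the algorithms mentioned above, can be sketched as follows. The class name and return shape are illustrative; real meshes enforce this in the proxy and share counters across replicas.

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window rate limiter returning a
    (status, headers) pair in the spirit of RFC 6585's
    429 Too Many Requests response."""

    def __init__(self, limit=60, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.time()
        self.count = 0

    def check(self):
        now = time.time()
        if now - self.window_start >= self.window:
            self.window_start = now  # new window: reset the counter
            self.count = 0
        reset_at = int(self.window_start + self.window)
        headers = {"X-RateLimit-Limit": self.limit,
                   "X-RateLimit-Reset": reset_at}
        if self.count >= self.limit:
            headers["X-RateLimit-Remaining"] = 0
            return 429, headers
        self.count += 1
        headers["X-RateLimit-Remaining"] = self.limit - self.count
        return 200, headers
```

Fixed windows are cheap but allow a burst of up to twice the limit at a window boundary, which is why sliding-window and sliding-log variants exist.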
Subtly distinguished from rate limiting is quota management (or conditional rate limiting), which is primarily used for accounting of requests based on business requirements, as opposed to limiting rates based on operational concerns. It can be difficult to distinguish between rate limiting and quota management, given that these two features can be implemented by the same service mesh capability but presented differently to users.
The canonical example of quota management is a policy setting a threshold for the number of client requests allowed to a service over a period of time; for example, user Lee is subscribed to the Free service plan and allowed only 10 requests per day. Quota policy enforces consumption limits on services by maintaining a distributed counter that tallies incoming requests, often using an in-memory datastore like Redis. Conditional rate limits are a powerful service mesh capability when implemented based on a user-defined set of arbitrary attributes.
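That counter can be sketched as below. The plan names and quota values echo the Lee example above and are hypothetical; a plain dict stands in for the distributed datastore (e.g., Redis) a production mesh would use.

```python
# Hypothetical business-plan quotas, keyed by the client's plan.
PLAN_QUOTAS = {"free": 10, "pro": 1000}

class QuotaCounter:
    """Per-client request counter enforcing plan-based quotas.

    In a real mesh this would be a distributed counter (often Redis)
    shared by every proxy; a local dict stands in here."""

    def __init__(self):
        self.counts = {}

    def allow(self, client, plan):
        used = self.counts.get(client, 0)
        if used >= PLAN_QUOTAS[plan]:
            return False  # quota exhausted for this accounting period
        self.counts[client] = used + 1
        return True
```

Note that the decision keys on a business attribute (the plan), not on operational load, which is what separates quota management from plain rate limiting.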
Most service meshes provide a certificate authority to manage keys and certificates for securing service-to-service communication. Certificates are generated per service and provide the unique identity of that service. When sidecar proxies are used (discussed in Chapter 3), they take on the identity of the service and perform life-cycle management of certificates (generation, distribution, refresh, and revocation) on its behalf. In sidecar proxy deployments, you’ll typically find that local TCP connections are established between the service and sidecar proxy, whereas mutual Transport Layer Security (mTLS) connections are established between proxies, as demonstrated in Figure 1-8.
Encrypting traffic internal to your application is an important security consideration. No longer are your application’s service calls kept inside a single monolith via localhost; they are exposed over the network. Allowing service calls without TLS on the transport is setting yourself up for security problems. When two mesh-enabled services communicate, they have strong cryptographic proof of their peers. After identities are established, they are used in constructing access control policies, determining whether a request should be serviced. Depending on the service mesh used, policy controls configuration of the key management system (e.g., certificate refresh interval) as well as operational access control used to determine whether a request is accepted. Whitelists and blacklists identify approved and unapproved connection requests, as can more granular access control factors like time of day.
Delay and fault injection
The notion that your systems will fail must be embraced. Why not preemptively inject failure and verify behavior? Given that proxies sit in line to service traffic, they often support protocol-specific fault injection, allowing configuration of the percentage of requests that should be subjected to faults or network delay. For instance, generating HTTP 500 errors helps to verify the robustness of your distributed application in terms of how it behaves in response.
Injecting latency into requests without a service mesh can be a tedious task, yet latency is probably the more common issue faced during operation of an application. Slow responses that result in an HTTP 503 after a minute of waiting leave users much more frustrated than a 503 after six seconds. Arguably, the best part of these resilience testing capabilities is that no application code needs to change in order to facilitate these tests. Results of the tests, on the other hand, might well have you changing application code.
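Conceptually, what the proxy does is wrap the request path, as in this sketch (function and parameter names are illustrative, not a mesh configuration): a configured percentage of requests is aborted with an HTTP 500, another percentage is delayed before being handled normally.

```python
import random
import time

def inject_faults(handler, abort_percent=10, delay_percent=10,
                  delay_seconds=0.5):
    """Wrap a request handler with proxy-style fault injection.

    abort_percent of requests return an injected HTTP 500; a further
    delay_percent are delayed by delay_seconds before being handled.
    """
    def wrapped(request):
        roll = random.uniform(0, 100)
        if roll < abort_percent:
            return 500, "injected fault"
        if roll < abort_percent + delay_percent:
            time.sleep(delay_seconds)  # injected latency
        return handler(request)
    return wrapped
```

Because the wrapping happens in the proxy, the application under test is exactly the application you run in production.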
Using a service mesh, developers invest much less in writing code to deal with infrastructure concerns—code that might be on a path to being commoditized by service meshes. The separation of service and session-layer concerns from application code manifests in the form of a phenomenon I refer to as a decoupling at Layer 5.
Decoupling at Layer 5
Service meshes help you to avoid bloated service code, fat on infrastructure concerns.
Duplicative work is avoided in making services production-ready by singularly addressing load balancing, autoscaling, rate limiting, traffic routing, and so on. Teams avoid inconsistency of implementation across different services to the extent that the same central control is provided for retries and budgets, failover, deadlines, cancellation, and so forth. Implementations done in silos lead to fragmented, non-uniform policy application and difficult debugging.
Service meshes insert a dedicated infrastructure layer between dev and ops, separating out common concerns of service communication by providing independent control over them. The service mesh is a networking model that sits at a layer of abstraction above TCP/IP. Without a service mesh, operators remain tied to developers for many concerns: they need new application builds to control network traffic and shaping, to affect access control, and to change which services talk to downstream services. The decoupling of dev and ops is key to providing autonomous, independent iteration.
Decoupling is an important trend in the industry. If you have a significant number of services, you almost certainly have both of these two roles: developers and operators. Just as microservices is a trend in the industry for allowing teams to independently iterate, so do service meshes allow teams to decouple and iterate faster. Technical reasons for having to coordinate between teams dissolve in many circumstances, like the following:
- Operators don’t necessarily need to involve Developers to change how many times a service should retry before timing out.
- Customer Success teams can handle the revocation of client access without involving Operators.
- Product Owners can use quota management to enforce price-plan limitations for quantity-based consumption of particular services.
- Developers can redirect their internal stakeholders to a canary with beta functionality without involving Operators.
Microservices decouple functional responsibilities within an application from one another, allowing development teams to independently iterate and move forward. Figure 1-9 shows that in the same fashion, service meshes decouple functional responsibilities of instrumentation and operating services from developers and operators, providing an independent point of control and centralization of responsibility.
Even though service meshes facilitate a separation of concerns, both developers and operators should understand the details of the mesh. The more everyone understands, the better. Operators can obtain uniform metrics and traces from running applications involving diverse language frameworks without relying on developers to manually instrument their applications. Developers tend to consider the network as a dumb transport layer that really doesn’t help with service-level concerns. We need a network that operates at the same level as the services we build and deploy.
Essentially, you can think of a service mesh as surfacing the session layer of the OSI model as a separately addressable, first-class citizen in your modern architecture.
The data plane carries the actual application request traffic between service instances. The control plane configures the data plane, provides a point of aggregation for telemetry, and also provides APIs for modifying the mesh’s behavior.
Decoupling of dev and ops avoids diffusion of the responsibility of service management, centralizing control over these concerns into a new infrastructure layer: Layer 5.
Service meshes make it possible for services to regain a consistent, secure way to establish identity within a datacenter and, furthermore, to do so based on strong cryptographic primitives rather than deployment topology.
With each deployment of a service mesh, developers are relieved of their infrastructure concerns and can refocus on their primary task (of creating business logic). More seasoned software engineers might have difficulty breaking the habit and trusting that the service mesh will provide, or in displacing the psychological dependency on their handy (but less capable) client library.
Many organizations find themselves in the situation of having incorporated too many infrastructure concerns into application code. Service meshes are a necessary building block when composing production-grade microservices. The power of easily deployable service meshes will allow for many smaller organizations to enjoy features previously available only to large enterprises.