In a recent episode of the O’Reilly Media Podcast, we spoke with George Miranda about the importance of service mesh technology in creating reliable distributed systems. As discussed in the new report The Service Mesh: Resilient Service-to-Service Communication for Cloud Applications, service mesh technology has emerged as a popular tool for companies looking to build cloud-native applications that are reliable and secure.
During the podcast, we discussed the problems a service mesh infrastructure solves and the service mesh features you’ll find most valuable. We also talked about how to choose the right service mesh for your organization, the challenges involved in getting it deployed to production, and the best ways to get started with a service mesh.
Here are some highlights from the conversation:
The rise of containers
The rise of containers has made it easy to adopt patterns that were once relegated only to the ultra-large web-scale giants of the world. There are all sorts of benefits to that modular cloud-native-driven approach. I'm not going to rehash those here, but what I see is that, oftentimes, organizations adopt these new patterns without a whole lot of forethought about how they're going to impact their applications and production. We see the benefits, but we don't always realize some of the rub that's behind them.
The fallacies of distributed computing
It turns out that your network is unreliable, it turns out that latency is not zero, transport cost is not zero, bandwidth is finite, and so on. All of those limitations need to be accounted for. But what we're seeing now is that in applications that are shifting to this microservice world, typically, those applications have never had to account for this kind of distributed nature in the past. A service mesh gives you ways to easily solve those problems without needing to change your applications.
Managing, monitoring, and controlling distributed apps
The tunables that are exposed in the service mesh’s control plane give you control that you've never had before: things like timeouts, retries, and performance-based load balancing (not just round robin, but load balancing schemes based on performance metrics observed at the session layer). Left unchecked, retries can fall into lengthy retry loops that consume resources, create bottlenecks, and cause secondary failures. You also get a lot of constructs to help mitigate cascading failures, create custom routing rules, set up mutual TLS, provide rich service metrics, and so on. At a high level, those are some of the basic components that you can expect in any tool calling itself a “service mesh.”
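To make two of those tunables concrete, here is a minimal Python sketch of latency-aware load balancing and a cap on retries. It is an illustration under stated assumptions, not how any particular service mesh implements these features: the class names `EwmaBalancer` and `RetryBudget`, the `decay` and `ratio` parameters, and the endpoint names are all hypothetical, and a real mesh applies this logic in its proxies rather than in application code.

```python
import random


class EwmaBalancer:
    """Pick endpoints by observed latency rather than by round robin.

    Hypothetical sketch: assumes at least two endpoints are registered.
    """

    def __init__(self, endpoints, decay=0.3, rng=None):
        self.decay = decay  # weight given to the newest latency sample
        self.latency = {e: 0.0 for e in endpoints}  # moving average, in seconds
        self.rng = rng or random.Random()

    def observe(self, endpoint, seconds):
        """Fold one measured request latency into the endpoint's moving average."""
        old = self.latency[endpoint]
        self.latency[endpoint] = (1 - self.decay) * old + self.decay * seconds

    def pick(self):
        """Sample two distinct endpoints and return the one with the lower average."""
        a, b = self.rng.sample(list(self.latency), 2)
        return a if self.latency[a] <= self.latency[b] else b


class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of requests."""

    def __init__(self, ratio=0.2):
        self.ratio = ratio  # assumed cap: retries may be at most 20% of requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def can_retry(self):
        """Spend one retry if the budget allows it; otherwise refuse."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


if __name__ == "__main__":
    balancer = EwmaBalancer(["svc-a", "svc-b"])
    balancer.observe("svc-a", 0.500)  # slow replica
    balancer.observe("svc-b", 0.050)  # fast replica
    print("next request goes to:", balancer.pick())

    budget = RetryBudget(ratio=0.2)
    for _ in range(10):
        budget.record_request()
    print("retry decisions:", [budget.can_retry() for _ in range(3)])
```

Capping retries as a fraction of overall traffic, rather than as a fixed count per request, is one way to keep a struggling service from being buried under retry loops: once the budget is spent, further failures are returned to the caller instead of being retried.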
Choosing and deploying a service mesh
It's important to choose a service mesh tool that is built around the ability to be introspective. Runtime diagnostics are a really big deal in production. You have to be able to see what's happening and distinguish what's happening in the service mesh layer from what's actually happening in the application layer. Otherwise, there's going to be a lot of blame and sadness and tears and confusion when things go wrong. That just jeopardizes the entire production push. From the technical perspective, you should be able to select tools with really well-understood failure modes, and with great constructs around observability.
The biggest hurdle to deploying a service mesh
How do you get buy-in across your organization to support this layer? A lot of that is just a process of understanding the needs of your stakeholders and aligning with their values. William Morgan, our CEO, likes to say, "Any sufficiently advanced engineering work is indistinguishable from sales." I think that's true. You have to sell it internally. You have to educate people on the value, understand what their needs are, and find a fit. And, above all, you have to be crystal clear: what real business problem is this new tool solving? Without that kind of clarity, you're going to have a really hard time deploying to production. Again, this new tool will inevitably experience some kind of failure, and if you don't understand that failure well, and if you don't understand why it's strategic to your business to tolerate that kind of failure while you figure it out, that is going to be the next biggest hurdle and the biggest challenge to getting this deployed, and staying deployed, in your production environment.
This post is part of a collaboration between O'Reilly and Buoyant. See our statement of editorial independence.