The technology landscape has evolved into an always-on environment of mobile, social, and cloud applications where programs can be accessed and used across a multitude of devices.
These always-on and always-available expectations are handled by distributed systems, which manage the inevitable fluctuations and failures of complex computing behind the scenes.
“The increasing criticality of these systems means that it is necessary for these online systems to be built for redundancy, fault tolerance, and high availability,” writes Brendan Burns, distinguished engineer at Microsoft, in Designing Distributed Systems. “The confluence of these requirements has led to an order of magnitude increase in the number of distributed systems that need to be built.”
In Distributed Systems in One Lesson, developer relations leader and teacher Tim Berglund says a simple way to think about distributed systems is that they are a collection of independent computers that appears to its user as a single computer.
Virtually all modern software and applications built today are distributed systems of some sort, says Sam Newman, director at Sam Newman & Associates and author of Building Microservices. Even a monolithic application talking to a database is a distributed system, he says, “just a very simple one.”
While those simple systems can technically be considered distributed, when engineers refer to distributed systems they’re typically talking about massively complex systems made up of many moving parts communicating with one another, with all of it appearing to an end user as a single product, says Nora Jones, a senior software engineer at Netflix.
Think anything from, well, Netflix, to an online store like Amazon, to an instant messaging platform like WhatsApp, to a customer relationship management application like Salesforce, to Google’s search application. These systems require login functionality, user profiles, recommendation engines, personalization, relational databases, object databases, content delivery networks, and numerous other components, all served up cohesively to the user.
Benefits of distributed systems
These days, it’s not so much a question of why a team would use a distributed system, but rather when they should shift in that direction and how distributed the system needs to be, experts say.
Here are three inflection points—the need for scale, a more reliable system, and a more powerful system—when a technology team might consider using a distributed system.
Computing processes across a distributed system happen independently from one another, notes Berglund in Distributed Systems in One Lesson. This makes it easy to add nodes and functionality as needed. Distributed systems offer “the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago,” write Chad Carson, cofounder of Pepperdata, and Sean Suchter, director of Istio at Google, in Effective Multi-Tenant Distributed Systems.
Distributed systems create a reliable experience for end users because they rely on “hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer,” write Carson and Suchter. In a single-machine environment, if that machine fails then so too does the entire system. When computation is spread across numerous machines, there can be a failure at one node that doesn’t take the whole system down, writes Cindy Sridharan, distributed systems engineer, in Distributed Systems Observability.
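To make the reliability point concrete, here is a minimal Python sketch of failover across replicas. It is an illustration under stated assumptions, not a method taken from any of the books cited: the names NodeDown, call_with_failover, healthy, and broken are hypothetical, and each "node" is simply a function standing in for a machine.

```python
class NodeDown(Exception):
    """Raised by a hypothetical node that cannot serve the request."""


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    last_error = None
    for node in replicas:
        try:
            return node(request)
        except NodeDown as exc:
            last_error = exc  # remember the failure and try the next replica
    # Only reached when every replica failed (or the list was empty).
    raise NodeDown(f"all {len(replicas)} replicas failed") from last_error


def healthy(request):
    """Stand-in for a node that is up."""
    return f"ok: {request}"


def broken(request):
    """Stand-in for a node that has failed."""
    raise NodeDown("node unreachable")


if __name__ == "__main__":
    # One failed node does not take the whole request down.
    print(call_with_failover([broken, healthy], "ping"))
```

With a single machine the list of replicas has one entry, so one failure is a total failure. With several, the caller sees an error only when all of them are down.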
In Designing Distributed Systems, Burns notes that a distributed system can handle tasks efficiently because workloads and requests are broken into pieces and spread over multiple computers. This work is completed in parallel, and the results are returned and compiled back at a central location.
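The split, process in parallel, and combine flow described above can be sketched in a few lines of Python. This is a single-machine approximation that uses a thread pool in place of separate computers. The function names split, worker_sum, and scatter_gather are hypothetical, and summing numbers stands in for whatever real work each machine would do.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def worker_sum(chunk):
    """Stand-in for the work one machine would do on its piece."""
    return sum(chunk)


def scatter_gather(items, workers=4):
    """Scatter chunks to workers in parallel, then gather and combine results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_sum, chunks))
    return sum(partials)  # the "central location" that compiles the results


if __name__ == "__main__":
    print(scatter_gather(list(range(1, 101))))
```

In a real distributed system the workers would be separate machines reached over a network, which is where the scheduling and latency challenges discussed later come in.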
The challenges of distributed systems
While the benefits of creating distributed systems can be great for scaling and reliability, distributed systems also introduce complexity when it comes to design, construction, and debugging. Presently, most distributed systems are one-off bespoke solutions, writes Burns in Designing Distributed Systems, making them difficult to troubleshoot when problems do arise.
Here are three of the most common challenges presented by distributed systems.
Because the workloads and jobs in a distributed system do not happen sequentially, there must be prioritization, note Carson and Suchter in Effective Multi-Tenant Distributed Systems:
One of the primary challenges in a distributed system is in scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both.
Take Amazon, for example. Amazon technology teams need to understand which aspects of the online store need to be called upon first to create a smooth user experience. Should the search bar be called before the navigation bar? Think of the many ways, both small and large, that Amazon makes online shopping as useful as possible for its users.
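As a rough illustration of the scheduling decision, the following Python sketch picks jobs by priority until a finite capacity is used up. It is a deliberately naive greedy scheduler, not how any production scheduler works. The schedule function, the job names, and the cost units are all hypothetical.

```python
import heapq


def schedule(jobs, capacity):
    """Pick jobs by priority until the cluster's capacity is used up.

    jobs: list of (priority, name, cost) tuples; a lower priority number
    means the job is considered first.
    Returns (scheduled_names, deferred_names).
    """
    heap = list(jobs)
    heapq.heapify(heap)
    scheduled, deferred = [], []
    remaining = capacity
    while heap:
        _priority, name, cost = heapq.heappop(heap)
        if cost <= remaining:
            scheduled.append(name)
            remaining -= cost
        else:
            deferred.append(name)  # does not fit right now; run it later
    return scheduled, deferred


if __name__ == "__main__":
    jobs = [(2, "recommendations", 3), (1, "search", 2), (3, "analytics", 4)]
    print(schedule(jobs, capacity=5))
```

Even this toy version shows the limitation Carson and Suchter describe: with a capacity of 4, only the search job fits, and two units of capacity sit idle while the other jobs wait.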
With such a complex interchange between hardware computing, software calls, and communication between those pieces over networks, latency can become a problem for users.
“The more widely distributed your system, the more latency between the constituents of your system becomes an issue,” says Newman. “As the volume of calls over the networks increases, the more you’ll start to see transient partitions and potentially have to deal with them.”
Over time, this can lead to technology teams needing to make tradeoffs around availability, consistency, and latency, Newman says.
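One common way to deal with transient network failures is to retry the call with an increasing delay between attempts. The Python sketch below shows that pattern under simplifying assumptions: TransientError and call_with_retries are hypothetical names, and the sleep parameter exists only so the delay can be replaced in tests.

```python
import time


class TransientError(Exception):
    """Hypothetical error for a call that failed but may succeed if retried."""


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # back off before the next try
```

The tradeoff Newman mentions is visible here: each retry makes the call more likely to succeed eventually, but the caller waits longer for an answer.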
Performance monitoring and observability
Failure is inevitable when it comes to distributed systems, says Nora Jones. The key is how a technology team plans for and manages failure so that a customer hardly notices it. When distributed systems become complex, observability into the technology stack to understand those failures is an enormous challenge.
Carson and Suchter illustrate this challenge in Effective Multi-Tenant Distributed Systems:
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.
There are several approaches companies can use to detect those failure points, such as distributed tracing, chaos engineering, incident reviews, and understanding expectations of upstream and downstream dependencies. “There’s a lot of different tactics to achieve high quality and robustness, and they all fit into the category of having as much insight into the system as possible,” Jones says.
Ready to go deeper into distributed systems? Check out these recommended resources from O’Reilly’s editors.
Distributed Systems Observability — Cindy Sridharan provides an overview of monitoring challenges and trade-offs that will help you choose the best observability strategy for your distributed system.
Designing Distributed Systems — Brendan Burns demonstrates how you can adapt existing software design patterns for designing and building reliable distributed applications.
The Distributed Systems Video Collection — This 12-video collection dives into best practices and the future of distributed systems.
Effective Multi-Tenant Distributed Systems — Chad Carson and Sean Suchter outline the performance challenges of running multi-tenant distributed computing environments, especially within a Hadoop context.
Distributed Systems in One Lesson — Using a series of examples taken from a fictional coffee shop business, Tim Berglund helps you explore five key areas of distributed systems.
Chaos Engineering — This report introduces you to Chaos Engineering, a method of experimenting on infrastructure that lets you expose weaknesses before they become problems.
Designing Data-Intensive Applications — Martin Kleppmann examines the pros and cons of various technologies for processing and storing data.