“What is surprising is not that there are so many accidents. It is that there are so few. The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all.”—Richard Cook
In September 2007, Jean Bookout, 76, was driving her Toyota Camry down an unfamiliar road in Oklahoma, with her friend Barbara Schwarz seated next to her on the passenger side. Suddenly, the Camry began to accelerate on its own. Bookout tried hitting the brakes and applying the emergency brake, but the car continued to accelerate. The car eventually collided with an embankment, injuring Bookout and killing Schwarz. In a subsequent legal case, lawyers for Toyota pointed to the most common of culprits in these types of accidents: human error. “Sometimes people make mistakes while driving their cars,” one of the lawyers claimed. Bookout was older, the road was unfamiliar; these tragic things happen.
However, that product liability case against Toyota, recently concluded, turned up a very different cause: a stack overflow error in Toyota’s software for the Camry. This is noteworthy for two reasons: first, the oft-cited culprit in accidents—human error—proved not to be the cause (a problematic premise in its own right), and second, it demonstrates how we have definitively crossed a threshold, from software failures causing minor annoyances or (potentially large) corporate revenue losses into the realm of human safety.
It might be easy to dismiss this case as something minor: a fairly vanilla software bug that (so far) appears to be contained to a specific car model. But the extrapolation is far more interesting. Consider the self-driving car, development for which is well underway already. We take out the purported culprit for so many accidents, human error, and the premise is that a self-driving car is, in many respects, safer than a traditional car. But what happens if a failure that’s completely out of the car’s control occurs? What if the data feed that’s helping the car to recognize stop lights fails? What if Google Maps tells it to do something stupid that turns out to be dangerous?
We have reached a point in software development where we can no longer understand, see, or control all the component parts, both technical and social/organizational—they are increasingly complex and distributed. The business of software itself has become a distributed, complex system. How do we develop and manage systems that are too large to understand, too complex to control, and that fail in unpredictable ways?
Distributed systems once were the territory of computer science Ph.D.s and software architects tucked off in a corner somewhere. That’s no longer the case. Just because you write code on a laptop and don’t have to care about message passing and locking doesn’t mean you don’t have to worry about distributed systems. How many API calls to external services are you making? Is your code going to end up on desktop sites and mobile devices—do you even know all the possible devices? What do you know now about the network constraints that may be present when your app is actually run? Do you know what your bottlenecks will be at a certain level of scale?
One thing we know from classic distributed computing theory is that distributed systems fail more often, and the failures tend to be partial in nature. Such failures are not just harder to diagnose and predict; they’re often not reproducible—a given third-party data feed goes down, or you get screwed by a router in a town you’ve never even heard of before. You’re always fighting the intermittent failure, so is this a losing battle?
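One defensive pattern for those intermittent, partial failures is to bound every call to an external dependency with a retry budget and backoff, so one flaky feed doesn’t stall the whole request. A minimal sketch in Python—the `flaky_feed` dependency and `call_with_retries` helper are hypothetical, purely for illustration:

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.01):
    """Call an unreliable external dependency, retrying with
    exponential backoff. This doesn't prevent partial failure;
    it just contains it behind a bounded retry budget."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure
            # Back off exponentially, with jitter so many callers
            # don't retry in lockstep and hammer the dependency.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Simulated flaky data feed: fails twice, then succeeds.
calls = {"n": 0}
def flaky_feed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("feed unavailable")
    return "payload"

print(call_with_retries(flaky_feed))  # prints "payload" on the third attempt
```

The point isn’t the ten lines of code; it’s the mindset shift of treating every remote call as something that can, and will, fail some of the time.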
The solution to grappling with complex distributed systems is not simply more testing, or Agile processes. It’s not DevOps, or continuous delivery. No one single thing or approach could prevent something like the Toyota incident from happening again. In fact, it’s almost a given that something like that will happen again. The answer is to embrace that failures of an unthinkable variety are possible—a vast sea of unknown unknowns—and to change how we think about the systems we are building, not to mention the systems within which we already operate.
Think globally, develop locally
Okay, so anyone who writes or deploys software needs to think more like a distributed systems engineer. But what does that even mean? In reality, it boils down to moving past a single-computer mode of thinking. Until very recently, we’ve been able to rely on a computer being a relatively deterministic thing. You write code that runs on one machine, you can make assumptions about what, say, the memory lookup is. But nothing really runs on one computer any more—the cloud is the computer now. It’s akin to a living system, something that is constantly changing, especially as companies move toward continuous delivery as the new normal.
So, you have to start by assuming the system in which your software runs will fail. Then you need hypotheses about why and how, and ways to collect data on those hypotheses. This isn’t just saying “we need more testing,” however. The traditional nature of testing presumes you can delineate all the cases that require testing, which is fundamentally impossible in distributed systems. (That’s not to say that testing isn’t important, but it isn’t a panacea, either.) When you’re in a distributed environment and most of the failure modes are things you can’t predict in advance and can’t test for, monitoring is the only way to understand your application’s behavior.
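As a sketch of what collecting data on those hypotheses can look like in code, here is a hypothetical in-process instrumentation decorator; a real system would ship these measurements to a dedicated monitoring backend rather than accumulate them in a Python dict:

```python
import time
from collections import defaultdict

# Hypothetical in-process metrics store, standing in for an
# external monitoring system (StatsD, Graphite, and the like).
metrics = defaultdict(list)

def timed(name):
    """Record latency and error counts for a code path, so its
    behavior in production is observed rather than assumed."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics[name + ".errors"].append(1)
                raise
            finally:
                metrics[name + ".latency"].append(time.perf_counter() - start)
        return inner
    return wrap

@timed("lookup")
def lookup(key):
    return {"a": 1}.get(key)

lookup("a")
lookup("b")
print(len(metrics["lookup.latency"]))  # prints 2: both calls were recorded
```

No amount of pre-release testing would have told you how `lookup` behaves under your real traffic; the recorded distributions will.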
Data are the lingua franca of distributed systems
If we take the living-organism-as-complex-system metaphor a bit further, it’s one thing to diagnose what caused a stroke after the fact; it’s quite another to catch one as it’s happening. Sure, you can look at the data retrospectively and see the signs were there, but what you want is an early warning system—a way to see the failure as it’s starting and intervene as quickly as possible. Digging through averaged historical time series data only tells you what went wrong, that one time. And in dealing with distributed systems, you’ve got plenty more to worry about than just pinging a server to see if it’s up. There’s been an explosion in tools and technologies around measurement and monitoring, and I’ll avoid getting into the weeds on that here. What matters is twofold: developers need to become intimately familiar with their application and system data (starting with why histograms are generally preferable to averages), and they can no longer treat monitoring as purely the domain of the embattled system administrator.
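To see why histograms and percentiles beat averages, consider a hypothetical set of request latencies with a slow tail. The mean looks healthy while the upper percentiles reveal the problem:

```python
# Simulated request latencies in ms: 98 fast requests, 2 very slow ones.
latencies = [10] * 98 + [900, 1200]

mean = sum(latencies) / len(latencies)

def percentile(data, p):
    """Nearest-rank percentile of a dataset."""
    s = sorted(data)
    k = max(0, round(p / 100 * len(s)) - 1)
    return s[k]

print(f"mean: {mean} ms")                       # 30.8 ms -- looks fine
print(f"p50:  {percentile(latencies, 50)} ms")  # 10 ms
print(f"p99:  {percentile(latencies, 99)} ms")  # 900 ms -- the tail the mean hides
```

Two percent of your users are waiting close to a second, and the average would never tell you so; a histogram of the same data makes the bimodal shape impossible to miss.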
Humans in the machine
There are no complex software systems without people. Any discussion of distributed systems and managing complexity ultimately must acknowledge the roles people play in the systems we design and run. Humans are an integral part of the complex systems we create, and we are largely responsible for both their variability and their resilience (or lack thereof). As designers, builders, and operators of complex systems, we are influenced by a risk-averse culture, whether we know it or not. In trying to avoid failures (in processes, products, or large systems), we have primarily leaned toward exhaustive requirements and creating tight couplings in order to have “control,” but this often leads to brittle systems that are in fact more prone to break or fail.
And when they do fail, we seek blame. We ruthlessly hunt down the so-called “cause” of the failure—a process that is often, in reality, more about assuaging psychological guilt and unease than about uncovering why things really happened the way they did and avoiding the same outcome in the future. Such activities typically result in more controls, engendering increased brittleness in the system. The reality is that most large failures are the result of a string of micro-failures leading up to the final event. There is no root cause. We’d do better to stop looking for one, though doing so means fighting a steep uphill battle against cultural expectations and strong, deeply ingrained psychological instincts.
The processes and methodologies that worked adequately in the ’80s, but were already crumbling in the ’90s, have completely collapsed. We’re now exploring new territory, new models for building, deploying, and maintaining software—and, indeed, organizations themselves. We will continue to develop these topics in future Radar posts, and, of course, at our Velocity conferences in Santa Clara, Beijing, New York, and Barcelona.
Photo by Mark Skipper, used under a Creative Commons license.