Chapter 1. What Is Resiliency?
I get knocked down
But I get up again
You are never going to keep me down

Tubthumping, Chumbawamba
This chapter covers the fundamental concepts of resiliency. I’ll define what resiliency is, and discuss the challenges of resiliency in the context of distributed systems. I will also look beyond just the technology side of things to show how behaviors, culture, organization and societal context can impact our ability to deliver resiliency.
What Is Resilience?
Resilience is the ability of a system to withstand problems, recover from issues that impact the system, and continue to evolve in order to maintain resilience even as the wider context of the system changes around it. A resilient system is one that is predictable, that can be relied upon - and that, if it does get knocked down, can get back up again.
Resilience is a desirable characteristic. Whether applied to a person, a mechanism, or a digital system, when have you ever thought “I don’t want this to be resilient”?
As you’ll see throughout this chapter, you can approach the concept of resiliency from a number of different angles. Fundamentally, this book is about helping you to deliver systems. You can’t just look at software or hardware in isolation - you need to look at the system as a whole. To get you started in terms of thinking more broadly about systems, let’s take a look at resiliency from the twin viewpoints of technology and society.
Technology
From a technology viewpoint, you can look at resiliency in terms of making sure that your program doesn’t crash - ensuring that it can handle errors gracefully, rather than falling over in a blaze of glory (and stack traces), leaving your users at best bewildered and at worst in danger. It can also encompass your hardware. Can you tolerate a single machine unexpectedly powering off? Or should you invest in servers with redundant power supplies and networking to make them less likely to fail?
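On the software side, here is a minimal sketch in Python of what “handling errors gracefully” can look like in practice. The departure-board scenario, the URL, and the fallback message are all hypothetical; the shape of the code is the point - the failure of a downstream dependency is caught and turned into something useful for the user, rather than a stack trace.

```python
# A minimal sketch of graceful error handling, using only the standard library.
# The URL and the fallback message are hypothetical examples.
import socket
import urllib.error
import urllib.request


def fetch_departure_board(url: str) -> str:
    """Return the text to show on an information display, degrading gracefully."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout):
        # The backend is unreachable or too slow. Show something helpful
        # rather than crashing and leaving users staring at a stack trace.
        return "Live departure information is temporarily unavailable."


if __name__ == "__main__":
    print(fetch_departure_board("http://example.com/departures"))
```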
With the advent of the cloud, things have shifted. Aspects around managing individual machines may no longer be under your direct control; instead, you’ll have to consider which of a myriad of cloud products will deliver what you need.
Later in this chapter I’ll show you the specific challenges that distributed systems introduce, which force you to take concepts like timeouts and retries into consideration - topics I will spend a lot of time on in this book. Before that though, you need to expand your horizons somewhat. What about the people who actually build, operate, and use the systems you create?
Social
As a software developer, it can be tempting to focus purely on the act of coding, to the exclusion of other “noise”. Dedicating your attention to algorithms, data structures, network protocols, or other technical implementation details takes focus, commitment, and time. Unfortunately, it takes only a cursory examination of how things fail to realize that this view is not only naive, it can even be dangerous.
For example, an environment where the act of raising safety concerns is discouraged or even punished will lead to issues that could have been caught before impacting users. The country or industry sector you operate in might require that systems are built in certain ways to comply with local legislation or regulation.
Different types of users will have different expectations about the systems they use. The impact of failures can also be starkly different - an information display in a shopping mall crashing and serving up a stack trace isn’t great, but it’s not as bad as a self-driving car accelerating when it should brake.
This of course also deals very neatly with the refrain that you should “keep politics out of technology”. Once you understand that your systems are influenced by the people who build and use the software, the culture they operate in, and the wider societal context, you see that this statement doesn’t really make much sense. For example, I worked with a Nordic government agency that was basing its technology on Kubernetes hosted on Azure. They were avoiding the use of Azure-specific technologies, which created some challenges, as they ended up having to do more work themselves. But they were aware that the political winds could change, and that an earlier decision to allow them to use a public cloud vendor could be reversed by a change in government policy.
It can be difficult to get a grasp on what societal aspects are in play when considering resiliency, especially if you have spent most of your career focusing on technical concerns. Later in this chapter, I will introduce a couple of models that can help provide some structure around these ideas and help you understand where you need to focus your time if you are to deliver truly resilient systems.
Why Resilience Matters
The majority of people in the world have their own computer, in the form of a smartphone1, and many have more than one, in the form of smartwatches, smart TVs, tablets, and laptops. As a result, we are surrounded by software-based systems, and they permeate both our professional and private lives in ways that would have been difficult to predict in the previous millennium.
Software is a key part of the way most of us live our lives, and pivotal to the work many of us do. During the COVID pandemic, the resulting global lockdowns brought a renewed focus on digital services as they became even more vital.
As a result, the impact of software failures is all the more keenly felt. Towards the end of 2023, HSBC’s online banking system in the UK was offline for over 24 hours2, leaving tens of thousands of people unable to pay bills. Also in 2023, both the content delivery network Cloudflare and Workday3 had major service interruptions related to data center issues, impacting thousands of companies around the world. More recently, in 2024, McDonald’s suffered a major outage4 impacting multiple countries around the world, stopping people from being able to place orders.
During the past decade, the systems we create have become more and more distributed. As a result of the shift from mainframes to modern server infrastructure, the solutions delivered are increasingly running on ever more complex computer topologies, creating new challenges.
Given that the genie can’t be put back into the bottle, and that digital, software-based systems are here to stay, the resiliency of these systems has never been more important.
What Is A Distributed System?
The growing importance of software in our lives has coincided with the increasing distribution and scale of the systems you create. As you will see throughout this book, a distributed system can help improve the resiliency of your systems, but at the same time it introduces new sources of complexity that need to be taken into consideration. Before I get into that, however, let’s look at what a distributed system actually is.
A distributed system is one where two or more computers5 talk to each other over a network. Distributed systems come in many shapes and sizes. Even a simple web application, as shown in Figure 1-1, is a distributed system: the code running the web backend runs on one computer, the database on a different computer, and the browsers viewing the web frontend on yet another.
But distributed systems can get more complex than that. Consider a microservice architecture, as shown in Figure 1-2. The arrows between the microservices indicate logical dependencies, which represent some form of network-based communication6. An instance of a microservice would live on its own machine. Now you have multiple different services, each running on a different computer, communicating with each other over networks. And this simplified diagram omits the fact that each of these microservices may well be talking to its own database, which will likely be on yet another machine, requiring another network hop. Oh, and each of these microservices might also be exposing its own frontend as well.
To add further complexity, the diagrams shown so far are simplifications of what you might actually end up running in production. Many of the things shown on the previous two diagrams will often have multiple copies to provide redundancy or improve the ability to handle load, which can further complicate things.
Warning
In general, the more computers you have talking to each other in a distributed system, the more complicated things get.
How Distributed Systems Can Fail (Us)
But really, this is all a rather dry description of what a distributed system is. It doesn’t really communicate what it’s like to own and operate one. So I turn yet again to my favorite definition:
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
Leslie Lamport
This quote gets to the heart of the interesting ways in which distributed systems can fail. I’ve had machines vanish on me (in one case caused by the wrong servers being packed up and shipped from the east coast of the US to the west). Network cables can be severed, by accident or on purpose7. DNS misconfiguration can stop whole swathes of websites from working, such as the recent outage of the .ru top level domain8.
Distributed systems may well be a necessity, but their nature as an interconnected system of independent machines often just exposes them to more sources of failure.
Essentially though, the vast array of failure modes that you can encounter with a distributed system - and the above list just scratches the surface - comes down to two fundamental rules that apply to any distributed system.
Two Golden Rules Of Distributed Systems
I’ve found that there really are two fundamental rules that any distributed system is governed by. They are:
- You cannot beam information instantaneously between two points
- Sometimes you can’t reach the thing you want to talk to
These rules will influence so much of how you think about building more resilient systems, and you’ll encounter them in different forms throughout the book. Before that, however, let’s briefly explore these two rules to better understand the problems they create.
You Cannot Beam Information Instantaneously Between Two Points
It takes time for information to move around a distributed system. Typically, as a developer, you’ll use some sort of abstraction to transmit this information. Perhaps you’re using an HTTP client library, an RPC framework, or a message broker. Abstractions in computing are incredibly useful - they allow us to hide detail and focus our minds at a higher level. But by their nature abstractions hide things, so you can end up forgetting what is happening behind the scenes.
When you send a piece of data from a program on one computer to another, a lot of things need to happen. The data to be sent needs to be converted into a form that can be transmitted, then it has to be broken down into a series of packets which are sent to the destination. When the data arrives, the receiving computer has to reassemble the packets, then convert the result into a format that the program on the receiving computer can understand.
All of this takes time. How long it takes can depend on a multitude of factors - the type of network being used, the distance between machines, the size of the payload being sent, the speed of the computers at each end of the interaction. You may have the ability to change some of these factors, but likely not all.
It’s possible that you might see the data you are sending arrive at its destination so quickly as to appear to be instantaneous. This is often the case for very small amounts of data. However, instant transmission of data between two computers is not possible9.
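If you want to see this for yourself, a few lines of Python are enough. The sketch below simply times how long it takes to open a TCP connection to a host - the handshake alone requires a network round trip, so the measured time is a rough lower bound on any exchange of data with that machine. The host name is just an example; point it at one of your own services and you’ll see that the numbers are never zero, and never quite the same twice.

```python
# A small sketch that times TCP connection establishment to illustrate that
# moving information between two computers always takes a measurable amount
# of time. The host used here is just an example.
import socket
import time


def measure_round_trips(host: str, port: int = 443, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        start = time.perf_counter()
        # Opening the connection involves the TCP handshake, i.e. at least
        # one full round trip across the network.
        with socket.create_connection((host, port), timeout=5):
            elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"attempt {attempt}: {elapsed_ms:.1f} ms")


if __name__ == "__main__":
    measure_round_trips("example.com")
```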
What problems does this cause? Well, it can lead to you seeing inconsistent state. You might send the same data to two different computers at the same time, but that doesn’t mean the two destinations will receive it at the same time. In the worst cases, this inconsistency can surface to the end user, causing understandable confusion - but if you are aware of the possibility, there may be ways for you to resolve the inconsistency when it occurs.
In the context of distributed system design in general, the fact that you have to account for the time taken for data to move around your system will cause you a number of issues. When looked at specifically in the context of resiliency, however, it’s the second of our two golden rules that gives us the most headaches.
Sometimes You Can’t Reach The Thing You Want To Talk To
As a software developer, you increasingly work at higher levels of abstraction. You deploy applications to the cloud, configure your infrastructure using code, and often delegate work to dedicated cloud services that handle the detail of specific tasks for you. As a result, it can be all too easy to forget that under all of these layers of abstraction sit real physical computers, network cables, switches, power supplies, cooling, and everything else. Not to mention, of course, the fact that humans play a critical role in delivering this infrastructure, and you can’t expect them to be infallible.
Our computers are limited by the physical properties of the universe. That’s why your data can’t magically be transmitted instantaneously between two points. But this also means that you have to accept a host of scenarios in which computers you are trying to talk to can’t be reached.
First, the commonplace issues. You want to send data to a program on a computer, but that program has crashed and is not responding. Or perhaps the computer you are trying to reach has suffered some sort of hardware failure. Now, you might be able to reduce the chances of these things happening, but you can’t reduce that chance to zero. I’ll talk more about this in [Link to Come].
Sometimes, the computer you want to talk to is there, but you just can’t reach it. Perhaps someone has unplugged the wrong network cable. Or, in one real-world scenario that happened to me, perhaps rabbits have made their home in the cable ducting between buildings, and developed a taste for network cables. Are these scenarios under your control? You might have luck in reducing the incidence of network cables being accidentally unplugged, but good luck controlling rabbits…
But what about issues that might be even less under your control? What happens if a leak in your data center takes out the cooling systems, meaning you have to shut down a number of racks to avoid overheating? Or maybe someone drops an anchor out at sea and accidentally severs a major subsea network connection10, temporarily limiting access to one of your data centers?
The people who build the hardware you use and operate the data centers your code runs in will do their best to reduce the chances of the failures I’ve outlined above. When these failures do happen, hopefully they’ll have mechanisms in place so that the problems are hidden from you. But this cannot be guaranteed.
Sometimes, the thing you want to talk to isn’t there. This is a fact of life, and a key part of building a resilient distributed system is recognizing this fact and developing strategies to handle this problem when it occurs.
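As a first taste of what such a strategy can look like in code, here is a sketch, again in Python using only the standard library, of a caller that assumes the thing it wants to talk to may not be there. The stock service and its URL are made up for illustration; what matters is that unreachability and slowness are treated as expected outcomes with a defined result, rather than exceptional events left to blow up elsewhere.

```python
# A sketch of treating "the thing I want to talk to isn't there" as a normal,
# expected outcome. The stock service URL is hypothetical.
import socket
import urllib.error
import urllib.request
from typing import Optional


def get_stock_level(item_id: str) -> Optional[int]:
    url = f"http://stock-service.internal/items/{item_id}"  # hypothetical service
    try:
        with urllib.request.urlopen(url, timeout=1) as response:
            return int(response.read())
    except urllib.error.URLError as error:
        # Covers DNS failures, connection refusals, and unreachable hosts -
        # the service, or the network path to it, isn't there right now.
        print(f"Stock service unreachable: {error.reason}")
    except socket.timeout:
        # The service may be there, but it didn't answer within our budget.
        print("Stock service did not respond in time")
    # Returning None forces the caller to decide what to do without the data.
    return None
```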
So, I’ve explored what a distributed system is, and introduced the two golden rules. Now let’s expand things somewhat, and look at the most critical part of our system - the components which build, operate and use the system itself - humans.
The Human Factor
In 1979, there was a partial meltdown of a nuclear reactor at Three Mile Island in Pennsylvania, USA. One of the contributing factors to the incident was that alarms coming from the systems were confusing, resulting in the human operators not taking action which could have prevented the issue from escalating.
In 1986, a failure in one of the solid rocket boosters of the space shuttle Challenger caused the craft to break apart shortly after launch, killing all seven crew members. It was later determined that the warning signs had been there for all to see. Previous shuttle launches had shown damage to the O-ring seals, something that wasn’t supposed to happen. Despite it being clear that no damage to this component was acceptable, launches continued. Nothing bad had happened when this damage was seen before, so surely the launches could continue?
In both cases, humans were critical to how these incidents played out. In the case of Three Mile Island, the alarm systems left the operators confused and unable to carry out the right mitigating actions. In the case of Challenger, what occurred became known as “normalization of deviance”11: when a system behaves in a way that deviates from the norm, but nothing bad happens initially, we humans can easily be conditioned to ignore the deviance. The red light flashes, nothing explodes, so the light gets ignored - until it’s too late.
I will look at both of these incidents again later in the book, but they both show firsthand that, when considering resiliency, ignoring human involvement is naive at best and dangerous at worst.
The field of study that looks at how to build systems to make them both easier and safer to use is known as either Human Factors or Ergonomics. If you’ve heard the term ergonomics before, there is a good chance you’ve heard it in the context of office equipment, where the focus is on the comfort and health of the office worker. That is part of ergonomics, but the field covers much more than that.
Broadly speaking, the terms human factors and ergonomics can be used interchangeably - for the rest of the book I’ll talk about human factors.
So, for a system to achieve the levels of resilience you want, you have to understand how to address human factors - not just to make working with these systems easier and more pleasant, but also to reduce the likelihood and impact of human error.
In fact, you shouldn’t really think of humans as being separate from the system at all. In our modern distributed systems, the human operators and users are just as much a part of the system as the computers are. Seeing the humans and the technology as two parts of the same system is critical.
The Sociotechnical System
Software created by and for a single individual is vanishingly rare. Even in the case where an individual is lauded as being “the creator” of a piece of important software, the reality is often very different. Take the example of the Linux kernel, initially created by Linus Torvalds in 1991. Whilst Linus wrote and released the first version, within less than a year contributions were being made from all over the world, and thousands of people are involved now.
People build software for other people. They operate within a set of processes (formal or informal) using software and hardware, and the work they do is directly impacted by the culture they work in. You can take any complex system and make the same arguments - it’s not just the software industry. This applies equally to car manufacturing, extracting minerals, or constructing a building.
The term “Sociotechnical” was coined to describe any complex system where people and technology work together. In the software development industry we tend to view “technology” through the lens of software and hardware, but more broadly the term describes the structures and processes by which work is done.
When looking at our distributed systems as sociotechnical systems, you have to accept that there is interplay between the technical and the societal aspects. They influence one another. If you are trying to understand why a system failed, or to make your system more resilient to avoid a future incident, it therefore becomes incumbent on you to look at both the social and the technical elements.
Fundamentally, our software systems share a lot in common with other complex systems which humanity has been constructing for decades if not centuries12.
The Hexagonal Model For Sociotechnical Systems
A model I’ve found useful when trying to grasp the realities of the sociotechnical system is the hexagonal model13 shown in Figure 1-3. Whilst this model is designed to help understand any generic sociotechnical system14, this book relates specifically to building and running a distributed system, so I have interpreted the model through that lens.
The hexagonal model breaks down the idea of a sociotechnical system into six discrete concepts:
- People: The people doing the work and using the software (developers and users).
- Culture: The social environment in which the work is done. For example, some cultures might be more risk averse than others, or might ignore the input from people in marginalized groups.
- Goals: The incentives for the people creating, operating, and using the system.
- Technology: The software being written and the supporting tools being used.
- Infrastructure: Generically, this relates to the physical infrastructure at play. It could extend to buildings and roads, for example, but in this context it will more likely cover computing hardware, networks, power generation, and so on.
- Processes: The mechanisms put in place which guide how the work is done - this could include your preferred style of software development, for example Scrum or Six Sigma.
People, culture, and goals are the societal aspects, whereas technology, infrastructure, and processes are on the technical side of things. All of this sits within a wider ecosystem of influences over which you will have limited or zero control:
- Financial/Economic Circumstances: Your company could be flush with cash from a new investor, or could be operating against the backdrop of a national recession. Or perhaps a global pandemic?
- Regulatory Frameworks: Rules and regulations that might apply to the industry you work in - for example, adhering to the Payment Card Industry Data Security Standard (PCI DSS) when handling credit cards, or to the General Data Protection Regulation (GDPR) for companies operating in the EU.
- Stakeholders: At a minimum, the people the system is being built for, but this could include a wider range of interested parties.
A change to any one of these elements will have a knock-on impact elsewhere. For example, a culture where people are empowered to make local decisions might result in a proliferation of tools being used (technology). Targets which aim for low levels of bugs may result in the people building the software not wanting to report issues when they arise (goals).
I’ll explore this interplay throughout the book. For example, I’ll look at culture in terms of the importance of establishing a blame-free environment to encourage information sharing, and at how moving away from rigid processes can empower people to build more resilient systems. I will also take you on a deep dive into the technical side of things, looking at failure modes for our hardware (infrastructure), and at stability patterns for our code to handle situations like timeouts or services being unreachable (technology).
The Four Concepts Of Resilience
So far I’ve looked at a model to help you understand the broader nature of the systems you build. It’s not just about the hardware and software - it’s also about the people and the environment in which you create and run your distributed systems.
When you look specifically at resiliency, you need to look beyond the obvious challenges that come to mind. Yes, it can be important to handle problems like a machine being unavailable, or a network being disconnected. But that is an overly narrow view of how to improve resilience.
Luckily, a model exists to help look at these broader aspects of resiliency. David Woods, in his paper “Four concepts for resilience and the implications for the future of resilience engineering”15, outlined a model for resilience that focuses on four core concepts. Briefly, these concepts are:
- Robustness: The ability to absorb expected issues, such as handling a machine crashing.
- Rebound: Recovering after a traumatic event.
- Graceful Extensibility: How well you can handle the unexpected.
- Sustained Adaptability: Continual learning and transformation.
I will expand on each of these concepts next, looking at how they might apply in the context of building and operating a distributed system.
Robustness
Robustness is the concept whereby you put mitigations in place to deal with expected problems. From the technical viewpoint, there is a whole host of issues that you might expect: a machine can fail, a network connection can time out, a process might be unavailable. You can improve the robustness of your architecture in a number of ways to deal with these problems, such as automatically spinning up a replacement host, performing retries, or handling the failure of a given microservice in a graceful manner.
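As a small example of one of these mitigations, the sketch below shows a retry with exponential backoff and jitter, written in Python against a hypothetical operation; the tuning values are illustrative only. Retrying is one of the simplest robustness measures you can add, although, as I’ll discuss later in the book, it needs to be applied with care - retrying too aggressively can make a struggling service’s situation worse.

```python
# A sketch of retrying an operation with exponential backoff and jitter.
# The operation being retried, and the tuning values, are illustrative.
import random
import time


class StillFailingError(Exception):
    """Raised when all retry attempts have been exhausted."""


def call_with_retries(operation, max_attempts: int = 4, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            if attempt == max_attempts:
                raise StillFailingError("operation failed after retries") from error
            # Exponential backoff with jitter: wait longer after each failure,
            # and randomize the delay so many callers don't retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            print(f"attempt {attempt} failed, retrying in {delay:.2f}s")
            time.sleep(delay)


if __name__ == "__main__":
    # A hypothetical flaky operation used to exercise the retry logic.
    def flaky_operation():
        if random.random() < 0.7:
            raise ConnectionError("simulated network failure")
        return "success"

    print(call_with_retries(flaky_operation))
```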
Robustness goes beyond the technical though. It can apply to people. If you have a single person on call for your software, what happens if that person gets sick, or isn’t reachable at the time of an incident? Potential solutions to this problem might be to have a backup on-call person, or a well-documented playbook.
Woods’s definition of robustness requires prior knowledge - you are putting measures into place to deal with things you expect to go wrong. This knowledge could be based on foresight - you could draw on your understanding of the computer system you are building, its supporting services, and your people to consider what might go wrong. But robustness can also come from hindsight - you may improve the robustness of your system after something you didn’t expect happens. Perhaps you never considered the fact that your global filesystem could become unavailable, or perhaps you underestimated the impact of your customer service representatives not being available outside working hours.
One of the challenges of improving robustness is that as you increase the robustness of your application, you introduce more complexity into your system, which can itself be the source of new issues. Consider moving your microservice architecture to Kubernetes because you want to make it easier to run your microservice workloads. You may have improved some aspects of your application’s robustness as a result, but you’ve also introduced new potential pain points, such as the fact that you’ll likely need more infrastructure to run Kubernetes itself, or that extensive training will likely be needed to understand how to manage and use it. As such, any attempt to improve the robustness of an application has to be considered not just in terms of a simple cost/benefit analysis of the initial work being done, but also in terms of whether or not you are prepared for the more complex system you end up with.
A significant part of this book will be dedicated to putting into place various mitigations for the known issues that can occur with distributed systems.
Rebound
How well your system can recover - rebound - from disruption is a key part of building a resilient system. All too often I see people focusing their time and energy on trying to eliminate the possibility of an outage, only to be totally unprepared when an outage actually occurs. By all means do your best to protect against the bad things that you think might happen - improving your system’s robustness - but you also have to be honest and accept that as your system grows in scale and complexity, eliminating every potential problem becomes unsustainable.
You can improve your ability to rebound from an incident by putting processes in place in advance. For example:
- Having backups in place so you can recover in the aftermath of data loss (assuming your backups are tested, of course!)
- Writing and maintaining a playbook that you can run through in the wake of a system outage, to help resolve known issues
- Clearly defining roles and responsibilities for what happens when an incident occurs, so everyone knows who the point person will be and how the incident response process will work
Trying to think clearly about how to handle an outage while the outage is going on is going to be problematic due to the inherent stress and chaos of the situation. Having an agreed plan of action in place in anticipation of this sort of problem occurring can help you better rebound.
Graceful Extensibility
No plan survives first contact with the enemy.
Helmuth von Moltke (heavily paraphrased)
With rebound and robustness, you are primarily dealing with the expected. You are putting mechanisms in place to deal with problems that you can foresee. But what happens when you are surprised? If you aren’t prepared for surprise - prepared for the fact that your expected view of the world might be wrong - you end up with a brittle system. As you approach the limits of what you expect your system to be able to handle, things fall apart. How many companies were ready for the implications of a global pandemic, and the lockdowns that followed?
Flatter organizations - where responsibility is distributed across the organization, rather than held centrally - will often be better prepared to deal with surprise. When the unexpected occurs, if people are restricted in what they can do - if they have to adhere to a strict set of rules - their ability to deal with surprise will be critically curtailed.
Often, in a drive to optimize your system, you can, as an unfortunate side effect, increase its brittleness. Take automation as an example. Automation is fantastic - it allows you to do more with the people you have - but it can also be used to justify reducing the number of people you have, as more can be done with automation. That reduction in staff can be concerning, though. Automation can’t handle surprise - your ability to gracefully extend your system, to handle surprise, comes from having people in place with the right skills, experience, flexibility, and responsibility to handle these situations as they arise.
Sustained Adaptability
Sustained adaptability speaks to the ability of the system to continually evolve and improve. It requires us not to be complacent. As David Woods puts it in the aforementioned “Four concepts” paper:
“No matter how good we have done before, no matter how successful we’ve been, the future could be different, and we might not be well adapted. We might be precarious and fragile in the face of that new future.”
David Woods
That you haven’t yet suffered from a catastrophic outage doesn’t mean that it cannot happen. You need to challenge yourself to make sure that you are constantly adapting what you do as an organization to ensure future resiliency. Topics I’ll explore later in the book such as effective post-mortems and chaos engineering can be key tools in helping create a learning organization that can adapt as needed.
Sustained adaptability often requires a more holistic view of the system to see where changes need to be made. This is, paradoxically, where a drive towards smaller, autonomous teams with increased local, focused responsibility can end up with you losing sight of the bigger picture. As I’ll explore in [Link to Come], there is a balancing act between global and local optimization when it comes to organizational dynamics, and that balance isn’t static.
Creating a culture where people can share information freely, without fear of retribution, is vital to encourage learning in the wake of an incident. Having the bandwidth to really examine these surprises and extract the key learnings requires time, energy, and people - all things that will reduce the resources available to deliver features in the short term. Deciding to embrace sustained adaptability is partly about finding the balance point between short-term delivery and longer-term adaptability.
Working towards sustained adaptability means that you are looking to discover what you don’t know. This requires continuing investment, not one-off transactional activities - the term “sustained” is vital here. It’s about making sustained adaptability a core part of your organizational strategy and culture.
How Resilient Do You Need To Be?
The reality is that saying “I want my system to be resilient” is, on the face of it, both a perfectly sensible statement and a silly one. Why do I say silly? Well, who would say that they don’t want their system to be resilient? Can you honestly think of a situation where a lack of resiliency was desirable?
So really the question is likely not “Do you want resiliency?” but “How much resiliency do you want?”. Resiliency, as a quality attribute, is not binary.
Making your system more resilient comes at a cost. You might pay more for a better computer that has redundant power supplies. You might pay for more experienced people to be on call to react to a failure. You might invest in automating manual tasks to reduce the chance of human error. These might all be very sensible actions to take to improve the resiliency of your system, but they all cost.
That cost might be directly financial - a more expensive machine, a larger bill from your cloud provider. It could also be an opportunity cost - an engineer focusing on improving the resiliency of your system is one who isn’t delivering new features. Occasionally the cost can even be felt directly in terms of tradeoffs around usability - for example, you might decide to require users in the field to use corporate mobile devices that might not be as nice to use as their own phones, but that have capabilities which improve the resiliency of your system.
How much resiliency you want (or need) is therefore always a tradeoff - a tradeoff between the cost of improving resiliency and the likelihood and impact of an incident. Losing a customer’s order for an online shop isn’t great, but a medical device overdosing a patient is much worse16.
So the need for some degree of resiliency is understandable, but deciding how much needs careful thought. I’ll explore this in more detail in [Link to Come].
Summary
I covered a lot of ground in this chapter, content that I will expand on throughout the rest of the book. You looked at the nature of a distributed system and learned the two golden rules - that you can’t beam information instantaneously between two points, and that sometimes the thing you want to talk to isn’t there.
You were introduced to Woods’s four concepts for resiliency, namely:
- Robustness: The ability to absorb expected issues, such as handling a machine crashing.
- Rebound: Recovering after a traumatic event.
- Graceful Extensibility: How well you can handle the unexpected.
- Sustained Adaptability: Continual learning and transformation.
You also saw how the distributed systems you create are by their nature sociotechnical systems. You need an appreciation of the technology used to build the system, together with an understanding of the people who created it. I also showed how all of this is influenced from the outside by a variety of factors. As you’ve started to see already, understanding your systems in this broader context is key to unlocking ways to improve resiliency.
In the next chapter, you’ll start your journey into resiliency by focusing initially on the technical side of the sociotechnical system, as I introduce some fundamental technical concepts for system stability.
1 Around 4.6 billion as of 2023 according to Statista, “Number of smartphone users worldwide from 2013 to 2028”, https://www.statista.com/forecasts/1143723/smartphone-users-in-the-world.
2 https://www.bbc.co.uk/news/technology-67514068
3 https://www.thousandeyes.com/blog/internet-report-pulse-update-workday-cloudflare-outages
4 https://www.bloomberg.com/news/articles/2024-03-15/mcdonald-s-system-outage-affects-stores-across-asia-and-australia
5 The concept of a “computer” can get a bit fuzzy - I will look at things like containers and virtual machines later in the book.
6 Often microservice interactions are done at a level of abstraction where the network calls might be hidden from the developer - but the network calls are still there.
7 A recent example of subsea cables being damaged resulted in multiple countries in Central and West Africa losing internet access: https://www.theguardian.com/technology/2024/mar/14/much-of-west-and-central-africa-without-internet-after-undersea-cable-failures
8 In January 2024, the top level .ru domain suffered an outage, apparently due to a DNSSEC configuration issue: https://therecord.media/russia-top-level-domain-internet-outage-dnssec. It’s always DNS.
9 I’m aware that quantum entanglement might appear to achieve this. However, at the time of writing, we have yet to leverage quantum entanglement as a workable networking protocol.
10 This happens surprisingly often.
11 Vaughan, Diane, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA (Chicago: University of Chicago Press, 1996).
12 Arguably much longer than that - you could see the construction of the pyramids as a sociotechnical system
13 Readers of my previous books will know I love a hexagon
14 Davis, M. C., Challenger, R., Jayewardene, D. N. W., & Clegg, C. W. (2014). Advancing socio-technical systems thinking: A call for bravery. Applied Ergonomics, 45(2), 171–180. doi:10.1016/j.apergo.2013.02.009
15 Woods, David, “Four concepts for resilience and the implications for the future of resilience engineering,” Reliability Engineering & System Safety 141 (2015). doi:10.1016/j.ress.2015.03.018
16 I will explore a real world example of this in [Link to Come] when I share the example of the Therac-25 x-ray machine.