Chapter 1. Centralized Architecture Practices in a Decentralized World
“Traditional” approaches to software architecture have become increasingly ineffective in the face of our rapidly evolving software systems. In this opening chapter, I’ll describe how these problems originated with a series of fundamental changes—or revolutions—demanding further decentralization and increasing sensitivity to feedback. With these revolutions in mind, I’ll then take you through the problems caused by the reliance of traditional architectural approaches on predictability and control. I’ll conclude by setting out what an approach to architecture needs to incorporate, focusing on what is within our power to plan for, protect against, and respond to.
Let’s begin by considering the value of software architecture, both as a practice and as an end result.
Both the Practice and the End Result of Software Architecture Are Essential for Success
What is software architecture? For Grady Booch, “[Software] architecture represents the set of significant design decisions that shape the form and the function of a system, where significant is measured by cost of change.”1 Martin Fowler has a similar take that can be paraphrased as software architecture being those decisions that are both important and hard to change.2
I like these two definitions because not only do they encompass what software architecture is, they also highlight the greatest difficulty with the concept of “software architecture.” Booch’s definition leans toward the end result while Fowler’s leans more toward the practice. In fact, the term software architecture can refer to one or the other, or both at the same time. Rather confusingly, we tend to use the meanings interchangeably when we talk about the software we build. For any software system to be successful, both the end result and the practice are essential.
Many books have been written about that end result of software architecture and about the significant design decisions that shape the form and function of a system. New technologies are developed all the time that require changes to existing architectures or create entirely new ones. This is where patterns3 and reference architectures4 come into play, capturing proven ways to construct our systems.
You might assume that there would be an equal number of books about the practice of software architecture. Sadly not. This is because the way software architecture is practiced has changed very little over the years.
But the stakes are rising. Software architecture is becoming increasingly important as the complexity of our running software systems continues to rise. Our systems are larger, change more often, are more connected (both to things we own and things we don’t), are used by more people more of the time, and are expected to never fail.
Although it is widely accepted that the practice of architecture is important and that having bad architecture should be avoided, only the end result has seen innovation and improvement over the years. For future success, we need to be open to evolving our practice of software architecture too.
This book redresses this imbalance. It describes an approach to the practice of architecture that reliably achieves and sustains the architectural end results that our software needs, maybe even improving the future of software development.
Let’s begin by establishing the goals of good software architecture in our rapidly changing and unpredictable technological landscape. Systems with good architecture have:
- Individual parts that are coherent, cohesive, and aligned both to the domain and to business value
- Individual parts that are decoupled in the right way so that multiple independent teams can work on the overall system together and in parallel
- Half an eye on future evolutions, creating overall architectures that are sufficiently adaptable to change
Given these qualities, the definition of a suitable software architecture practice is trivial: it must deliver and maintain these “good architectures.”
When appropriate approaches to the practice of software architecture are done well, I have seen them have a massive positive impact. You can see and feel it in the production of the software and the end results. Building and running code becomes efficient, predictable, easier, even joyful. When executed poorly, software architecture practice becomes wasteful, unpredictable, and a struggle—in short, not much fun.
Let’s take this lens and use it to examine current software architecture practices, considering in particular how they cope with modern software architectures. (Spoiler alert: they don’t fare very well.)
What Are the Practices of Traditional Architecture?
As I’ve said, the standard practices of software architecture haven’t changed very much over the years; you’ll probably recognize them as soon as I describe them. They’re so ubiquitous that I am going to refer to them from now on as traditional. Although these practices fall under two extremes, which I’ll discuss shortly, they share one common aspect. Traditional software architecture practices concentrate the power to make decisions in the hands of a select group: the people called architects. Architects are responsible and accountable for all significant architectural decisions.
Remember this, as it’s crucial to this chapter’s discussion.
Now for the two extremes of traditional architecture practice. To be clear, neither of these stereotypes ever manifests fully in reality, but the fact that our discipline has names for them—which we all recognize—means that they exist conceptually for all of us.
Ivory Tower Architects
On the one end, you have the ivory tower approach to architecture practice, illustrated in Figure 1-1. In this approach, multiple teams are building software and trying to flow. When a need for an architectural decision arises, they must seek it from the architect in the distant ivory tower.
This approach tries to influence, even control, everything about both the parts and the whole. By this I mean it tries to have opinions on, and input into, individual software components that deliver specific features while also trying to look after all the components as they interact together. To achieve this, ivory tower practices look at everything from a variety of standpoints and perspectives, with a range of timescales in mind, and apply a series of constraints to keep the whole together, all the way to production.
The most significant failing of ivory tower architecture practices is that they overfocus on the whole at the expense of the individual parts and the teams that are trying to deliver and run them. This overfocus is how such practices cope with all the variability. For example, ivory tower approaches mandate that everyone stick to a given architectural pattern, even when it doesn’t fit in certain specific circumstances, because that pattern ought to make matters more predictable and thus—so the logic goes—easier to control.
Software architects practicing this ivory tower approach are consequently seen as above everything, all-knowing, and able to survey the whole scene. It is assumed that such architects have vast arrays of experience, traditionally manifesting as being superior in the organizational hierarchy—another reason they are viewed as “high up.” These preconceptions are manifested in the term we use to identify this approach. Ivory tower architect implies that their architecture approaches are too far removed from the code to have a sense of the grubby, day-to-day reality.
Hands-on Architects
But surely being “down in the trenches” is also important? It is, and it is represented on the other end of the architecture practice spectrum: the hands-on, cross-team approach to architecture illustrated in Figure 1-2. Again, multiple teams are building software and trying to flow. When the need for an architecture decision arises, they must engage a roaming hands-on architect who meets them in their context.
This approach reflects an architect’s important determination to retain and prove their ability to code. The practice sees the work of these architects as happening “at the code-face,” among the developers, moving from team to team, asking all the right questions and co-designing with team leads and senior devs. In direct contrast to the emphasis of the ivory tower approach, the hands-on approach prioritizes the experiences of individual teams and very much wants to make sure that the individual parts are deliverable.
Hands-on approaches require rushing around, spinning from team to team, never being anywhere quite long enough to do all the work that’s needed. As a result, the overall system suffers. The overfocus on individual teams is this extreme’s weakness because the “whole” is still of great importance, yet it is not treated as such. What is needed is an approach that can balance both “up-close” and “big picture” perspectives.
What’s Wrong with Both Traditional Approaches?
Although these generalizations are intentionally polemical, they both fail for the same reason: they attempt to control the uncontrollable and predict the unpredictable.
I’ve practiced ivory tower approaches, and hands-on approaches too. I’ve also been a development lead trying to partner with both types, so I’ve seen how things can fall apart in different ways and from many angles.
As an architect working in the ivory tower, I’ve striven to keep both the big picture and the details in check. I’ve been the one trying to articulate the importance of fundamental information security and regulatory requirements (which we had to conform to for boring but existential reasons—not being compliant with various regulations can do that to a business) while a cool new library or approach is proving far more exciting to development teams.
As a jobbing hands-on architect, I’ve loaded my brain with all the context and domain information I can. I’ve moved from team to team, struggling to context switch, to bring to mind the vast number of details of their current focus, their history, the specifics of their part of the business domain, their specific pipelines and runtime needs, while listening to what they tell me about a particularly weird traffic blip, while trying to make sure the overall API we expose publicly has some degree of consistency.
In my experience, all software architects—no matter how they go about their practice—are sincere, applying their unique mix of the traditional approaches (and more) with the best intentions. But in doing so, architects are trying to achieve complete control because, regardless of their approach, they are responsible and accountable for the entirety of the architecture. By trying to juggle all expectations perfectly, they lead themselves into trouble.
With ivory tower approaches, this means having review boards, sign-offs, dedicated architecture functions, and straitjacketing frameworks. And the hands-on alternative? Close pairing, moving from autonomous team to autonomous team, trying to do all the required design in all the places it is needed. Neither works well.
Why not? Because, by design, ivory tower and hands-on architects are at best a drag on teams, and at worst, they lead to bad architecture and bad software.
With either ivory tower or hands-on approaches, when my architecture work is good and I communicate it effectively, the downside is simply that I slow teams down because I become a bottleneck—and that’s the best outcome. With development teams expecting to move faster and faster, I can unintentionally block their flow, which means they sit idle, waiting for my input.
Frequently, the situation is worse still. As teams become increasingly independent and are able to deploy to production with greater frequency, they know far more about their parts of the system than I ever could as their architect. They know their requirements in more detail, they know their code better, they know their domain better, they know their pipelines and runtimes better, and they know their customers better. Consequently, I rely more and more on the teams to share the latest state of all this contextual information with me. The more they have to do this, the more it becomes impossible to think through everything required to play my role: that of representing the whole, the sum of all these independent parts.
I’ve striven to deliver software that meets the requirements, that delivers value to both the users and the business, and that evolves and changes as both needs and technologies evolve and change. I’ve tried incredibly hard to make both of the traditional architecture practices work, separately and together, but never with enough success to mitigate being a blocker and failing to be aware of every nuance of an increasingly complex system.
We find ourselves in a world where our software systems are made up of increasing numbers of independent, decentralized, and rapidly changing parts that are not always known or fully understood. Let’s take a closer look at the kinds of software we build today and the ways in which we build it. With this clarity, it becomes possible to identify what qualities a suitable practice of software architecture needs to have.
Five Revolutions Unlocked the Power of Software
Change is a constant in the world of software. There is always a new tool, language, platform, technique, approach, pattern, antipattern, right way or wrong way, organizational model, or shift in values to keep up with. The desire to use or do the latest thing is so embedded in our software culture that we must frequently be reminded to actually deliver useful and valuable outcomes. In fact, change is such a feature of our lives that Chapter 13 is dedicated to tackling the challenges it offers.
Over the years, responses to this need and desire for change have surfaced as a series of revolutions that have altered our relationship with code and, consequently, our software architectures. Each revolution challenges key aspects of our software-creation ecosystem and offers us a new way of seeing the world before us and what is possible. (You might not be experiencing all—or even any—of these, but the opportunity exists for us all.)
There have been five revolutions so far (shown in Figure 1-3), but that doesn’t mean there won’t be others in the future. (You can already see storm clouds brewing in the area of organization design. And who knows what machine learning might bring.)
Note
My use of the term revolution to describe the shifts in our relationships with code and software architecture is deliberate. You can think of a software architecture as revolving around an axis, which can have the effect of completely upending an established mindset. Although a rotation can bring us back to where we started, it will have caused us to view the world in a different way.
The first revolution came with the Agile Manifesto. It encouraged us to return our focus to running, tested code5 and trust in the humans doing this work. It highlighted the power of techniques such as test-driven development, pair programming, and continuous integration to help us achieve this.
The second revolution came with cloud computing. The increasing ubiquity and plummeting price point of network, storage, and compute cycles made us realize6 that we didn’t need to own the computers we ran our code on and that we could pay someone for time on theirs instead. This changed our perspective on just how fast we could set up and change our systems in production. No longer did we have to wait weeks (or months even) for a new server to be delivered in a box, unpacked, racked, cabled in, and an operating system deployed. After the second revolution, we could achieve the same thing within a few minutes, installed via the command line,7 and our code could be running on it very soon afterward.
The third revolution came with DevOps and continuous deployment. It became clear that there was no reason to keep treating the people feeding and watering our cloud runtimes as having a different culture, a culture that we could interact with only via ticketing queues. This shift, which broke down artificial boundaries between roles in Development and those in Operations, reshaped just how soon “soon after the code is complete” could be and who might be initiating the deployments. Operations teams embraced the tooling that developers took for granted and used it to build, deploy, and maintain services, which in turn allowed delivery teams to self-serve all the infrastructure and pipelines that automated every step between our machines and production. DevOps and continuous delivery encouraged us all to break down the silos and see our systems running in production, learning from how they both succeeded and failed.
This direct exposure to our production-running systems drove the fourth revolution, which came from product thinking. Product thinking helped us see that there was no guarantee of problem-solution fit or product-market fit unless we heard from users. For example, when I seemed too confident in my work, my colleague Monira Rhami, a product manager/CPO and ex-developer, would often challenge me by asking, “But how do you know your work actually works?” It was always a great point. Until my work was running in production and being used, I had no way of being absolutely sure of it. With product thinking, no longer did we have to build something and recklessly hope that it was valuable. Now we could write just-enough code, ship it, and prove the value from the feedback.8
Most recently, the fifth revolution of stream-aligned teams showed us that we were still getting in our own way. There was nothing so annoying as having two teams with conflicting priorities and imbalances of information delivering two halves of a feature on a shared codebase. Far better to put all the elements and information they required in the hands of a single team and let them get on with it. Feedback was improved yet again because the team owned the customer experience end to end, and it was far more direct.
Of the five, this last one had the longest uninterrupted gestation period—perhaps because it was the last to arrive and the one most extensively built on its predecessors. It was highlighted early on by people like Eric Evans in Domain-Driven Design (Addison-Wesley) and Donald Reinertsen in The Principles of Product Development Flow (Celeritas). James Lewis then made it increasingly actionable with his articulation of the microservices pattern. Only recently, however, has the concept of having stream-aligned teams come to individual prominence, in great part due to the work by the DORA Report/Accelerate team, by Marty Cagan in his book Inspired: How to Create Tech Products Customers Love (Wiley), and by Matthew Skelton and Manuel Pais with their Team Topologies approach.
The idea of flow referenced in Reinertsen’s book title, and also a constant theme in Team Topologies, is much beloved by people in the product world and refers to a fast-flowing river of value with nothing to get in its way. The search for flow means friction is constantly being removed, making feedback faster and more valuable.
Running throughout each of the software revolutions is the idea that nothing should impede a feature idea from getting in front of users. With flow, that feature can rapidly become a set of thinly sliced stories in a single team’s backlog, then a commit to source control triggering a deployment that, within a matter of minutes or hours, could be in front of a user and either meeting their need (success) or not (failure). If we succeed, we can move on to the next incremental user story. If we fail, we find out why and course correct.
Diana Montalion calls the sum of these revolutions “the new physics”, and that’s the perfect term for it. Collectively, they undermine the Newtonian certainties of software development, and previously unimagined possibilities open up before us. When these revolutions are combined, months of work can be collapsed into minutes. Guesswork and hope can be turned into experiments, data, and facts. The power of running code in production and gathering feedback is more accessible now than it’s ever been.
But the revolutions didn’t make everything easier. Although they brought significant, rapid, and arguably positive change, they also left architects scrambling to adapt, struggling to hold everything together.
The Effects of the Five Revolutions on Architecture Practice
Although the software revolutions brought us the changes we needed and desired, they also brought us more architectural complexity. We have increasingly decoupled and autonomous parts of software systems, but they still must be able to work together as a cohesive whole. Additionally, the right domain logic (the real-world business rules encoded in the software) must belong exclusively and entirely to the right teams, and those teams must be sufficiently decoupled from all the others. All the while, “just enough” architecture needs to happen to enable all parts of the system to meet all the identified needs as well as change direction as needed. And all this needs to take place while listening to the feedback that the running systems provide from production.
Without knowing of an alternative, architects would be under the impression that they’d need to make do with traditional architecture practices, depending on the same hierarchies and adhering to the same traditional cadences, ceremonies, toolkits, and intervention styles. These traditional architectural practices are problematic because they are rooted in power and control. The entire arsenal of legacy architectural tools, processes, and techniques is geared toward controlling certain aspects and keeping that control in the hands of a chosen few so that everything can be managed. This has led to a clash between traditional architecture practices and modern architectures that embrace decentralization and reward adaptability.
Let’s take the time to properly examine the effect that traditional practices have on the needs of modern software architectures.
The Rise of Decentralization
As a result of the software revolutions, our software architectures have become increasingly decentralized. Decentralization allows for more robust and future-proof systems. There are three fundamental aspects of decentralization that support our modern software needs.
First, “decentralization” is not “distribution.” Distribution is when you take something whole and split it into parts, which you then spread around. When you do this haphazardly, hoping that it will help performance, you can do more harm than good. I’m referring to the kinds of distributions where components are sliced apart and put on separate servers with little regard to their relatedness. Let’s say you distribute a monolith with a poor microservices structure. Suddenly, performance drops through the floor, timeouts are happening everywhere, (distributed) transactions are either taking ages or continually rolling back, data is probably in an unpredictable state, and you need to make changes to 17 different repositories to make a simple functional change.
This form of distribution is a bad idea because parts that were once close together get placed elsewhere without anyone verifying that they can handle the separation: they end up on two different threads of execution, with two different lifecycles, forced to stay in sync over an unreliable and slow network connection.
Decentralization, on the other hand, is when you identify coherent and complete elements that could be isolated and run separately from the greater whole. An example of this might be where you can separate logic that creates orders from logic that fulfills orders, enabling them to be packaged and deployed as separate microservices. This allows them to have their own lifecycles and respond in ways that are appropriate to them and them alone.
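To make this concrete, here is a minimal sketch in Python. All the names are hypothetical, and an in-process queue stands in for a durable message broker; it shows only the shape of the idea:

```python
# A minimal sketch of decentralization: order creation and order
# fulfillment as separately deployable pieces. All names here are
# hypothetical, and an in-process queue stands in for a real broker.
import queue
import uuid

order_events = queue.Queue()  # stand-in for a durable broker/topic

def create_order(customer_id: str, items: list) -> str:
    """Order service: records the order and publishes an event.
    It never calls fulfillment directly, so it can be deployed,
    scaled, and changed on its own lifecycle."""
    order_id = str(uuid.uuid4())
    # ...persist the order to the order service's own datastore...
    order_events.put({"type": "OrderPlaced", "order_id": order_id,
                      "customer": customer_id, "items": items})
    return order_id

def fulfillment_worker() -> None:
    """Fulfillment service: consumes events at its own pace. If it
    is down, events simply queue up; order creation is unaffected."""
    while not order_events.empty():
        event = order_events.get()
        if event["type"] == "OrderPlaced":
            print(f"fulfilling order {event['order_id']}: {event['items']}")

create_order("cust-42", ["book", "bookmark"])
fulfillment_worker()
```

Because the order service only publishes an event and moves on, fulfillment can be redeployed, scaled, or even down entirely without order creation noticing; each part responds in ways appropriate to it and it alone.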
Second, by decentralizing, you, as an architect or developer, are surrendering centralized control and accepting that things will need to be managed in a different way. If you split things up without surrendering overall control, you’re not decentralized—you’re just distributed. Anything less than a full commitment to decentralization will make your life harder, not easier. In fact, with distribution or partial decentralization, you need to be much more aware of where control is needed so that you can focus on it and protect it but also minimize it, allowing elements that can change independently to do so.
Take the languages and versions of frameworks that teams use when adopting a microservices approach, for example. One of the great benefits of microservices is that teams can make their own choices. As long as the microservices can run on the corporate cloud platform, send logs and metrics to the monitoring frameworks, and don’t break any open source licenses, then the teams should be allowed to choose their own. This means that teams can move at their own pace, freed from having to match the pace of the slowest team, which is, despite best intentions, stuck on a Java 1.4 JVM due to a dependency on an ancient Xalan jar, which is needed because the biggest customer can’t upgrade its old SOAP API because its last developer left years ago.
Third, decentralization will increase overall system complexity. Complexity is not the same as something being “complicated.” A complicated situation can be tackled, typically by simplifying things. A complex situation, by contrast, has the potential to tip out of control: things emerge without anyone intending them to—both good (like network effects) and bad (like cascading systemic failures, such as the 2008 credit crunch)—because everything is interdependent.
In our incredibly networked world, complexity rules our systems. We surrendered control of the parts of our systems that didn’t increase our competitive advantage (also known as “undifferentiated heavy lifting”) in the form of our data centers, our customer relationship management (CRM) solutions, our SMS notifications, our web frameworks, and our continuous integration (CI) engines. And why shouldn’t we? Why should we build and maintain these things ourselves when they give us no unique value? To have a sense of “control” is not a sufficient justification. As such, we embrace complexity to reap the benefits.
There are three aspects to this: how decentralization is best for teams, how decentralization is best for modern software, and how the benefits of both are realized only if these two decentralizations are aligned.
Teams work best when decentralized
Ask any software development professional—fresh and keen or seasoned and maybe a little skeptical—and they can tell you about the many, many hurdles between their code and the production environment. Who put those hurdles there? Perhaps the fault lies in the organization and its power structure. Perhaps it’s in management. Perhaps it’s on another team. Perhaps it’s priorities. Perhaps it’s a colleague. Perhaps (infrequently, though you do hear it) it’s themselves.
Whatever the source of these hurdles (real or imagined), what these developers are specifically talking about are couplings that block flow. These couplings typically take one of two forms:9
- Work couplings (“I’m waiting for them to do their thing so I can do my thing.”)
- Permission couplings (“I’m waiting to be told it’s OK for me to do this thing.”)
These couplings that block are frequently the remnants of centralized, prerevolutionary approaches and practices, and they hurt most when applied to decentralized organizational structures and software architectures. Therefore, teams should use decentralized practices that came about with the various revolutions; to do one without the other is a recipe for failure. How can you achieve this? It’s all there in the revolutions.
The Agile Manifesto principles, for instance, encourage us to: “Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.” Various Agile and Lean methods then hint at how we might go about this. Complementing this is the broad range of DevOps practices, such as continuous delivery, which puts the power of building and running our software firmly back in the hands of those who write it, without the need to coordinate with other teams nor being forced to work at the cadence of a centralized release manager or change advisory board (CAB).
By applying decentralized practices, build-and-run teams can now see firsthand their code being put to the test in production. Product thinking paired with this real-world feedback allows us to see our software from a different perspective, through the eyes of the user. This in turn enables us to ensure that our work actually provides value—enough value that someone is willing to pay for it. And if we find that the product we are working on isn’t valuable, we can change gears and work on building things that actually are valuable.
In essence, we are able to align our teams to our business or our product, which is exactly what thinkers such as Mel Conway10 and Eric Evans11 suggested decades ago. My colleague James Lewis also has a nice way to express this: “The business and its organization should be isomorphic.”
By employing tools and practices brought forth by each successive revolution, we are incrementally shortening the time between the writing of our code and the moment when our code will be executed by an end user in production, and in doing so, we are incrementally increasing the power and actual usefulness of our code. We are also stepping (consciously or unconsciously) toward independent, self-managing, self-organizing, and self-sustaining teams, with everything they need to get the right running, tested code into production while driving value and learning from mistakes. Like it or not, this is the future of most software systems.
When they are done right, all these new tools and practices are profoundly beneficial. I’ve been involved in transformation after digital transformation where we broke down Taylorist silos12 and erected benign boundaries based on user segmentation and domain cognitive load.13 I’ve seen firsthand that reduced work and permission coupling improves flow and delivers the best software the most efficiently. It is also sustainable. Teams working in this way are happier and less burned out.14
Modern software works best when intentionally decentralized
For software to meet our demands, it has almost always needed to be connected to other components, often a database over some form of network. Once our software was connected to that first external component, splitting out further elements for various reasons didn’t feel like a great leap. First, we put client code down on desktop PCs, and then, with the rise of the World Wide Web, we moved frontend code to a new “web tier” and then all the way down to browsers. These leaps weren’t decentralization, though. Not yet. Decentralization came when we discovered that we could also consume external independent services for more functionality.
With the benefits of these independent external services, it became clear that decentralization was a sensible idea. If we could slice our systems into small, independent pieces, we could realize systems that were potentially less susceptible to catastrophic failure and other plagues of tight coupling. If we minimize the coordination between the various parts, then not everything and everyone needs to be in sync or in agreement all the time. Admittedly, this kind of decentralization makes everything more complex, but that complexity can be bounded (for example, by putting code for different services in different repositories), and we can then reap the benefits of availability, resilience, performance, and scalability.
In 2000, I worked at Sun Microsystems in Linlithgow, Scotland. Every day, I would pass under a massive sign with the slogan “The network is the computer.” Although it was revolutionary for a vendor to have this vision back then, it’s now the reality of our world as consumers of various networked services as well as software professionals.
These days, all the biggest and most powerful software systems take advantage of these architectures, and no one thinks twice about architecting one consisting of various autonomous-from-the-outset subsystems—systems that bring in third-party services and expose APIs to teams outside our firewall and organization. We can confidently use message queues and pub/sub patterns and ensure that, when distributed globally across multiple, virtual (cloud) data centers, the right elements are kept in sync, but no more than is necessary.
Decentralized teams and their software must be aligned
“Rubbish!” I hear you cry. “We moved to continuous delivery, and it made everyone’s lives a living nightmare! And don’t get us started on microservices…”
I understand what you mean. Building a decentralized architecture is hard work. (I spend my life helping clients with this precise problem.) It’s even harder if teams are not decentralized in parallel with that architecture—and impossible when the traditional approaches don’t support the practice of architecture to realize it.
Decentralized teams deliver and run decentralized software. Decentralized software is delivered and run by decentralized teams. If the two are not aligned, then teams will be in a constant state of conflict, and coupling will abound. Perhaps it’s a tautology to say it. But what sounds like common sense is, in my experience, far from common.
In his 1968 paper,15 Conway showed that the human and organizational aspects of software delivery have a massive impact, yet (because we’re technologists) we put all our focus on the software and hope that the human part will take care of itself. But we humans can subconsciously put up a fight, sticking to our old, Taylorist worldviews. Whether you’re an architect or a software developer, this book will give you the tools to facilitate alignment and collaboration on software architecture practices between architects and teams.
The Fall of Centralized Architecture Practices
In “Teams work best when decentralized” I touched on how decentralized teams reduce coupling and, therefore, block less. In other words, they make better use of their available resources, such as time. The same is true of software architecture itself. Let’s look at the inverse: overly centralized software architectures are blocking and therefore inefficient.
Blocking—the method of pausing execution of a task while waiting for some condition to be met—can manifest in a software system in many different ways and in many different parts. (Remember the “work and permission couplings” I mentioned previously? This is the software architecture equivalent.) For example (a minimal code sketch follows this list):
- At the network level when awaiting responses from HTTP endpoints.
- At the database level when rows (or worse) are locked.
- At the OS level waiting for a process call to return.
- At the cluster level waiting for a quorum to be established or a new leader to be (s)elected.
- At the deployment level, when your work is not packaged as a single independent deployable and instead must make its way to prod as part of a bigger whole, you must block. In this case, you wait for everyone’s automated tests to complete and the next deployment window.16
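To see this waste in miniature, here is a sketch in Python of four tasks contending for a single serialized resource: a stand-in for a locked row, a saturated endpoint, or a shared deployment window. The names and timings are illustrative only:

```python
# A sketch of blocking as waste: four tasks contend for one
# serialized resource. Timings and names are illustrative only.
import threading
import time

shared_resource = threading.Lock()  # stand-in for a locked row,
                                    # a saturated endpoint, or a CAB

def do_work(task_id: int) -> None:
    start = time.monotonic()
    with shared_resource:   # every task queues here, one at a time
        time.sleep(0.5)     # the actual work
    blocked_for = time.monotonic() - start - 0.5
    print(f"task {task_id}: ~{blocked_for:.1f}s spent blocked, idle")

threads = [threading.Thread(target=do_work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Four half-second jobs take ~2s end to end; the growing queue in
# front of the lock is precisely the waste described above.
```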
As professionals concerned with building, running, and sustaining a quality product, we need to ensure value for money. Blocking gets in the way of that because blocking is a source of great waste. Blocking creates queues and is a warning that, while one small set of resources is probably working very, very hard indeed, many other parts of the system may be idle, tapping their virtual fingers (and perhaps timing out) waiting for their turn to come.
As with decentralization, there are three aspects to the failures of centralized architecture practices: the blocking of delivery flow, the failure to factor in sufficient feedback, and the fact they are incompatible with decentralized architectures.
Traditional architecture practices block delivery flow
Now let’s turn our attention to the human element in traditional architecture practices. Centralized software architecture practices are a prime cause of blocked delivery flow.
These blocks turn up as bottlenecks around architects and the decision queues that consequently form in front of them, impeding the flow of work. I’ve seen these blocks to delivery flow even in innovative companies and in highly effective teams where technical skills are not in question. In the end, it comes down to poor use of resources arising from unnecessary permission coupling—in this case, waiting for an architectural decision or for approval of a decision that has already been made.
Why do the traditional practices of architecture cause this blocking? I believe it’s because, although we technologists are happy to think and act creatively over and over again in the name of technical optimization, we rarely turn the same skills on ourselves: thinking critically about how our practices might be creating human coupling and using that insight to challenge the status quo.
Why are we only now feeling the pain caused by these permission couplings in traditional architecture practices? Because the flow rate of software delivery has increased rapidly and significantly. Prior to the first revolution (Agile), running tested code was not seen as the primary goal, so we had longer to think. Prior to the second revolution (cloud), we had a pretty good idea what the runtime platforms would look like months or years in advance, so we could incorporate that into our designs. Prior to the third revolution (DevOps), deployments were larger and less frequent, so we could work in bigger chunks. Prior to the fourth revolution (product), we could rely more on a set of (relatively) static functional and cross-functional requirements. And prior to the fifth revolution (stream alignment), teams were aligned along technical rather than domain lines, so we had to context-swap less when running between them.
The cumulative result of the software revolutions to date is that more and more teams expect to take ever more of their independent features all the way through to production, almost immediately and without any undue friction; to them, this feels perfectly natural. This is great news for product managers, who always want users to get their hands on the software. This incremental ratcheting up of deployment cadences, across more and more independent teams, means that the practice of architecture is called on in many places, all at once, over and over again.
Yet the traditional architectural practices still require all decisions to be passed through the eye of the architecture needle, to gain the mandatory “responsible architect’s approval.” While the aim of this was to allow the architects to deliver on the three goals of good architecture (appropriately coherent and cohesive, suitably decoupled, and sufficiently adaptable), it also means that architects are swamped. You can see this represented in Figure 1-4.
This means that architects, who are now far outnumbered by the changes making their way to production on a regular basis, have two choices: do as little as possible and hope that the software teams deliver good architecture on their own, or do enough to ensure the three goals are met in each and every change, blocking the flow in the process.
Architects trying to do architecture in the traditional ways are now the biggest blocker in more and more organizations because their practices cannot cope with the volume, variety, and cadence of changes brought about by the five revolutions given the size of systems today. (And that’s without even trying to keep up with the pace of technological change.)
Traditional architecture practices fail to factor in sufficient feedback
Blocking is the first way that traditional architecture practices fail, but it is not the only one. They also famously fail to factor in sufficient feedback for the architects who use them (see Figure 1-5).
At the beginning of this chapter, I talked about how the ivory tower extreme of architecture practices fails to factor in sufficient feedback from teams. These practices contain painfully few built-in mechanisms to gather feedback that rolls up to architects and subsequently affects their decisions.
Hands-on architecture approaches came about in great part as a direct response to this problem after practitioners heard the howls of pain from teams and moved to remedy their problems. Hands-on practices have architects visiting teams that need their assistance, doing some co-design with team members, and perhaps even some pair programming on a bit of skeleton code. By doing so, they may experience how easy or hard a new feature is to implement in the existing codebase, and consequently, they are open to the feedback from the code.17
Yet feedback from the code is only a fraction of what is available: the system running in production offers far more, and neither traditional practice has a built-in way to capture it. This is a vast amount of information to ignore, and in the remaining sections of this chapter, I’ll explain why it’s essential to factor in feedback when decentralized software is being built by decentralized teams.
Traditional architecture practices are incompatible with a decentralized world
While we can build better systems faster as a result of the software revolutions, we can also deliver distributed software, which potentially exposes us to catastrophic failure modes, triggered by systems and individuals we have no awareness of, let alone control over. Software architecture is frequently the source of both success and failure.
Not only that, but a well-architected system can be hospitable to build and evolve, self-healing, elastically scalable, and incredibly resilient—able to operate even during partial failure. A poorly architected system, on the other hand, can be unmaintainable, debt ridden, expensively underperformant, and terribly brittle.
In a perfect world, hands-on architecture practitioners would be able to split their attention across all teams, giving them all individual attention, and still keep an eye on the architecture as a whole. In a perfect world, ivory tower architecture approaches would factor in feedback from everyone and everything. But we don’t live in a perfect world, and ivory tower and hands-on architecture practices are both limited by their methods of engagement. Within a decentralized system, they both need to deal with all the independent moving parts and all the team relationships as well as address frequent changes to everything and anything.
Attempting centralized architecture approaches in a decentralized world is an impossible task. An alternative is needed.
What Must Any New Practice of Architecture Provide?
Whatever the approach, we need architectures that deliver on the three goals of good architecture. They must be appropriately coherent and cohesive, suitably decoupled, and sufficiently adaptable.
We need to go back to a blank sheet of paper and imagine other ways to achieve these ends. Specifically, we need a completely new—complementary—system where our architecture practices work with the dynamics and cadences of both our teams and our software systems, rather than fighting them. And as we have seen, this new system must have two key aspects: it must be decentralized, and it must incorporate feedback at its core.
It needs to be decentralized to allow independent teams to decide on architectures with minimal coordination. This means more architecture is happening in parallel. If you can do more architecture, you can unblock the writing of more code, optimize the flow of delivery, and consequently have more of your architecture running.
It also needs to provide everyone with direct and rapid insight into the emergent properties of this architecture as it runs. This information must then be “fed back” directly into both architects and decentralized, independent teams, allowing everyone to cut through the complexity, understand it, and respond to it in terms of further architectural decisions and implementations.18
A new approach to architecture must factor in certain forces to be truly successful. Although no approach can protect against the forces of chaos, it should:
- Embrace uncertainty
- Allow for emergence
Let’s work through each force, starting with what architecture practices cannot ever achieve.
No Approach Can Protect Against the Forces of Chaos
No architecture practice can protect against chaos. When I use the term chaos, I mean in the context of physics. According to the Oxford English Dictionary, chaos is “the property of a complex system whose behavior is so unpredictable as to appear random, owing to great sensitivity to small changes in conditions.”19
I’m sure that definition of chaos sounds familiar to many architects and developers alike. Neither the architectures you conceive nor the teams that build them could be described as “predictable.” Perhaps “appearing random” feels like a stretch, but I bet that if you were asked to list all the factors that might have an impact on each and every system, then to enumerate the potential ramifications of each, and then to think through how all those results interact with one another, you’d end up with far more than anyone could cope with. And those are only the factors you could think of. Things are combinatorially complex: add just one more thing, and the number of relationships it has with all the other existing things explodes.
What if we then asked your colleague? Would they come up with a few relationships that you’d forgotten? Inevitably.
It doesn’t take many independent parts to unleash something far too complicated to reason about exhaustively, let alone predict how it is going to operate. Therefore, any new approach to architecture must not attempt to exhaustively predict how systems will operate and protect against all potential ramifications.
Architectures Should Embrace Uncertainty
Decentralized architecture, as a combination of apparently simple elements, including independent teams and their independent software modules, gives rise to uncertainty. Where does that uncertainty come from? Variability. Variability, as I’ll discuss in more detail in Chapter 13, refers to the unknowns in software development that need to be addressed—unknowns that are present throughout a system’s cycle of evolution.
Consider this: you have a system that is made up of a collection of interacting components. These are not only the components of the software that you’re building but also the teams that are building them, the tooling they use to deploy them, the processes used to work together to build and run them, the infrastructure they build and run on, and the external, third-party dependencies that the software depends on. These are all important parts in what is commonly referred to as the sociotechnical system that is engaged in building and running the software.
Even taken alone, each part is neither wholly understandable nor predictable. Each can be a source of variability. Combined, they get far worse because when you interact with something, you act on it, but it also acts on you. It’s a two-way action, one that changes both parties.
Perhaps even now you are thinking, “But if I have two components, and one calls the other, I can predict the outcomes of that operation in my head.” At this point, I need to highlight a most important concept that gives rise to system complexity.
Let’s take an artificially simple scenario of A and B interacting. In this formulation, “A causes something to happen in B” is linear thinking. What happens in reality is that the consequent actions of B will also result in something happening back in A.
For example, let’s imagine we’re talking purely about a technical system: a synchronous interaction such as an HTTP GET. After it sends the request, A will be waiting for B to respond. If B has what A wants, then A will receive it, probably in the form of a “200 OK” response with a payload. If it does not, then A will have to respond to those eventualities too: “404 not found,” “500 internal server error,” and so forth.
Even under this “normal operation” scenario, you can see that there are a few things going on. We have to add only a few other possibilities for things to get complicated. What if B takes a long time to respond, and before it does, A performs a retry? B is now (potentially) performing the operation for A two times over. Does it give A two of the things it requested? Or does it give A the same thing twice?
Perhaps, instead of a retry, A stops waiting and goes off to try another way to achieve its goal. That means that when B eventually returns the thing to A, A isn’t around to do anything with it. What does B do then? Does it care? Does it put some resources back in the pool? Does it even know that A didn’t care any more?
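Here is a minimal sketch of that A-and-B scenario, with Python’s asyncio standing in for two services talking over a network (all the names are invented). A times out and retries, B begins the same operation twice, and A may walk away with nothing at all:

```python
# A sketch of the A/B timeout-and-retry scenario. asyncio coroutines
# stand in for two services talking over a network; names are invented.
import asyncio

b_started = 0  # how many times B begins the operation

async def service_b() -> str:
    """B: takes 0.3s, longer than A is willing to wait."""
    global b_started
    b_started += 1
    await asyncio.sleep(0.3)
    return "the thing A wanted"

async def service_a() -> str:
    """A: waits 0.2s for B, then retries, with no idea whether
    B is still working on the first request."""
    for attempt in (1, 2):
        try:
            return await asyncio.wait_for(service_b(), timeout=0.2)
        except asyncio.TimeoutError:
            print(f"attempt {attempt}: A got tired of waiting")
    return "A gave up and went elsewhere"

print(asyncio.run(service_a()))
print(f"B began the operation {b_started} times")
# B began the work twice for one logical request. (Here the timeout
# cancels B mid-operation; a remote server would have kept going and
# produced a result that nobody was around to collect.)
```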
By asking these questions, we are opening our complicated, artificial, tech-only world up to the sociotechnical world around it. Why did B take a long time to reply? Was it because it was already loaded with other requests? Similarly, why did A stop waiting? Was it because its needs were met elsewhere? Or did it simply get bored? You can see that, even from this incredibly simple example, there’s a lot that we need to think about when things interact, and our systems are never this simple.
But there’s more. If we add further well-understood components to the picture, each of them has the potential to interact with all the others in a myriad of different ways, and maybe not even directly. Here’s an interesting exercise to try: introduce some latency to a single part of any system you currently run. It doesn’t matter if the part is trivial or if it’s on the periphery. See what happens. You might be surprised at how simply slowing down one thing can break seemingly unrelated things.20
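If you want to try that latency exercise, one low-tech option is a tiny TCP proxy that you point a client at instead of its real dependency. The sketch below uses placeholder host and port values, and purpose-built fault-injection tools do the same job more safely, but it shows how little it takes:

```python
# A sketch of latency injection: a tiny TCP proxy you point a client
# at instead of its real dependency. UPSTREAM and the delay are
# placeholders; adjust for your own system.
import asyncio

UPSTREAM = ("example.internal", 8080)  # the real dependency (placeholder)
DELAY_SECONDS = 2.0                    # latency injected per connection

async def handle(client_reader, client_writer):
    await asyncio.sleep(DELAY_SECONDS)  # the injected latency
    upstream_reader, upstream_writer = await asyncio.open_connection(*UPSTREAM)

    async def pipe(reader, writer):
        try:
            while data := await reader.read(4096):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    # shuttle bytes in both directions until either side closes
    await asyncio.gather(pipe(client_reader, upstream_writer),
                         pipe(upstream_reader, client_writer))

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 9000)
    async with server:  # clients now connect to 127.0.0.1:9000
        await server.serve_forever()

asyncio.run(main())
```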
Any new approach to architecture must acknowledge—even incorporate—information from as many of these component interactions as possible to respond to the variable, unpredictable eventualities that inevitably arise.
Architectures Should Allow for Emergence
What could be more surprising than coming across properties in a software system that you didn’t design? Properties that didn’t arise from a single part and instead come into being because of the interaction of multiple parts?
This phenomenon is generally called emergence: an entity is observed to have properties its parts do not have on their own, properties or behaviors that emerge only when the parts interact in a wider whole.
Emergence can manifest in two different ways, which are distinguished as strong and weak. Both are important in how we architect our systems.
Strong emergence is where we live and work as software professionals every day. It is the phenomenon where a new, higher-level entity is created by the combination of multiple, separate elements. This higher-level entity emerges from the combination of these individual parts. A car is a good example of this, or a human body, or a distributed software system.
Weak emergence also happens to us all the time, but it is different in two key aspects: it is only self-evident after the fact, and the individual parts remain independent when it acts. Let me give you a real-world example to help you grasp this idea.
I once worked on a system that exposed APIs to a number of well-known online retailers, such as eBay. Our APIs allowed eBay’s sellers to buy postage-paid shipping labels as per their customers’ shipping requests. Our APIs were simple, as were the few microservices that sat behind them. My team felt smart because, when planning, architecting, and performance testing, we had the demands of Black Friday in mind. We knew traffic around that time would be the greatest our system would have to endure, and we did a bunch of stress testing specifically to ensure we could be confident the system would be able to cope.
What we didn’t know—and thus didn’t plan, architect, or test for—was the more regular, far lower volume, and, frankly, much less exciting behavior cycle of eBay sellers. It was this client behavior that gave rise to some simple but interesting emergent behavior that we could not have predicted.
eBay sellers in the UK, it turns out, tended to make their auctions end sometime during the weekend, and then to catch the first post on Monday, they would buy the shipping labels from us on a Sunday evening. This meant we would see a spike in requests regularly every Sunday. The graphs were beautiful to behold, but that wasn’t the emergent part.
For some labels, the seller would want to include a tracking number. Tracking numbers are a finite resource, so we needed to be doubly careful that we didn’t hand the same number out on two stamps. You can imagine the confusion if a purchaser looked to see where their parcel was and saw it headed to someone else’s house, purely as the outcome of some race condition. Consequently, it’s easier to create a pool of tracking numbers and then hand them out as required.
One day, a month or so after go-live, we were conducting our regular maintenance checks and took a look at the tracking numbers database. It had way more tracking numbers in an unavailable state than we thought it should (based on how many tracked parcel labels we knew we’d sold). We’d been bleeding tracking numbers, but how?
Upon closer examination, we realized that there were whole ranges of tracking numbers in the database that were stuck in the “reserved” state. These were tracking numbers that had been provided for use in labels but whose usage had never been confirmed. When we looked deeper, we could see that the reserved timestamps corresponded with the Sunday-evening postage peaks, and looking closer still, we saw that these were happening at the start of the ramp-up of requests. What was happening?
It turned out to be the result of a “clever” thing we’d done: we built in an elastic scaling capability for our services so that we would have the right number of instances running when we needed them. What was happening was that, as we scaled all the services rapidly, the system would issue more requests from client microservices than the tracking number service could cope with. Some of these requests would time out, and the calling client microservices would consequently issue retries. Those retries would reserve another tracking number, and everything would then proceed as planned, ultimately marking this new tracking number as “used.” There was one small issue. The failed requests left the tracking numbers they’d been using in a “reserved” state. We’d forgotten to build a mechanism to put those reserved tracking numbers back into the pool if no one came back to mark them as “used” within a certain time.
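To make the failure mode concrete, here is a minimal sketch of such a reservation pool in Python. The names are invented (the real system’s design isn’t shown here), and the final function is merely one plausible shape of the reclamation we had forgotten:

```python
# A sketch of the tracking-number pool and its leak. All names are
# invented; the real system's design isn't shown in the text.
import time

pool = {f"TN-{i:04d}": {"state": "available", "reserved_at": None}
        for i in range(5)}

def reserve() -> str:
    for number, record in pool.items():
        if record["state"] == "available":
            record.update(state="reserved", reserved_at=time.time())
            return number
    raise RuntimeError("pool exhausted")

def confirm(number: str) -> None:
    pool[number]["state"] = "used"  # label printed; number consumed

# The leak: a request reserves a number, times out upstream, and its
# confirm() never arrives. The retry reserves a second number, and
# only that one is ever marked "used".
first = reserve()   # timed-out request; no confirm() will follow
retry = reserve()   # the retry gets a fresh number...
confirm(retry)      # ...and completes normally
print({n: r["state"] for n, r in pool.items()})
# {'TN-0000': 'reserved', 'TN-0001': 'used', ...}: TN-0000 has leaked.

def reclaim(lease_seconds: float = 3600.0) -> None:
    """One plausible shape of the missing piece: periodically return
    numbers held in "reserved" beyond their lease to the pool."""
    now = time.time()
    for record in pool.values():
        if (record["state"] == "reserved"
                and now - record["reserved_at"] > lease_seconds):
            record.update(state="available", reserved_at=None)
```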
These chunks of reserved tracking numbers in the database were an emergent property of the running of this simple set of components. I’ll not go into the fix, but it was a simple one. (You’re probably in your head right now telling me how you’d do it.) Rather, let’s focus on the three things you should take away from this example.
First, once you see them, weak emergent effects are not hard to understand. (When observed, emergence rarely is.) What is surprising is that, prior to seeing it, the possibility of it had never even entered your mind.
Second, once you see a weak emergent effect in operation, it’s not hard to work with.
Third, the emergence in this example happened precisely because we believed we’d thought of all eventualities—and we’d thought of a lot of them. What we saw was caused by the interaction between the elements we had put into the system and the users.
You can map this anecdote to the definition that I gave for emergence earlier. First, the fact that emergence creates in systems “properties its parts do not have on their own” is the predictable reserved-but-not-reclaimed pattern we could see in our data store. Second, “emerge only when the parts interact” is the result of a set of factors: the rate of the auto scaling, the nature of the response times from the tracking number service, the duration of the request timeout, the subsequent retries, and the fact that we never thought to reclaim the tracking numbers. All of these played their roles in the pattern we saw.
We could not have architected for this specific weak emergence in advance. You’ll notice that we did think of things—the failures, for instance—and had put in retries and the need to scale instances horizontally based on rising load, but those in turn had created effects in combination with other elements. Despite your best efforts, weak emergence will happen again and again, both where you expect it and where you don’t.
The more you chase these effects down, the more unforeseen emergent events will come to the surface in the form of variability from the expected functioning. This is because the systems we build operate in unpredictable ways, and these emergent effects will always happen. Foreseeing all possibilities is not just hard; it is impossible.
So we should stop trying to predict how our systems will run in the wild. Instead, any new approach to architecture should optimize for getting architectures running in production as soon as possible and for responding to the emergent effects as they arise.
Conclusion
My goal in this chapter was to describe the pain that traditional approaches to architecture practice cause. I took you through the challenges of trying to impose traditional centralized architecture practices in a world that favors decentralization.
As decentralization increases, traditional architectural practices increasingly overindex on aspects that are less and less likely to succeed and underindex on what needs to be happening more and more.
Traditional software architecture practices cannot cope with either revolutionary delivery cadences or emergence. We need to stop thinking slowly and linearly because the systems we build are rapid and nonlinear. We need a new way to approach architecture.
This new approach to architecture should incorporate decentralization and rapid, constant feedback. It should acknowledge that chaos and complexity are inevitable, and it should accept that emergence cannot be fought, only embraced.
In Chapter 2, I’ll discuss what must lie at the heart of all approaches to software architecture: architectural decisions. It is only by understanding decisions that alternative ways to practice architecture begin to make sense.
1 Grady Booch, “All architecture is design, but not all design is architecture”, Twitter (now X), November 11, 2021.
2 Martin Fowler, “Making Architecture Matter”, OSCON 2015 keynote address, July 23, 2015, 14 min., 3 sec.
3 Patterns in software have an interesting history, which we frequently misremember. I presented a lightning talk at DDD Europe 2021 on this topic.
4 Reference architectures purport to capture industry “best practice.” The Open Group’s IT4IT reference architecture is a good example of this.
5 Evident in Agile Manifesto principles such as “our highest priority is to satisfy the customer through early and continuous delivery of valuable software”; “deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale”; and “working software is the primary measure of progress.”
6 You could argue it’s remember rather than realize. Sharing of available compute cycles was a core concept in the days of the mainframe. The TV show Halt and Catch Fire (AMC, 2014) spent at least a whole season talking about this.
7 My first encounter with this dream was via the cover of Automating Solaris Installations: A Custom Jumpstart Guide by Paul Anthony Kasper and Alan L. McClellan (Prentice Hall). It depicts a single admin with their feet up on the desk, hands behind their head as automation provisions machine after machine with just the right software.
8 As you learn when you try to write a book, there is a far from tidy history to many of these ideas. One of the earliest in this area, but one that also prefigured many of the other revolutions, is documented in the 1986 Harvard Business Review article “The New New Product Development Game.”
9 There is a third, weaker kind of coupling—information coupling—but this does not block. It causes inefficiencies in other ways. See Vlad Khononov’s talk “The Fractal Geometry of Software Design” for more on this topic.
10 Mel Conway introduced his idea in his 1968 paper “How Do Committees Invent?”. The whole thing is worth reading, but the sentence that is quoted over and over is as follows: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” This has now become immortalized in the Inverse Conway Maneuver, which advocates for organizing your software teams along the lines where you want your software to be split.
11 Eric Evans famously wrote in detail about aligning teams to one or more bounded contexts in the final section of his book Domain-Driven Design: Tackling Complexity in the Heart of Software (Addison-Wesley). His goal was to manage model complexity and keep cognitive load low for the multiple teams building the software. Part of the goal was to reduce coupling and therefore allow teams to be more independent.
12 During World War I, businesses in the United States and United Kingdom began to apply a theory of scientific management first developed by Frederick Winslow Taylor between 1885 and 1910. Intended for the systematic training of blue-collar workers on a large scale, this system analyzed tasks and broke them down into individual, unskilled operations that could then be learned quite quickly. Despite the fact that it never really worked and that, more important, intense role-and-responsibility siloing was entirely unsuited to knowledge work, the technique took hold in Western management circles and remains embedded in management mental models today. For more background, see Peter Drucker's essay "Management's New Paradigms" in collections such as The Essential Drucker.
13 Trond Hjorteland has a great LinkedIn post and accompanying talk, both titled “Good Fences Make Good Neighbors,” that go into the benefits of these benign boundaries.
14 It should come as no surprise that Dr. Nicole Forsgren, Jez Humble, and Gene Kim identified not only that teams adopting the practices laid out in Accelerate (IT Revolution Press) are more efficient but also that they are happier and less burned out. I'd encourage you to read the entire book to understand this in depth, but the summary flowchart from the book lays it out very clearly.
15 This is the second time I’ve mentioned Conway and his 1968 paper. It’s only four pages long. I’d recommend you take the time to read what is an incredibly influential and prescient study.
16 This is actually straying into the organizational world, but the reason for the blocking is still a technical coupling one. If the systems were technically decoupled, the software elements could be deployed independently.
17 Even if the team is practicing DORA Elite levels of continuous deployment practice. By DORA Elite, I'm referring to the four levels of engineering effectiveness introduced by Forsgren et al., first in their annual "DORA State of DevOps" reports and subsequently in their book Accelerate. Of the four levels, Elite was the best. It meant you were in the top bracket of all those surveyed for lead time for changes (short), deployment frequency (high), change failure rate (low), and mean time to recovery (short).
18 If this reminds you of techniques and approaches like John Boyd’s OODA (observe, orient, decide, act) loop, sense and respond, build-measure-learn, continuous delivery, and everything laid out in the DORA State of DevOps report’s four key metrics, then you’re on the right track. My point is that we need to bring the practice of software architecture into line with all these other techniques.
19 Oxford Dictionary of English (2010), under “chaos.”
20 It might not come as a surprise to learn that "Latency Monkey," which does exactly this, is a less famous member of the Netflix "Simian Army" of chaos engineering tools.