The evolution of systems requires an evolution of systems engineers

Systems and site reliability engineers, architects, and application developers must create new strategies to meet industry shifts and their constraints.

By James Turnbull and Ines Sombra

March 20, 2018

Nature-kaleidoscope (source: Reinhard Klar on Flickr)

Over the last few weeks, we’ve been reflecting on changes in the technology industry from when we first started our careers up to now. We’ve been looking at the changes in two different but overlapping spheres: changes to technology and changes to methodology. The systems we worked on when many of us first started out were the first generations of client-server applications. They were fundamentally different from the prior generation: terminals connecting to centralized apps running on mainframe or midrange systems. Engineers learned to care about the logic of their application client as well as the server powering it. Connectivity, the transmission of data, security, latency and performance, and the synchronization of state between the client and the server became issues that now had to be considered to manage those systems.

This increase in sophistication spawned commensurate changes to the complexity of the methodologies and skills required to manage those systems. New types of systems meant learning new skills and understanding new tools, frameworks, and programming languages. We can trace back to this moment the spawning of numerous new specializations that had previously been more concentrated in single roles: front-end engineers, back-end engineers, data scientists, designers, UX/UI specialists, and myriad other specialties. We can perhaps also trace back to this period the construction of more siloed functions and the increased complexity in transitions between those silos, the same silos that the DevOps and SRE communities are attempting to dismantle today.

Since the first generation of client-server systems, we’ve seen significant evolution, much of it driven by technology becoming mission critical to doing business, for any business in every industry. This has been coupled with customer demand for fast, immediate functionality available on devices, delivered seamlessly across different geographies and fabrics. Take, for example, the evolution from renting videos at the corner video store to streaming on Netflix, Hulu, and their peers. Our expectation of latency for the delivery of content has dropped from hours or minutes to seconds. We expect that content to be available to us 24x7x365 on every device we own and in every location: from our homes and offices to being on the move. We, as customers, also don’t care about the infrastructure or the complexity of the systems required to deliver this: we just want to binge watch the new season of Jessica Jones.

Each iteration of this evolution has required changes to the technology, the systems, and the skills we need to build and manage them. In almost every case, those changes have introduced more complexity. The skills and knowledge we once needed to manage our client-server systems are vastly different from those needed for modern distributed systems, with their requirements for resilience, low latency, and high availability. So, what do we need to know now that we didn’t before?

Redefining the minimum viable product

As practitioners, we’ve had to build better. With availability and resilience being prime concerns, an application’s minimum viable product has had to be redefined. Good design goals now have to include a baseline architecture for operability, security, performance, and observability. Every engineer, from a front-end engineer working on a React component to a back-end engineer building a distributed data store, needs to consider how their piece of the system will impact the overall system.

This is especially true because the performance demands of our users have created new constraints in the computational models and state management strategies available to our systems. Computational models are turning to serverless and edge computing architectures to reduce latency for users. The new lesson we’ve learned: it’s always more efficient to perform computations as close to the end user as possible.

This is also true for state management. Applications are being deployed from inception with distributed state, shared storage, and possibly even the migration of data (or some segment of data) from centralized stores into the edge and the cloud. But being closer to the end user enables faster decisions at the expense of greatly increasing the complexity of our applications.

Both of these constraints mean engineers need to understand how their part of the stack pairs with the other pieces and what implications a seemingly small change might have for the overall system. And when this can’t be modeled mentally, due to complexity or lack of insight into the systems, then it has to be modeled programmatically via observability, instrumentation, tracing, and tests.

We can no longer rely only on simplistic probing to identify failures or to provide sufficient information to debug faults. Applications with complex architectures and distributed state that look fully functional to probes may not be performing optimally or accurately for end users. Even when looking at metrics and events, which in turn require correlation and leveling across disparate systems, we struggle to gain a full picture, as traditional approaches and even calculations of latency are less accurate for distributed systems.
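
One way to see why latency calculations get harder across several machines is the sketch below. It is an illustration only: the two nodes, their request counts, and the latency values are invented for the example, and `percentile` is a simple nearest-rank helper rather than any particular monitoring tool's method.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Ceiling of pct% of the sample count, computed with integer math.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (made-up data).
node_a = [10] * 1000               # busy node, every request is fast
node_b = [10] * 90 + [2000] * 10   # quiet node, one request in ten is very slow

p99_a = percentile(node_a, 99)
p99_b = percentile(node_b, 99)

# A common shortcut: average the per-node p99 values.
averaged_p99 = (p99_a + p99_b) / 2

# The p99 of all requests, computed over the pooled samples.
pooled_p99 = percentile(node_a + node_b, 99)

# How many users actually waited a second or longer.
slow_requests = sum(1 for ms in node_a + node_b if ms >= 1000)

print(f"averaged p99: {averaged_p99} ms")
print(f"pooled p99:   {pooled_p99} ms")
print(f"requests >= 1s: {slow_requests}")
```

With this made-up data, averaging the per-node values reports 1005 ms, a latency no request experienced, while the pooled p99 reports 10 ms and hides the ten requests that took two seconds. Neither single number describes what users saw, which is why raw events or mergeable histograms tend to be more trustworthy than pre-aggregated percentiles.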

The instrumentation of your applications is now a mandatory step in the development process and no longer an afterthought. Every engineer needs to consider how to articulate the state, performance, and observability of their aspects of the system. This requires engineers to develop the skills and adopt the techniques to ship these new capabilities.
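
As a rough illustration of what building instrumentation in from the start can look like, the sketch below wraps a function so that every call records a count, an error count, and elapsed time. The names `METRICS`, `instrumented`, and `lookup_title` are hypothetical and chosen for this example; a real service would more likely use an established metrics or tracing library than an in-process dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory metrics store keyed by operation name (hypothetical, for illustration).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Decorator that records calls, errors, and total duration under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise  # re-raise so callers still see the failure
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("lookup_title")
def lookup_title(catalog, title_id):
    """Hypothetical piece of business logic: fetch a title from a catalog."""
    return catalog[title_id]
```

Calling `lookup_title` once with a known key and once with a missing key leaves `METRICS["lookup_title"]` showing two calls and one error, so the failure is visible even if the caller swallowed the exception.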

An evolving tech ecosystem

New frameworks, architectures, processes, and a thriving ecosystem of tools have emerged to help us meet those challenges. Some of these are in an embryonic state, but rapid adoption is driving quick maturity. We’ve seen this evolution in compute: it’s only been four years since containers became a mainstream technology, and we are now working with complex application-level abstractions enabled by tools like Kubernetes. A similar evolution is occurring with deployment, serverless, edge-computing technology, security, performance, and system observability.

Finally, no changes can exist in a human and organizational vacuum. We have to develop the leadership skills necessary to build truly cross-functional teams and enable the rapid iteration needed to build these systems. We have to continue the work of the DevOps and SRE communities to break down silos, streamline transitions between teams, and increase development velocity. Teams structured around swiftly delivering high-quality, secure, and performant applications create highly innovative products and organizations.

For practitioners and organizations at the start of the journey or well within it, O’Reilly’s Velocity Conference has a program line-up to help companies navigate these modern complexities. Developers and engineers from companies like Google, Netflix, Microsoft, Amazon, Twitter, Nordstrom, Slack, and Fastly will be talking about how they’ve both failed and succeeded at building, scaling, and securing distributed systems. You’ll get a chance to learn, network, laugh, and share with peers and industry leaders.
