Chapter 1. Web Operations: The Career

Theo Schlossnagle

THE INTERNET IS AN INTERESTING MEDIUM IN WHICH TO WORK. Almost all forms of business are now being conducted on the Internet, and people continue to capitalize on the fact that a global audience is on the other side of the virtual drive-thru window.

The Internet changes so quickly that we rarely have time to cogitate on what we're doing and why we're doing it. When it comes to operating the fabric of an online architecture, things move so fast and change so significantly from quarter to quarter that we struggle to stay in the game, let alone ahead of it. This high-stress, overstimulating environment leads many of us to treat the work as a job, without any concept of a career.

What's the difference, you ask? A career is an occupation taken on for a significant portion of one's life, with opportunities for progress. A job is a paid position of regular employment. In other words, a job is just a job.

Although the Internet has been around for more than a single generation at this point, the Web in its current form is still painfully young and is only now breaking past a single generational marker. So, how can you fill a significant portion of your life with a trade that has existed for only a fraction of the time that one typically works in a lifetime? At this point, to have finished a successful career in web operations, you must have been pursuing this art for longer than it has existed. In the end, it is the pursuit that matters. But make no mistake: pursuing a career in web operations makes you a frontiersman.

Why Does Web Operations Have It Tough?

Web operations has no defined career path; there is no widely accepted standard for progress. Titles vary, responsibilities vary, and title escalation happens on vastly different schedules from organization to organization.

Although the term web operations isn't awful, I really don't like it. The captains, superstars, or heroes in these roles are multidisciplinary experts; they have a deep understanding of networks, routing, switching, firewalls, load balancing, high availability, disaster recovery, Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) services, NOC management, hardware specifications, several different flavors of Unix, several web server technologies, caching technologies, database technologies, storage infrastructure, cryptography, algorithms, trending, and capacity planning. The issue is: how can we expect to find good candidates who are fluent in all of those technologies? In the traditional enterprise, you have architects who are broad and shallow paired with a multidisciplinary team of experts who are focused and deep. However, the expectation remains that your "web operations" engineer be both broad and deep: fix your gigabit switch, optimize your database, and guide the overall infrastructure design to meet scalability requirements.

Web operations is broad; I would argue almost unacceptably broad. A very skilled engineer must know every commonly deployed technology at a considerable depth. The engineer is responsible for operating a given architecture within the described parameters (usually articulated in a service-level agreement, or SLA). The problem is that architecture is, by its very definition, everything: from datacenter space, power, and cooling, up through the application stack, and all the way out to the HTML rendering and JavaScript executing in the browser on the other side of the planet. Big job? Yes. Mind-bogglingly so.

Although I emphatically hope the situation changes, as it stands now there is no education, academic or vocational, that prepares an individual for today's world of operating web infrastructures. Computer science programs or other academic programs that instill strong analytical skills provide a good starting point, but to be a real candidate in the field of web operations you need three things:

A Strong Background in Computing

Because of the broad required understanding of architectural components, it helps tremendously to understand the ins and outs of the computing systems on which all this stuff runs. Processor architectures, memory systems, storage systems, network switching and routing, why Layer 2 protocols work the way they do, HTTP, database concepts...the list could go on for pages. Having the basics down pat is essential in understanding why and how to architect solutions as well as identify brokenness. It is, after all, the foundation on which we build our intelligence. Moreover, an engineering mindset and a basic understanding of the laws of physics can be a great asset.

In a conversation over beers one day, my friend and compatriot in the field of web operations, Jesse Robbins, told a story of troubleshooting a satellite-phone issue. A new sat-phone installation had just been completed, and there was over a second of "unexpected" latency on the line. This was a long time ago, when these things cost a pretty penny, so there was some serious brooding frustration about quality of service. After hours of troubleshooting and a series of escalations, the technician asked: "Just to be clear, this second of latency is in addition to the expected second of latency, right?" A long pause followed. "What expected latency?" asked the client. The technician proceeded to apologize to all the people on the call for their wasted time and then chewed out the client for wasting everyone's time. The expected latency is the amount of time it takes to send the signal to the satellite in outer space and back again. And as much as we might try, we have yet to find a way to increase the speed of light.

Although this story seems silly, I frequently see unfettered, unrealistic expectations. Perhaps most common are cross-continent synchronous replication attempts that defy the laws of physics as we understand them today. We should remain focused on being site reliability engineers who strive to practically apply the basics of computer science and physics that we know. To work well within the theoretical bounds, one must understand what those boundaries are and where they lie. This is why some theoretical knowledge of computer science, physics, electrical engineering, and applied math can be truly indispensable.
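To make those physical bounds concrete, here is a minimal sketch in Python that computes the smallest round-trip time the speed of light allows. The figures marked as assumptions are not from the story above: a geostationary satellite directly overhead, a hypothetical 4,000 km fiber path between two datacenters, and light traveling through fiber at roughly two-thirds of its vacuum speed. Real systems add processing, queuing, and routing delay on top of these floors.

```python
# Lower bounds on latency imposed by the speed of light.
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum, km/s

# Assumptions for illustration (not from the text):
FIBER_VELOCITY_FACTOR = 2 / 3       # light in optical fiber travels at roughly 2/3 c
GEO_ALTITUDE_KM = 35_786            # altitude of a geostationary orbit
CROSS_CONTINENT_KM = 4_000          # hypothetical fiber path between two datacenters


def min_round_trip_ms(path_km: float, velocity_factor: float = 1.0) -> float:
    """Smallest possible round-trip time, in ms, over a one-way path of path_km."""
    one_way_s = path_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way_s * 1000


# One direction of a satellite call is ground -> satellite -> ground.
satellite_rtt_ms = min_round_trip_ms(2 * GEO_ALTITUDE_KM)

# Synchronous replication: each commit waits for at least one round trip.
replication_rtt_ms = min_round_trip_ms(CROSS_CONTINENT_KM, FIBER_VELOCITY_FACTOR)
max_serial_commits_per_s = 1000 / replication_rtt_ms

print(f"satellite round trip   >= {satellite_rtt_ms:.0f} ms")
print(f"replication round trip >= {replication_rtt_ms:.0f} ms")
print(f"serialized commits     <= {max_serial_commits_per_s:.0f} per second")
```

Under these assumptions the satellite floor is roughly 477 ms and the replication floor about 40 ms, which caps strictly serialized synchronous commits at about 25 per second no matter how fast the hardware at either end is.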

Operations is all about understanding where theory and practice collide, and devising methodologies to limit the casualties from the explosions that ensue.

Practiced Decisiveness

Although being indecisive is a disadvantage in any field, in web operations there is a near-zero tolerance for it. Like EMTs and ER doctors, you are thrust into situations on a regular basis where good judgment alone isn't enough—you need good judgment now. Delaying decisions causes prolonged outages. You must train your brain to apply mental processes continually to the inputs you receive, because the "collect, review, propose" approach will leave you holding all the broken pieces.

In computer science, algorithms can be put into two categories: offline and online. An offline algorithm is a solution to a problem in which the entire input set is required before an output can be determined. In contrast, an online algorithm is a solution that can produce output as the inputs are arriving. Of course, because the algorithm produces output (or solutions) without the entire input set, there is no way to guarantee an optimal output. Unlike an offline algorithm, an online algorithm can always ensure that you have an answer on hand.
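The distinction can be sketched with bin packing, a problem chosen here purely for illustration and not taken from the text. The online version places each item the moment it arrives, so a valid answer exists after every input; the offline version waits for the full input, sorts it, and then packs. The function names and the sample data are hypothetical, and first-fit decreasing is itself only a heuristic that happens to be optimal for this input.

```python
def first_fit_online(items, capacity):
    """Online: place each item, as it arrives, into the first bin with room."""
    bins = []
    for size in items:
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([size])
        # At this point `bins` is a valid packing of everything seen so far.
    return bins


def first_fit_decreasing_offline(items, capacity):
    """Offline: needs the entire input up front so it can sort before packing."""
    return first_fit_online(sorted(items, reverse=True), capacity)


arrivals = [4, 4, 4, 6, 6, 6]   # hypothetical item sizes, in arrival order
online = first_fit_online(arrivals, capacity=10)
offline = first_fit_decreasing_offline(arrivals, capacity=10)

print(f"online  used {len(online)} bins: {online}")
print(f"offline used {len(offline)} bins: {offline}")
```

With this input the online packing uses four bins and the offline packing uses three. The online approach pays for its always-available answer with a worse result, which is the trade-off the paragraph above describes.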

Operations decisions must be the product of online algorithms, not offline ones. This isn't to say that offline algorithms have no place in web operations; quite the contrary. One of the most critically important processes in web operations is offline: root-cause analysis (RCA). I'm a huge fan of formalizing the RCA process as much as possible. The thorough offline (postmortem) analysis of failures and their pathologies, together with a review of the decisions made "in flight," is the best possible path to improving the online algorithms you and your team use for critical operations decision making.

A Calm Disposition

A calm and controlled thought process is critical. When it is absent, Keystone Kops syndrome prevails and bad situations are made worse. In crazy action movies, when one guy has a breakdown the other grabs him, shakes him, and tells him to pull himself together—you need to make sure you're on the right side of that situation. On one side, you have a happy, healthy career; on the other, you have a job in which you will shoulder an unhealthy amount of stress and most likely burn out.

Because there is no formal education path, the web operations trade, as it stands today, is an informal apprentice model. As the Internet has caused paradigm shifts in business and social interaction, it has offered a level of availability and ubiquity of information that provides a virtualized master–apprentice model. Unfortunately, as one would expect from the Internet, it varies widely in quality from group to group.

In the field of web operations, the goal is simply to make everything run all the time: a simple definition, an impossible prospect. Perhaps the most challenging aspect of being an engineer in this field is the unrealistic expectations held by peers within the organization.

So, how does one pursue a career with all these obstacles?
