THE INTERNET IS AN INTERESTING MEDIUM IN WHICH TO WORK. Almost all forms of business are now being conducted on the Internet, and people continue to capitalize on the fact that a global audience is on the other side of the virtual drive-thru window.
The Internet changes so quickly that we rarely have time to reflect on what we’re doing and why we’re doing it. When it comes to operating the fabric of an online architecture, things move so fast and change so significantly from quarter to quarter that we struggle to stay in the game, let alone ahead of it. This high-stress, overstimulating environment leads us to treat the work as a job, with no concept of a career.
What’s the difference, you ask? A career is an occupation taken on for a significant portion of one’s life, with opportunities for progress. A job is a paid position of regular employment. In other words, a job is just a job.
Although the Internet has been around for more than a single generation at this point, the Web in its current form is still painfully young and is only now breaking past a single generational marker. So, how can you fill a significant portion of your life with a trade that has existed for only a fraction of the time that one typically works in a lifetime? At this point, to have finished a successful career in web operations, you must have been pursuing this art for longer than it has existed. In the end, it is the pursuit that matters. But make no mistake: pursuing a career in web operations makes you a frontiersman.
Web operations has no defined career path; there is no widely accepted standard for progress. Titles vary, responsibilities vary, and title escalation happens on vastly different schedules from organization to organization.
Although the term web operations isn’t awful, I really don’t like it. The captains, superstars, or heroes in these roles are multidisciplinary experts; they have a deep understanding of networks, routing, switching, firewalls, load balancing, high availability, disaster recovery, Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) services, NOC management, hardware specifications, several different flavors of Unix, several web server technologies, caching technologies, database technologies, storage infrastructure, cryptography, algorithms, trending, and capacity planning. The issue is: how can we expect to find good candidates who are fluent in all of those technologies? In the traditional enterprise, you have architects who are broad and shallow paired with a multidisciplinary team of experts who are focused and deep. However, the expectation remains that your “web operations” engineer be both broad and deep: fix your gigabit switch, optimize your database, and guide the overall infrastructure design to meet scalability requirements.
Although I emphatically hope the situation changes, as it stands now there is no education, academic or vocational, that prepares an individual for today’s world of operating web infrastructures. Computer science programs and other academic programs that instill strong analytical skills provide a good starting point, but to be a real candidate in the field of web operations you need three things: a strong background in computing, practiced decisiveness, and a calm disposition.
Because the role demands a broad understanding of architectural components, it helps tremendously to understand the ins and outs of the computing systems on which all this stuff runs. Processor architectures, memory systems, storage systems, network switching and routing, why Layer 2 protocols work the way they do, HTTP, database concepts...the list could go on for pages. Having the basics down pat is essential in understanding why and how to architect solutions as well as identify brokenness. It is, after all, the foundation on which we build our intelligence. Moreover, an engineering mindset and a basic understanding of the laws of physics can be a great asset.
In a conversation over beers one day, my friend and compatriot in the field of web operations, Jesse Robbins, told a story of troubleshooting a satellite-phone issue. A new sat-phone installation had just been completed, and there was over a second of “unexpected” latency on the line. This was a long time ago, when these things cost a pretty penny, so there was some serious brooding frustration about quality of service. After hours of troubleshooting and a series of escalations, the technician asked: “Just to be clear, this second of latency is in addition to the expected second of latency, right?” A long pause followed. “What expected latency?” asked the client. The technician proceeded to apologize to all the people on the call for their wasted time and then chewed out the client for wasting everyone’s time. The expected latency is the amount of time it takes to send the signal to the satellite in outer space and back again. And as much as we might try, we have yet to find a way to increase the speed of light.
Although this story seems silly, I frequently see unfettered, unrealistic expectations. Perhaps most common are cross-continent synchronous replication attempts that defy the laws of physics as we understand them today. We should remain focused on being site reliability engineers who strive to practically apply the basics of computer science and physics that we know. To work well within the theoretical bounds, one must understand what those boundaries are and where they lie. This is why some theoretical knowledge of computer science, physics, electrical engineering, and applied math can be truly indispensable.
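To put rough numbers on the replication example, here is a minimal sketch in Python. It assumes a hypothetical 4,000 km fiber path and that light travels through fiber at about two-thirds of its vacuum speed. The function names are illustrative only, and a real network adds routing, queuing, and processing delay on top of this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_VELOCITY_FACTOR = 2 / 3      # assumption: typical for optical fiber


def min_round_trip_seconds(distance_km, velocity_factor=FIBER_VELOCITY_FACTOR):
    """Lower bound on the round-trip time over a path of the given length."""
    one_way = distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way


def max_serial_sync_writes_per_second(distance_km):
    """Upper bound on strictly serialized writes, each of which waits
    for a remote acknowledgment before the next one starts."""
    return 1 / min_round_trip_seconds(distance_km)


if __name__ == "__main__":
    distance_km = 4_000  # hypothetical distance between two sites
    rtt_ms = min_round_trip_seconds(distance_km) * 1000
    ceiling = max_serial_sync_writes_per_second(distance_km)
    print(f"minimum round trip: {rtt_ms:.1f} ms")
    print(f"serialized synchronous writes: at most {ceiling:.0f} per second")
```

Under these assumptions the floor is roughly 40 ms per round trip, so a workload that commits one synchronous write at a time cannot exceed about 25 commits per second, no matter how fast the hardware at either end is.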
Operations is all about understanding where theory and practice collide, and devising methodologies to limit the casualties from the explosions that ensue.
Although being indecisive is a disadvantage in any field, in web operations there is a near-zero tolerance for it. Like EMTs and ER doctors, you are thrust into situations on a regular basis where good judgment alone isn’t enough—you need good judgment now. Delaying decisions causes prolonged outages. You must train your brain to apply mental processes continually to the inputs you receive, because the “collect, review, propose” approach will leave you holding all the broken pieces.
In computer science, algorithms can be put into two categories: offline and online. An offline algorithm is a solution to a problem in which the entire input set is required before an output can be determined. In contrast, an online algorithm is a solution that can produce output as the inputs are arriving. Of course, because the algorithm produces output (or solutions) without the entire input set, there is no way to guarantee an optimal output. Unlike an offline algorithm, an online algorithm can always ensure that you have an answer on hand.
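As a minimal sketch of the distinction, consider packing items into fixed-size bins. This is a stand-in problem chosen purely for illustration. The online version must place each item the moment it arrives; the offline version sees the whole input and sorts it first (the first-fit decreasing heuristic). The function names and the sample input are hypothetical.

```python
def place(bins, size, capacity):
    """Put one item into the first bin with room, opening a new bin
    if none fits. Returns the index of the bin that was used."""
    for index, used in enumerate(bins):
        if used + size <= capacity:
            bins[index] += size
            return index
    bins.append(size)
    return len(bins) - 1


def online_pack(stream, capacity):
    """Online: yield a placement for each item as soon as it arrives."""
    bins = []
    for size in stream:
        yield place(bins, size, capacity)


def offline_pack(items, capacity):
    """Offline: needs the entire input so it can sort largest-first
    before placing anything. Returns the number of bins used."""
    bins = []
    for size in sorted(items, reverse=True):
        place(bins, size, capacity)
    return len(bins)


if __name__ == "__main__":
    items = [4, 4, 4, 6, 6, 6]
    decisions = list(online_pack(items, capacity=10))
    print("online placements:", decisions)          # [0, 0, 1, 1, 2, 3]
    print("online bins used:", max(decisions) + 1)  # 4
    print("offline bins used:", offline_pack(items, capacity=10))  # 3
```

The online packer always has an answer for the item in front of it, but on this input it uses four bins where the offline pass needs only three. That is the trade: an answer now versus a better answer later.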
Operations decisions must be the product of online algorithms, not offline ones. This isn’t to say that offline algorithms have no place in web operations; quite the contrary. One of the most critically important processes in web operations is offline: root-cause analysis (RCA). I’m a huge fan of formalizing the RCA process as much as possible. The thorough offline (postmortem) analysis of failures, their pathologies, and a review of the decisions made “in flight” is the best possible path to improving the online algorithms you and your team use for critical operations decision making.
A calm and controlled thought process is critical. When it is absent, Keystone Kops syndrome prevails and bad situations are made worse. In crazy action movies, when one guy has a breakdown the other grabs him, shakes him, and tells him to pull himself together—you need to make sure you’re on the right side of that situation. On one side, you have a happy, healthy career; on the other, you have a job in which you will shoulder an unhealthy amount of stress and most likely burn out.
Because there is no formal education path, the web operations trade, as it stands today, is an informal apprentice model. As the Internet has caused paradigm shifts in business and social interaction, it has offered a level of availability and ubiquity of information that provides a virtualized master–apprentice model. Unfortunately, as one would expect from the Internet, it varies widely in quality from group to group.
In the field of web operations, the goal is simply to make everything run all the time: a simple definition, an impossible prospect. Perhaps the more challenging aspect of being an engineer in this field is the unrealistic expectations held by peers within the organization.
So, how does one pursue a career with all these obstacles?
When you allow yourself to meditate on a question, the answer most often is simple and rather unoriginal. It turns out that being a master web operations engineer is no different from being a master carpenter or a master teacher. The effort to master any given discipline requires four basic pursuits: knowledge, tools, experience, and discipline.
Knowledge is a uniquely simple subject on the Internet. The Internet acts as a very effective knowledge-retention system. The common answer to many questions, “Let me Google that for you,” is an amazingly effective and high-yield answer. Almost everything you want to know (and plenty you have no desire to know) about operating web infrastructure is, you guessed it, on the Web.
Limiting yourself to the Web for information is, well, limiting. You are not alone in this adventure, despite the feeling. You have peers, and they need you as much as you need them. User groups (of a startling variety) exist around the globe and are an excellent place to share knowledge.
If you are reading this, you already understand the value of knowledge through books. A healthy bookshelf is something all master web operations engineers have in common. Try to start a book club in your organization, or if your organization is too small, ask around at a local user group.
One unique aspect of the Internet industry is that almost nothing is secret. In fact, very little is even proprietary and, quite uniquely, almost all specifications are free. How does the Internet work? Switching: there is an IEEE specification for that. IP: there is RFC 791 for that. TCP: RFC 793. HTTP: RFC 2616. They are all there for the reading and provide a much deeper foundational base of understanding. These protocols are the rules by which you provide services, and the better you understand them, the more educated your decisions will be. But don’t stop there! TCP might be described in RFC 793, but all sorts of TCP details and extensions and “evolution” are described in related RFCs such as 1323, 2001, 2018, and 2581. Perhaps it’s also worthwhile to understand where TCP came from: RFC 761.
To revisit the theory and practice conundrum, the RFC for TCP is the theory; the kernel code that implements the TCP stack in each operating system is the practice. The glorious collision of theory and practice lies in the nuances of interoperability (or inter-inoperability) of the different TCP implementations, and the explosions are slow download speeds, hung sessions, and frustrated users.
On your path from apprentice to master, it is your job to retain as much information as possible so that the curiously powerful coil of jello between your ears can sort, filter, and correlate all that trivia into a concise and accurate picture used to power decisions: both the long-term critical decisions of architecture design and the momentary critical decisions of fault remediation.
Tools, in my experience, are one of the most incessantly and emphatically argued topics in computing: vi versus Emacs, Subversion versus Git, Java versus PHP—beginning as arguments from different camps but rapidly evolving into nonsensical religious wars.
The simple truth is that people are successful with these tools despite their pros and cons. Why do people use all these different tools, and why do we keep making more? I think Thomas Carlyle and Benjamin Franklin noted something important about our nature as humans when they said “man is a tool-using animal” and “man is a tool-making animal,” respectively. Because it is in our nature to build and use tools, why must we argue fruitlessly about their merits? Although Thoreau meant something equally poignant, I feel his commentary that “men have become the tools of their tools” is just as accurate in a modern context.
The simple truth is articulated best by Emerson: “All the tools and engines on Earth are only extensions of man’s limbs and senses.” This articulates well the ancient sentiment that a tool does not the master craftsman make. In the context of Internet applications, you can see this in the wide variety of languages, platforms, and technologies that are glued together successfully. It isn’t Java or PHP that makes an architecture successful, it is the engineers that design and implement it—the craftsmen.
One truth about engineering is that knowing your tools, regardless of the tools that are used, is a prerequisite to mastering the trade. Your tools must become extensions of your limbs and senses. It should be quite obvious to engineers and nonengineers alike that reading the documentation for a tool during a crisis is not the best use of one’s time. Knowing your tools goes above and beyond mere competency; you must know the effects they produce and how they interact with your environment—you must be practiced.
A great tool in any operations engineer’s tool chest is a system call tracer. They vary (slightly) from system to system. Solaris has truss, Linux has strace, FreeBSD has ktrace, and Mac OS X had ktrace but replaced it with the less useful dtruss. A system call tracer is a peephole into the interaction between user space and kernel space; in other words, if you aren’t computationally bound, this tool tells you exactly what your application is asking for and how long each request takes to be satisfied.
DTrace is a uniquely positioned tool available on Solaris, OpenSolaris, FreeBSD, Mac OS X, and a few other platforms. This isn’t really a chapter on tools, but DTrace certainly deserves a mention. DTrace is a huge leap forward in system observability and allows the craftsman to understand his system like never before; however, DTrace is an oracle in both its perspicacity and the fact that the quality of its answers is coupled tightly with the quality of the question asked of it. System call tracers, on the other hand, are a proverbial avalanche—easy to induce and challenging to navigate.
Why are we talking about avalanches and oracles? It is an aptly mixed metaphor for the amorphous and heterogeneous architectures that power the Web. Using strace to inspect what your web server is doing can be quite enlightening (and often results in some easily won optimizations the first few times). Looking at the output for the first time when something has gone wrong provides basically no value except to the most skilled engineers; in fact, it can often cost you. The issue is that this is an experiment, and you have no control. When something is “wrong” it would be logical to look at the output from such a tool in an attempt to recognize an unfamiliar pattern. It should be quite clear that if you have failed to use the tool under normal operating conditions, you have no basis for comparison, and all patterns are unfamiliar. In fact, it is often the case that patterns that appear to be correlated to the problem are not, and much time is wasted pursuing red herrings.
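One way to make “basis for comparison” concrete is to keep a tally of system calls from normal operation and compare it with a tally taken during an incident. The sketch below assumes you have already reduced tracer output to per-call counts by some means of your own. The function name, the threshold, and the sample numbers are all hypothetical, and a flagged call is only a lead to investigate, not a diagnosis.

```python
def unfamiliar_syscalls(baseline, observed, ratio=3.0):
    """Return the system calls seen during an incident whose share of
    all calls differs from the baseline share by at least `ratio`, plus
    any call that never appeared in the baseline at all.

    Both arguments map a system call name to a count.
    """
    base_total = sum(baseline.values())
    obs_total = sum(observed.values())
    flagged = []
    for name, count in observed.items():
        base_count = baseline.get(name, 0)
        if base_count == 0:
            flagged.append(name)  # never seen under normal conditions
            continue
        share = count / obs_total
        base_share = base_count / base_total
        if share >= ratio * base_share or base_share >= ratio * share:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical tallies, for illustration only.
    normal = {"read": 500, "write": 400, "futex": 100}
    incident = {"read": 200, "write": 150, "futex": 600, "connect": 50}
    print(unfamiliar_syscalls(normal, incident))
```

Without the `normal` tally there is nothing to compare against, and every line of the incident output looks equally suspicious.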
Defusing the tools argument is important. You should strive to choose a tool based on its appropriateness for the problem at hand rather than to indulge your personal preference. An excellent case in point is the absolutely superb release management of the FreeBSD project over its lifetime using what is now considered by most to be a completely antiquated version control system (CVS). Many successful architectures have been built atop the PHP language, which lacks many of the features of common modern languages. On the flip side, many projects fail even when equipped with the most robust and capable tools. The quality of the tool itself is always far less important than the adroitness with which it is wielded. That being said, a master craftsman should always select an appropriate, high-quality tool for the task at hand.
Experience is one of the most powerful weapons in any situation. It is so important because it means so many things. Experience is, in its very essence, making good judgments, and it is gained by making bad ones. Watching theory and practice collide is both scary and beautiful. The collision inevitably has casualties—lost data, unavailable services, angered users, and lost money—but at the same time its full context and pathology have profound beauty. Assumptions have been challenged (and you have lost) and unexpected outcomes have manifested, and above all else, you have the elusive opportunity to be a pathologist and gain a deeper understanding of a new place in your universe where theory and practice bifurcate.
Experience and knowledge are quite interrelated. Knowledge can be considered the study of the experiences of others. You have the information but have not grasped the deeper meaning that is gained by directly experiencing the causality. That deeper meaning allows you to apply the lesson learned in other situations where your experience-honed insight perceives correlations, an insight that often escapes those with knowledge alone.
Experience is both a noun and a verb: gaining it is as easy (and as hard) as doing it.
Although gaining experience is as easy as simply “doing,” in the case of web operations it is the process of making and surviving bad judgments. The question is: how can an organization that is competing in such an aggressive industry afford to have its staff members make bad judgments? Having and executing on an answer to this question is fundamental to any company that wants to house career-oriented web operations engineers. There are two parts to this answer, a yin and yang if you will.
The first is to make it safe for junior and mid-level engineers to make bad judgments. You accomplish this by limiting liability and injury from individual judgments. The environment (workplace, network, systems, and code) must be able to survive a bad judgment now and again. You never want to be forced into the position of firing an individual because of a single instance of bad judgment (although I realize this cannot be entirely prevented, it is a good goal). The larger the mistake, the more profound the opportunity to extract deep and lasting value from the lesson. This leads us to the second part of the answer.
Never allow the same bad judgment twice. Mistakes happen. Bad judgments will occur as a matter of fact. Not learning from one’s mistakes is inexcusable. Although exceptions always exist, you should expect and promote a culture of zero tolerance for repetitious bad judgment.
One thing that has bothered me for quite some time and continues to bother me is job applications from junior operations engineers for senior positions. Their presumption is that knowledge dictates hierarchical position within a team; just as in other disciplines, this is flat-out wrong. The single biggest characteristic of a senior engineer is consistent and solid good judgment. This obviously requires exposure to situations where judgment is required and is simple math: the rate of difficult situations requiring judgment multiplied by tenure. It is possible to be on a “fast track” by landing an operations position in which disasters strike at every possible moment. It is also possible to spend 10 years in a position with no challenging decisions and, as a result, accumulate no valuable experience.
Generation X (and even more so, Generation Y) are cultures of immediate gratification. I’ve worked with a staggering number of engineers who expect their “career path” to take them to the highest ranks of the engineering group inside five years just because they are smart. This is simply impossible in the staggering numbers I’ve witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then: “super engineer”? What about five years later: “super-duper engineer”? I blame the youth of our discipline for this affliction. The truth is that very few engineers have been in the field of web operations for 15 years. Given the dynamics of our industry, many elected to move on to managerial positions or risk an entrepreneurial run at things.
I have some advice for individuals entering this field with little experience: be patient. However, this adage is typically paradoxical, as your patience very well may run out before you comprehend it.
Discipline, or rather the lack of it, is in my opinion the single biggest disaster in our industry. Web operations has an atrocious track record when it comes to structure, process, and discipline. As a part of my job, I do a lot of assessments. I go into companies and review their organizational structure, operational practices, and overall architecture to identify when and where they will break down as business operations scale up.
Can you guess what I see more often than not? I see lazy cowboys and gunslingers; it’s the Wild, Wild West. Laziness is often touted as a desired quality in a programmer. In the Perl community, where this became part of the mantra, the meaning was tongue-in-cheek (further exemplified by the use of the word hubris in the same mantra). What is meant is that by doing things as correctly and efficiently as possible you end up doing as little work as possible to solve a particular problem—this is actually quite far from laziness. Unfortunately, others in the programming and operations fields have taken actual laziness as a point of pride to which I say, “not in my house.”
Discipline is controlled behavior resulting from training, study, and practice. In my experience, discipline is the ingredient most commonly left out of a web operations team, and its absence results in inconsistency and nonperformance.
Discipline is not something that can be taught via a book; it is something that must be learned through practice. Each task you undertake should be approached from the perspective of a resident. Treating your position and responsibilities as long term and approaching problems to develop solutions that you will be satisfied with five years down the road is a good basis for the practice that results in discipline.
I find it ironic that software engineering (a closely related field) has a rather good track record of discipline. I conjecture that the underlying reason for a lack of discipline within the field of web operations is the lack of a career path itself. Although it may seem like a chicken-and-egg problem, I have overwhelming confidence that we are close to rewarding our field with an understood career path.
It is important for engineers who work in the field now to participate in sculpting what a career in operations looks like. The Web is here to stay, and services thereon are becoming increasingly critical. Web operations “the career” is inevitable. By participating, you can help to ensure that the aspect of your job that seduced you in the first place carries through into your career.
The part that keeps me fascinated is witnessing the awesomeness of continuous and unique collisions between theory and practice. Because we are responsible for “correct operation” of the whole architecture, traditional boundaries are removed in a fashion that allows us to freely explore the complete pathology of failures.
Pursuing a career in web operations places you in a position to be one of the most critical people in your organization’s online pursuits. If you do it well, you stand to make the Web a better place for everyone.