While underlying concepts like managing capacity and security have remained the same, system administration has changed over the last couple of decades. Early administration required in-depth knowledge of services running on individual systems. Books on system administration focused on specific services on those systems, from printing to DNS. The first conference dedicated to system administration, LISA, described large scale as sites with over 100 users.
Now, operations engineers are faced with an ever-growing list of technologies and third-party services to learn about and leverage as they build and administer systems and services that have thousands to millions of users. Software development is moving fast, and sysadmins need to move as quickly to accommodate and deliver value.
I wrote this book for all the experienced system administrators, IT professionals, support engineers, and other operations engineers who are looking for a map to understanding the landscape of contemporary operations tools, technologies, and practices. This book may also be useful to developers, testers, and others who want to level up their operability skills.
In this book, I examine the modernization of system administration and how collaboration, automation, and system evolution change the fundamentals of operations. This book is not a “how-to” reference, as there are many quality reference materials for digging into specific topics. Where relevant, I recommend materials to level up your skills in that area. I provide examples to guide a deeper understanding of the essential concepts that individuals need to understand, evaluate, and execute their work.
The focus is on tools and technologies in broad use currently, but progress is rapid with new tools and technologies coming into use all the time. These new tools may supplant today’s favorite tools with little notice. Don’t worry about learning the wrong tools; study the underlying concepts. Apply these concepts to evaluate and adopt tools as they become available.
At its core, modern system administration is about assessing and regulating risk to the business. It encompasses changes in how sysadmins collaborate with development and testing, deploy and configure services, and scale in production due to increased complexity of infrastructure and data generation.
Collaboration with Development and Testing
The first part of the book focuses on collaboration with development and testing. I’ll cover some of the tools and techniques to improve how you work and communicate about your work.
You can’t be the lone sysadmin known for saying “no” anymore. The work may start with understanding operating systems, but it extends to understanding services on different platforms while working in collaboration with other teams within the organization and potentially external to it. You must adopt tools and practices from across the organization to better perform your job.
You need to be comfortable with using the terminal and graphical interfaces. Just about every tool I’ll cover has some aspect of command line usage. Being able to explore and use the tools helps you understand when problems arise with the automation. When you have to debug the automation, you need to know whether it’s the tool or your use of the tool.
You can’t ignore version control. For years, DORA’s annual State of DevOps report has reported that the use of version control correlates strongly with high IT performance.1 Version control is fundamental to collaboration with other parts of the organization, whether you’re writing code to set up local development and test environments or deploying applications in a consistent and repeatable manner. Version control is also critical for managing your documentation, whether it’s READMEs embedded in a project repository or a separate project that spans content for the organization. Within version control, you also manage tests for the code you write and for the infrastructure you build.
You build and maintain virtual images and containers for use locally as well as within the cloud. All of this requires some understanding of how to read, debug, and in some cases write code in a particular language. Depending on the environment, Ruby, Python, or Go may be in use.
I include some code snippets in various languages, but this book cannot cover everything you need to learn a specific language. While you can (and should) specialize in a specific language, don’t limit yourself to a single language, as languages have different strengths. Early Linux administration focused on bash or Perl scripts. Now individuals may additionally use Go, Python, or Rust. Folks who limit their ability to adopt other languages will hinder their employability as new tools evolve.
Whether you are collaborating on a project with development, or just within your role-specific Operations team, you need to define and build development environments to quickly replicate the work that others have done. You can then make small changes to projects (whether they are infrastructure code, deployment scripts, or database changes) before committing code to version control and having it tested.
The second part of this book covers managing infrastructure. Systems administration practices that work well when managing isolated systems are generally not transferable to cloud environments. Storage and networking are fundamentally different in the cloud, changing how you architect reliable systems and plan to remediate disasters.
For example, network tuning that you might handcraft with ttcp testing between nodes in your data centers is no longer applicable when your cloud provider limits your network capacity. Instead, balance the abilities gained from administering networks in the data center with in-depth knowledge of your cloud provider’s limits to build out reliable systems in the cloud.
In addition to version control, you need to build reusable, versioned artifacts from source. This will include building and configuring a continuous integration and continuous delivery pipeline. Automation of your infrastructure reduces the cost of creating and maintaining environments, reduces the risk of single points of critical knowledge, and simplifies the testing and upgrading of environments.
Scaling Production Readiness
The third part of the book covers the different practices and processes that enable scaling system administration. As a company grows, monitoring and observability, capacity planning, log management and analysis, security and compliance, and on-call and incident management become critical areas for monitoring and managing risk to the organization.
The landscape of user expectations and reporting has changed with services such as Facebook, Twitter, and Yelp providing areas for individuals to report their dissatisfaction. To maintain the trust of your users (and potential users), in addition to improving how you manage and analyze your logs, you need to update security and compliance tools and processes. You also need to establish a robust incident response to issues when you discover them (or worse, when your users find them).
Detailed systems monitoring adds application insights, deeper observability, and tracing. In the past, system administration focused more on system metrics, but as you scale to larger and more complex environments, system metrics are less helpful and in some cases not available. Individual systems are less critical as you focus on the quality of the application and the impact on your users.
Capacity planning goes beyond spreadsheets that examine hardware projections and network bandwidth utilization. With cloud computing, you don’t have the long lead times between analysis of need and delivery of infrastructure. You may not spend time performing traditional tasks such as ordering hardware, and “racking and stacking” of hardware in a data center. Instance availability is near instantaneous, and you don’t need to pay for idle systems anymore.
Whether you run containerized microservices, serverless, or monolithic applications, log management and analysis needs have become more complex. Understanding the matrix of possible events, and how to add context to your testing, debugging, and utilization of services, is critical to the functioning of the business.
The system administrator role is a critical role that encompasses a wide range of ever-evolving skills. Throughout this book, I share the fundamental skills to support architecting robust, highly scalable services. I’ll focus on the tools and technologies to integrate into your work so that you can be a more effective systems administrator.
A Role by any Other Name
Over the last ten years, I have experienced a dissonance about the role “sysadmin”. There is so much confusion about what a sysadmin is. Is a sysadmin an operator? Is a sysadmin the person with root? There has been an explosion in terms and titles as people try to divorce themselves from the past. When someone said to me “I’m not a sysadmin, I’m an infrastructure engineer”, I realized that it’s not just me feeling this.
To keep current with the tides of change within the industry, organizations have taken to retitling their system administration postings to devops engineer or site reliability engineer (SRE). Sometimes this is a change in name only with the original sysadmin roles and responsibilities remaining the same. Other times these new titles encompass an entirely new role with similar responsibilities. Often it’s an amalgamation of old and new positions within operations, testing, and development. Let’s talk a little about the differences in these role titles and set some common context around them.
In 2009 at the O’Reilly Velocity Santa Clara conference, John Allspaw and Paul Hammond co-presented “10+ deploys per day: Dev and Ops Cooperation at Flickr”. When a development team is incentivized to get features delivered to production, and the operations team is incentivized to ensure that the platform is stable, these two teams have competing goals that increase friction. Hammond and Allspaw shared how it was possible to take advantage of small opportunities to work together to create substantial cultural change. The cultural changes helped them to get to 10+ deploys per day.
In attendance for that talk, Andrew Clay Shafer, co-founder of Puppet Labs, tweeted out:
Don’t just say ‘no’, you aren’t respecting other people’s problems… #velocityconf #devops #workingtogether
Andrew Clay Shafer (@littleidea)
Having almost connected with Shafer at an Agile conference over the topic of Agile Operations, Patrick Debois was watching Shafer’s tweets and lamented not being able to attend in person. An idea was planted, and Debois organized the first devopsdays in Ghent. Later Debois wrote “And remember it’s all about putting the fun back in IT”2 in a write-up of that first devopsdays event. Much time has passed since that first event, and devopsdays has grown in locations,3 to over 70 events in 2019, with new events started by local organizers every year.
But what is devops? It’s very much a folk model that gets defined differently depending on the individual, team, or organization. There is something about devops that differentiates practitioners from nonpractitioners, as evidenced by the data-backed scientific analysis performed by Dr. Nicole Forsgren in the DORA Accelerate DevOps Report.4
At its essence, I see devops as a way of thinking and working. It is a framework for sharing stories and developing empathy, enabling people and teams to practice their crafts in effective and lasting ways. It is part of the cultural weave of values, norms, knowledge, technology, tools, and practices that shape how we work and why.5
Many people think about devops as specific tools like Docker or Kubernetes, or practices like continuous deployment and continuous integration. What makes tools and practices “devops” is how they are used, not the tools or practices directly.
Site Reliability Engineering (SRE)
In 2003 at Google, Ben Treynor was tasked with leading a team of software engineers to run a production environment. Treynor described SRE as “what happens when a software engineer is tasked with what used to be called operations.”
Over time, SRE became a term bandied about by different organizations as a way to describe operations folks dedicated to specific business objectives around a product or service, separate from more generalized operations teams and IT.6 In 2016, some Google SREs shared the Google-specific version of SRE, based on the practices, technology, and tools in use within the organization, in the Site Reliability Engineering book.7 In 2018, they followed it up with a companion book, “The Site Reliability Workbook”, to share more examples of putting the principles and practices to work.
So what is SRE? Site Reliability Engineering is an engineering discipline that helps an organization achieve the appropriate levels of reliability in their systems, services, and products.
Let’s break this down into its components starting with reliability. Reliability is literally in the name “Site Reliability Engineer” so it makes sense. But what does it mean? It is defined differently depending on the type of service or product that is being built.
One measure of reliability is often used to the exclusion of any other, and that is availability. Availability describes whether a system or service is available for folks to use. An example of measuring availability would be checking that my website is up and running on a specific port and serving pages.
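As a minimal sketch of that kind of availability measurement, the Python below probes a URL and reports availability as the fraction of probes that succeeded. The function names, the "any 2xx response counts as up" rule, and the three-second timeout are assumptions made for illustration, not a standard definition.

```python
from http.client import HTTPException
from typing import Sequence
from urllib.request import urlopen


def is_serving(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx), refused connections,
        # and timeouts; ValueError covers malformed URLs.
        return False


def availability(probe_results: Sequence[bool]) -> float:
    """Fraction of probes that succeeded; 0.0 when no probes were recorded."""
    if not probe_results:
        return 0.0
    return sum(probe_results) / len(probe_results)
```

In practice you would run a probe like `is_serving` on a schedule, store the results, and feed a window of them to `availability`. A single probe from a single location only says the site was reachable from there at that moment.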
There are other measurements of reliability depending on the system under observation. Examples of other types of reliability include latency, throughput, and durability. Maybe my website is up and running, but it’s not responding in a meaningful amount of time, and I see a drop-off in customers due to the latency.
A third measurement of reliability used with user-facing systems is throughput. Throughput measures how many requests the website can handle.
Durability, for example, might be a more appropriate way to measure reliability for general storage or more specialized storage like Hadoop.
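The latency and throughput measurements described above can be sketched in a few lines of Python. This is an illustrative sketch only: the nearest-rank percentile method, the helper names, and the sample latencies are assumptions, and monitoring systems typically compute these values for you, often with different percentile definitions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile requires at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(request_count: int, window_seconds: float) -> float:
    """Average requests handled per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


# Hypothetical response times in milliseconds for ten requests.
latencies_ms = [120, 80, 95, 400, 110, 105, 90, 2500, 100, 115]
median_ms = percentile(latencies_ms, 50)  # 105
p95_ms = percentile(latencies_ms, 95)     # 2500
```

The gap between the median and the 95th percentile in this made-up sample is the point: an average or median can look healthy while a meaningful share of users wait far longer.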
These measurements are the basis of service level indicators (SLIs), a way of measuring the reliability of a service. From SLIs, we can then establish the goals that we want to reach, or the appropriate levels of reliability. These goals are our service level objectives (SLOs). Every service will have a context-specific value to the effort it takes to reach the next level of reliability. We factor in any dependencies that we have (including network and DNS!) because we can’t have better reliability than what we depend on from different service providers.
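To make the relationship between SLIs, SLOs, and dependencies concrete, here is a small Python sketch. It assumes availability-style SLIs expressed as fractions between 0 and 1, a 30-day window, and dependencies that are all required and fail independently; the function names and the example numbers are hypothetical.

```python
import math
from typing import Sequence


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unreliability the objective tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def serial_availability(availabilities: Sequence[float]) -> float:
    """Best-case availability of a service that needs every dependency to be up.

    Assumes the dependencies fail independently of one another.
    """
    return math.prod(availabilities)


# Hypothetical availabilities for network, DNS, and a storage provider.
dependencies = [0.999, 0.9995, 0.9999]
ceiling = serial_availability(dependencies)
```

Under these assumptions a 99.9% objective over 30 days leaves about 43 minutes of error budget, and the combined ceiling for the three example dependencies is lower than the weakest of them, so a 99.9% objective would not be reachable on top of that stack.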
Being an engineering discipline means that we approach our work from an analytical perspective to design, build, and monitor our solutions while considering the implications to safety, human factors, government regulations, practicality and cost.8
One significant evolution from traditional system administration work was measuring the impact of the work on humans. Work that is repetitive and manual has been described as toil. Google SRE implemented a cap of 50% toil work, redirecting this work to development teams and management, including on-call responsibilities, when the toil exceeded the cap.9
Measuring the quality of work and changing who does the work changes some fundamental dynamics between ops and dev teams. Everyone becomes invested in improving the reliability of the product rather than a single team having to carry the brunt of all the support work of trying to keep a system or service running. SRE teams are empowered to help reduce the overall toil.
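The 50% toil cap described above is simple arithmetic once the hours are tracked. The sketch below is only an illustration: how a team categorizes and records toil is up to the team, and the function names and weekly numbers are hypothetical.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on repetitive, manual work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def exceeds_toil_cap(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil takes up more than the capped share of time."""
    return toil_fraction(toil_hours, total_hours) > cap


# A hypothetical week: 24 of 40 hours went to tickets and manual restarts.
over_cap = exceeds_toil_cap(24, 40)  # True, since 60% is above the 50% cap
```

The value of a check like this is less in the calculation than in the conversation it triggers about who picks up the excess work.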
Resources for Exploring SRE
Learn more about Google SRE from the Site Reliability Engineering and The Site Reliability Workbook books.
How do Devops and SRE Differ?
While devops and SRE arose around the same time, devops is more focused on culture change (that happens to impact technology and tools) while SRE is very focused on changing the mode of Operations in general.
With SRE, there is often an expectation that engineers are also software engineers with operability skills. With devops engineers, there is often an assumption that engineers are strong in at least one modern language and have expertise in continuous integration and deployment.
While devops and SRE have been around for approximately ten years, the role of system administrator (sysadmin) has been around for much longer. Whether you manage one system, hundreds, or thousands, if you have elevated privileges on the system, you are a sysadmin. Many definitions strive to define system administration in terms of the tasks involved or the work the individual does, often because the role is not well defined and takes on an outsized responsibility for everything that no one else wants to do.
Many describe system administration as the digital janitor role. While the janitor role in an organization is absolutely a critical role, it’s a disservice to both roles to equate the two. It minimizes the roles and responsibilities of each.
A sysadmin is someone who is responsible for building, configuring, and maintaining reliable systems where systems can be specific tools, applications, or services. While everyone within the organization should care about uptime, performance, and security, the perspective that the sysadmin takes is focused on these measurements within the constraints of the organization or team’s budget and the specific needs of the tool, application, or service consumer.
I don’t recommend the use of devops engineer as a role. Devops is a cultural movement. This doesn’t stop organizations from using devops to describe a set of tasks and job responsibilities that have eclipsed the role of sysadmin.
I’ve spent a fair amount of time reading job requirement listings, and talking to other folks in the industry about devops engineers. There is no single definition of what a devops engineer does in industry (sometimes not even within the same organization!).
Engineers with “devops” in their title may earn higher salaries than ones with “system administrator”,10 which reinforces the adoption of the title despite the lack of a cohesive set of roles and responsibilities that translate across organizations.
Having said that, “devops engineer” is in use. I will try to provide methods to derive additional context to help individuals understand how to evaluate roles with the title in comparison to their current role.
Finding Your Next Opportunity
One of the reasons you might have picked up this book is that you’ve been in your position for a while, and you’re looking for your next opportunity. How do you identify positions that would be good for your skills, experiences, and desired growth? Across organizations, different roles mean different things, so it’s not as straightforward as just substituting a new title and doing a search. Often it seems the person writing a job posting isn’t doing the job being described, as the postings will occasionally include a mishmash of technology and tools.
A danger to avoid is thinking that somehow there is some inherent hierarchy implied by the different roles even as some folks in industry or even within an organization assume this. Names only have as much power as we give them. While responsibilities are changing and we need to add and update our skills, this isn’t a reflection of individuals or the roles that they have now.
There is a wide range of potential titles. Don’t limit yourself by the role title itself, and don’t limit your search to just “sysadmin” or even “sre” and “devops”. From “IT Operations” to “Cloud Engineer”, the potential roles are diverse.
Before you even examine jobs, think about the skills you have. As a primer, think about which technical stacks you are familiar with. How familiar are you with the various technologies described in this book? Think about where you want to grow. Write all of this down.
As you review job reqs, note the skills that you don’t have but would like to have, and write those down. Compare your skill evaluation with the job requirements and work towards improving those areas. Even if you don’t have experience in these areas, being able to talk clearly during interviews about where you are compared to where you want to be for those skills goes a long way toward showing your pursuit of continuous learning (which is a desirable skill).
Preparing Questions Prior to the Interview
Logan McDonald, a Site Reliability Engineer at BuzzFeed, shares some questions to ask during an interview in the blog post “Questions I ask in SRE interviews”. While she specifically targets the SRE interview, these are helpful questions for any kind of operations position to help qualify the direction and responsibility for the position.
Today, sysadmins can be devops engineers or site reliability engineers or neither. Many SRE skills overlap with sysadmin skills. With years of experience as a sysadmin, it can be frustrating to see a lack of opportunities with the title sysadmin. If examined, the roles advertised as SRE or devops engineer often have very similar skills and expectations of individuals. Identify your strengths, and compare them with job requirements from positions that sound interesting. Map out your path and work on those skills.
Now that you know what to expect in the coming chapters and have some high-level context for how the sysadmin role has changed over the years, let’s dig into it. In the upcoming section, you will start examining tools and technologies to level up core foundational skills and practices that are critical to how you collaborate and work within local environments and communicate about the work that you are doing.
5 Davis and Daniels, Effective DevOps
6 “The Many Shapes of Site Reliability Engineering”: https://medium.com/slalom-engineering/the-many-shapes-of-site-reliability-engineering-468359866517
9 Stephen Thorne, Site Reliability Engineer at Google, “Tenets of SRE”: https://medium.com/@jerub/tenets-of-sre-8af6238ae8a8