Modern Linux Administration by Sam R. Alapati

Chapter 1. Modern Linux System Administration

Linux (and other) system administration has changed tremendously over the past decade and a half, driven by the advent of internet-based web applications, the proliferation of Big Data systems, and the rush to the cloud. A quick perusal of job postings will reveal that organizations are looking for administrators who can handle the seismic changes in IT systems over the past decade. To be successful in this milieu, you need to understand how to work with newer computing paradigms such as cloud-based systems, continuous integration and delivery, microservices, modern web application architectures, big data, virtualization, and containerization.

Old-line systems administration isn’t obsolete by any means, but as organizations keep moving to the public cloud, there’s less need for traditional system administration skills. Administrators, however, should pick up the newer skills.

Not to fret if you’re a bash god – you’ll still find plenty of opportunities to use the venerated shell scripting skills you’ve mastered and love so much. However, “the times they are a-changin’,” and it’s important to move along with the changing times and learn what makes the modern system administrator tick. For example, since many important tools that are in vogue today are scripted with Ruby, it’s a good idea to learn a language such as Ruby.

As our goal is to understand the main concepts and tools involved in modern system administration, this book doesn’t presume to explain the fundamentals of system administration. A basic course in Linux administration, such as those offered by Red Hat, shows how to perform routine sysadmin tasks such as creating and managing users, managing files and directories, working with the Linux kernel, managing storage, and so on. To be a modern administrator, you’ll still need to know that stuff. For new system administrators or users, this book shows you what you need to know after you acquire the basics. For experienced users, this book shows what you need to know to stay relevant in today’s world.

The proliferation of cloud-based computing means that you don’t spend time performing traditional tasks such as ordering hardware and “racking and stacking” that hardware in a data center. Now, you increase your capacity by issuing an API call or by merely clicking a button on the cloud provider’s web page. Instead of spending most of their time managing data centers, sysadmins are increasingly busy writing code and working on software, with the help of tools such as Chef, Puppet, and Terraform.

DevOps is a term that has come increasingly to the forefront in the past few years. DevOps is meant to be a way for operations and development teams to work together to speed up the creation and deployment of software. DevOps can also refer to job titles held by people who work at the intersection of development and operations, often using newer tools that facilitate faster and more seamless migration of code to production. While I don’t explicitly address this book to DevOps professionals, many topics that I discuss in this book are highly relevant to people who work in areas that fall under what we call DevOps.

Regardless of whether your job title has the word “DevOps” in it, as a system administrator you’ll be working ever more closely with developers. Sysadmins have always worked with developers, though legend has it that the relationship was acrimonious, with sysadmins being accused of undue zealotry in maintaining their fiefdoms. That’s mostly DevOps propaganda; the reality is much more nuanced and complex. Today, there’s little chance of such a schism between the two groups: either the two groups swim together, or they both sink.

The main thrust of the book is to discuss and explain the key principles and processes underlying the way modern businesses architect and support robust, highly scalable infrastructures. The focus is on the role the tools play and how to integrate them into your work, so you can be a more effective systems administrator.

The best way to benefit from the book is to absorb the main conceptual principles underlying modern systems administration – any tools I discuss in the book are there mostly to illustrate the concepts. Progress is very rapid in this area, and new techniques and tools come on board all the time; new tools may supplant today’s popular tools on short notice. Focusing on the conceptual side of things will therefore help you by showing how to solve major problems in developing software, managing web sites, scaling, security, performance, and so on.

Before I summarize the concepts and tools of modern system administration, let’s quickly review the drawbacks of traditional system administration.

Problems with Traditional Systems Administration

Traditional systems administration concepts date back several decades, predating the advent of major technological innovations such as the internet, cloud computing, and newer networking models, among many others. While the guts of system administration remain the same, the job requirements, and what management expects from system administrators, have changed over the past few years.

Let’s review the changes in some areas of traditional systems administration to learn why sysadmins ought to change their basic approach to systems administration, and how modern tools and techniques are transforming the very nature of the system administrator’s role.


Monitoring

Newer monitoring tools help you do much more than the older toolsets. However, the big difference between traditional and modern systems monitoring is the critical importance of application monitoring. Whereas in the past the focus was mostly on system metrics, today application or service metrics play an equal or even larger role in ascertaining the health and well-being of systems.

The Image Sprawl problem

A “golden image,” also referred to as a clone or master image, is a template for a virtual machine, server, or other component that you use to deploy infrastructure. A golden image in this context refers to a known set of good configurations.

While using a golden image can speed things up compared to doing everything from scratch, the strategy tends to exacerbate the problem of image sprawl. Of course, there are tradeoffs between golden images and configuration management, but here I want to focus on the issue of image sprawl.

Image sprawl occurs when multiple images are in deployment, usually in different versions. The images become unwieldy and their management chaotic. As the number of images grows, you’ll find yourself performing regular manual changes, which tend to lead to deviation from the gold standard.

Agile Development Methodologies and the System Administrator

Agile Operations is the counterpart to agile development, which is an umbrella term for iterative and incremental software development methodologies. Popular agile methodologies include Scrum, Extreme Programming (XP), and Lean Development. Agile practices include frequent, small code rollouts.

A high frequency of code changes means that the operations teams can’t isolate themselves from the development teams, as in days past. The rigid barriers between the two teams have been gradually coming down due to the high degree of cooperation and interaction between development and operations that agile development methodologies require.

Cloud environments

Systems administration practices that work well for a company’s data center may not always transfer as-is when you move to a cloud environment. Storage and networking are both fundamentally different in the cloud, especially an external one. System administrators are expected to understand how to work with public cloud environments such as AWS, Azure, and Rackspace.

Impact of Big Data

Traditional data warehouses can’t scale beyond a certain point, regardless of how much hardware and processing capacity you throw at them. Increasing amounts of data, especially after web-based environments became common, required a different paradigm, and distributed processing turned out to be the best approach to solving the problems posed by big data.

Hadoop and NoSQL databases are here to stay as platforms for storing and analyzing big data. Administrators must know the architecture as well as the principles behind Hadoop, NoSQL databases, and other tools that are popular in big data environments.

Manual Operations without automation

Traditional systems are for the most part still run with a heavy dose of manual operations. Consequently, change management is slow and there are plenty of ways to make mistakes.

Modern trends in system administration (and application development) include the following:

  • Infrastructure automation (infrastructure as code)

  • Virtualization and containerization

  • Deployment of microservices (rather than the traditional single behemoth applications)

  • Increasing use of NoSQL and caching databases

  • Cloud environments – both external and internal

  • Big data and distributed architectures

  • Continuous deployment and continuous integration

In the following sections, I briefly define and explain the trends in modern system administration. In subsequent chapters, I discuss most of these concepts and the associated tools in detail.

Automated Infrastructure Management

Configuring the environment in which applications run is just as important as configuring the application itself. If you don’t correctly configure a messaging system, for example, the application won’t work properly. Configuring the operating system, the networks, databases, and web application servers is critically important for an application to function optimally.

Ad-hoc configuration changes of course come with perils: a small mistake or a slightly wrong change in configuration can either crash the app, or bring it to its knees – just think of all those Mondays when you were waiting with bated breath as to the fate of the app after making code/configuration changes over the weekend!

To reduce the risk in managing the environment, you should make configuration changes fully automatic. This means that you should be able to reproduce your environment quickly and easily, rather than spend time mucking around with the configuration trying to fix things.

A fully automated process offers the following benefits:

  • It keeps the cost (in terms of effort and delays) of creating new environments low.

  • Automating the infrastructure creation (or rebuilding) process means that when a critical player leaves, things don’t all of a sudden stop working.

  • A fully automated system also helps you easily create test environments on the fly – plus, these development environments will be exact replicas of the latest incarnation of the production environment.

  • You can upgrade to new versions of systems with very little, or even no downtime.

As I explained earlier, all you need to do is specify which users should have access to what, and which software the tool must install. Simply store these definitions in a version control system (VCS). A version control system records the changes that you make to files over time, so you’ll be able to recall a specific version when you need it. Software source code is what’s most commonly versioned, but you can version any type of file, including infrastructure configuration files, from which agents will pull the new configuration and perform the required infrastructure changes.

You gain by not having to do anything manually; plus, since everything flows through the VCS, the changes are already well documented, providing an effective audit trail for the system changes.
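
As a minimal sketch of that workflow (the repository and file names here are only illustrative), tracking a configuration definition in Git might look like this:

$ git init infra-config                  # create a repository for infrastructure definitions
$ cd infra-config
$ cp ~/webserver.yml .                   # a hypothetical configuration definition file
$ git add webserver.yml
$ git commit -m "Add initial web server configuration definition"
$ git log --oneline                      # every change is now recorded and auditable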

Infrastructure as code

Infrastructure as code is the writing and executing of code to define, deploy, and manage your infrastructure. You treat all infrastructure operations as software, including tasks such as setting up servers. Infrastructure as code is about managing everything in the form of code, including entities (servers, databases, the network, logs), application configuration, documentation, and processes (deployment and testing). Automated infrastructure management goes hand in hand with treating infrastructure as code.

The term infrastructure as code is often used synonymously with configuration management. Tools such as Chef help transform code into infrastructure. These types of tools allow you to codify your infrastructure, following software best practices such as storing the code in version control.

In addition, the tools help with setting up development systems. You can create matching infrastructure definitions that let you deploy a configuration on your laptop in a Vagrant + VirtualBox environment with Test Kitchen, and have it match your cloud-deployed infrastructure or something you’ve set up locally in your data center.
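
For example, assuming Test Kitchen is already set up for a Chef cookbook (the platforms and suites would come from your .kitchen.yml), the basic local test cycle is a sketch like this:

$ kitchen list        # show the configured test instances
$ kitchen converge    # boot the VM via Vagrant/VirtualBox and apply the configuration
$ kitchen verify      # run the automated tests against the converged instance
$ kitchen destroy     # tear the instance down when you're done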

Infrastructure as code is a simple concept that lets your infrastructure reap the benefits of automation by making the infrastructure versionable, repeatable, and easily testable. CM tools let you fully automate your infrastructure deployments as well as automatically scale infrastructure, besides handling infrastructure repairs automatically (self-healing capabilities).

Applications typically lock down their configuration and don’t allow untested or ad-hoc changes over time. Why should the administrator do any differently? You must learn to treat environment changes as sacrosanct, and work through a structured and formal build, deploy, and test process the same way developers treat their application code.

While you can’t always build expensive test systems that duplicate fancy production systems, you do need to deploy and configure these test environments the same way as you do the production systems.

Automating infrastructure and application deployment requires more than one simple tool. There’s some overlap among the different types of automation tools, which include the following:

  • Configuration management (CM) tools: CM tools let you specify the state description for servers and ensure that the servers are configured according to your definition, with the right packages installed and all the configuration files correctly created. CM tools install and manage software on servers that already exist. Puppet, Chef, Ansible, and Saltstack are some of the popular CM tools.

  • Deployment tools: Deployment tools generate binaries for the software that an organization creates, and copy the tested artifacts to the target servers and start up the requested services to support the applications. The tools also help you execute commands in parallel on multiple remote servers, using SSH. Crowbar, Razor, Capistrano, Fabric, Jenkins (primarily a CD/CI tool), and Cobbler are examples of deployment tools.

  • Infrastructure (server) provisioning and orchestration tools: Orchestrating deployments usually involves deploying to (usually large numbers of) remote servers, where you need to deploy the infrastructure components in a specific order. Many times, orchestration is state dependent, where the state of one system depends on another system. Popular orchestration tools include Terraform, Amazon Web Services (AWS) CloudFormation, and OpenStack Heat.

  • Server templating tools: Server templating tools work differently from configuration tools. Whereas configuration tools are all about launching servers and configuring them by running specific code on each of the servers, server templating tools are all about images. These tools create images of a server to use in provisioning. The image contains the entire OS, the software, all files, and any other relevant artifacts. For example, you can use a tool such as Packer to create a server image, and use a tool such as Ansible to install that image across a set of servers. You can define virtual machine images as code with Packer and Vagrant, and container images as code with tools such as Docker and CoreOS rkt.

Some of the tools listed here are meant for just one of these purposes; for example, a tool such as Jenkins performs purely integration and deployment related functions. Most tools, though, perform more than one function, and thus there’s overlap among some of the tools. A tool such as Ansible can perform all four of these tasks – configuration management, deployment, orchestration, and provisioning.
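
As a rough illustration of how the provisioning and templating categories look in practice (a sketch only; the template and playbook file names such as web-image.json, site.yml, and inventory are placeholders for whatever your project uses):

$ packer build web-image.json              # bake a server image from a Packer template
$ terraform init                           # download the providers referenced in your .tf files
$ terraform plan                           # preview the infrastructure changes
$ terraform apply                          # provision the servers, networks, and so on
$ ansible-playbook -i inventory site.yml   # configure the provisioned servers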

Automating Redundant Work with Configuration Management Tools

Setting up a new infrastructure or using an existing but unwieldy infrastructure setup isn’t a trivial task for new system administrators. While the laying out of the infrastructure itself is straightforward, it usually involves steps that are inherently error prone. And once you set up an infrastructure with built-in errors, a lot of times you’re forced to live with those errors for lengthy periods of time.

Redundant work and duplication of tasks occupy the scarce time of administrators. Years ago, a single administrator took care of a small number of developers, such as a ten-person team. As tools have evolved, system administrators now support a much larger number of developers in many places. This is a strong reason system administrators need to have a plethora of tools in their toolbelt to maximize their time and effectiveness, and not waste time manually configuring things.

Manual installation and configuration of infrastructure components such as servers and databases isn’t practical in large-scale environments that require you to set up hundreds, or even thousands, of servers and databases.

Configuration management software grew out of the need to eliminate redundant work and duplicated efforts. The configuration tools help automate infrastructure work. Instead of manually installing and configuring applications and servers, you can simply describe what you want to do in a text-based format. For example, to install an Apache Web Server, you use a configuration file with a declarative statement such as the following:

All web servers must have Apache installed.

Yes, as simple as this statement is, that’s all you’d have to specify to ensure that all web servers have the Apache web server installed on them.
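
With a CM tool such as Ansible, for example, the equivalent of that statement can be a single ad-hoc command (a sketch that assumes an inventory group named webservers and Debian-family targets, where the Apache package is called apache2):

$ ansible webservers -b -m apt -a "name=apache2 state=present"

Running the command a second time reports no changes, because the declared state is already satisfied; that idempotence is what separates CM tools from plain scripts.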

Configuration management is the process of using tools and strategies to automate the implementation and enforcement of configuration changes to server environments, as well as to software and documentation. CM enables you to manage artifacts through the design, implementation, testing, baselining, building, release, and maintenance phases. CM tools automate the application of configuration states.

For many years, up until about 10-15 years ago, scripting was the main means of automating administrative tasks, including configuring systems. As infrastructure architectures became ever more complex, the number and complexity of the scripts used to manage these environments also grew, leading to more ways for scripted procedures to fail.

In 1993, the first modern configuration management (CM) system, CFEngine, came into being to provide a way to manage UNIX workstations and servers. In 2005, Puppet was introduced and soon became the leader in the market, until Chef was introduced in 2009. Now you also have other CM tools, including Saltstack and Ansible. Most CM systems today share the following features:

  • Automating the infrastructure (infrastructure as code)

  • Use of a scripting language such as Ruby or Python as the configuration language (while other languages such as Go and Rust are becoming more popular, Ruby and Python are currently the major programming languages in this area)

  • Extensibility, customizability, and the capability to integrate with various other tools (not all CM tools are equally extensible – there’s a tradeoff with the tool’s ease of use)

  • Use of modular, reusable components

  • Use of thick clients and thin servers – the configuration tools perform most of the configuration work on the node being configured, rather than on the server that hosts the tool (nodes drive some of the configuration, and the servers drive the rest)

  • Use of declarative statements to describe the desired state of the systems you’re configuring

Good configuration management helps you:

  • Reproduce your environments (OS versions, patch levels, network configuration, software, and the deployed applications), as well as the configuration of the environment.

  • Easily make incremental changes to any individual environment component and deploy those changes to your environment.

  • Identify any changes made to the environment and track the time when the changes were made, as well as the identity of the persons who made the changes.

Of course, these are all under ideal circumstances – if, for example, folks bypass the CM tools by making unauthorized changes, the CM system obviously won’t be able to track those changes.


When you handle server configuration like software, you can naturally take advantage of a source code management system such as Git or Subversion to track all your infrastructure configuration changes.

IT orchestration goes beyond configuration and enables you to reap the benefits of configuration management at scale. Orchestration includes the following:

  • Configuration

  • Zero downtime rolling updates

  • Hot fixes

Let’s say you’re configuring multiple instances of a web server with a CM tool, each of them with a slightly different configuration. You may also want all the servers to be aware of each other after they boot up. Orchestration does all of this for you. An orchestration framework lets you simultaneously execute configuration management across multiple systems.

With a dedicated orchestration tool such as Ansible, you won’t need additional tools for orchestrating things. While Ansible is an orchestration tool, since it permits the description of the desired state of configuration, you can also consider it a CM tool. Often companies use Ansible along with Puppet or Chef as the CM tool. In this case, Ansible orchestrates the execution of configuration management across multiple systems. Some teams prefer to use just Ansible for both configuration and orchestration of an infrastructure.
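
Here’s a minimal sketch of what such orchestration might look like with Ansible: a playbook that rolls a change across the web servers one host at a time (the group name, package, and inventory file are illustrative):

$ cat > rolling-update.yml <<'EOF'
---
- hosts: webservers
  become: yes
  serial: 1                    # update one server at a time for a rolling, zero-downtime change
  tasks:
    - name: Upgrade Apache
      apt:
        name: apache2
        state: latest
    - name: Restart Apache
      service:
        name: apache2
        state: restarted
EOF
$ ansible-playbook -i inventory rolling-update.yml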

Automatic Server Deployment

With the proliferation of server farms and other large-scale server deployment patterns, you can rule out old-fashioned manual deployments. PXE-based tools such as Red Hat Satellite, for example, help perform faster deployments, but they still aren’t adequate, since you need to perform more work after deploying the operating system to make the server fully functional.

You need some type of deployment tool (different from a pure configuration tool such as Puppet or Chef) to help you with the deployment. For virtual servers, Red Hat’s CloudForms, for example, helps with server deployment, configuration management (with the help of Puppet and Chef), and server lifecycle management.

There are several excellent open source server deployment tools such as the following:

  • Cobbler

  • Razor

  • Crowbar

  • Foreman

Razor, for example, is an automated provisioning tool that can install servers and work with your CM tools. Razor automatically detects new nodes, registers them, and installs and configures an OS on them. Subsequently, it can also hand off the server to a CM tool such as Chef or Puppet.

A tool such as Razor can take you from a bare metal or a virtual machine with nothing on it to a fully configured node, and have it managed by Puppet or Chef.

In Chapter 7, I show how tools such as Razor and Cobbler help with the automatic deployment of servers.

Provisioning: Spinning up virtual environments

It’s quite hard to develop and maintain web applications. The technologies, code, and configuration keep changing all the time, making it hard to keep your configuration consistent on all servers (production, staging, testing, and development). Often you learn that a configuration for a web server or a message queue is wrong only after a painful production screw-up.

Virtualized development environments are a great solution for keeping things straight. You can set up separate virtualized environments for each of your projects. This way, each project has its own customized web and database servers, and you can set up all the dependencies each of the projects needs without jeopardizing other projects.

A virtual environment also makes it possible to develop and test applications on a production-like virtual environment. However, virtual environments aren’t a piece of cake to set up and require system administrator skillsets. Vagrant to the rescue!

Vagrant is a popular open source infrastructure provisioning tool. Vagrant lets you easily create virtualization environments and configure them using simple, plain text files to store the configuration. Vagrant makes your virtual environments portable and shareable with all team members. With Vagrant, you can easily define and create virtual machines that can run on your own system.

Vagrant lets you quickly start up virtual machines according to their definitions in a template file called the Vagrantfile. The Vagrantfile plays the same role for Vagrant as the Chef recipe plays for Chef. The file contains the configuration information for a virtual machine or for a cloud infrastructure.

You can generate Vagrant based virtual machines with Oracle’s VirtualBox or VMware as the providers. Once you’ve configured the Vagrantfile, deploying a new environment is as simple as typing in the following two-word command:

$ vagrant up

A Vagrant box is a sort of VM template: a preinstalled OS instance that you can modify according to the settings you store in the Vagrantfile. Vagrant boxes have an extremely light footprint and therefore present a bare-bones OS whose configuration you can customize with a CM tool such as Puppet or Chef.

The Vagrant configuration file (the Vagrantfile) is a text file written in a Ruby DSL. You can therefore share these configurations with others using a VCS such as Git or Subversion.
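
As a minimal sketch (the box name and IP address are just examples), a Vagrantfile and the commands around it might look like this:

$ cat > Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                 # the base box (template) to build the VM from
  config.vm.network "private_network", ip: "192.168.33.10"
  config.vm.provision "shell", inline: "apt-get update && apt-get install -y apache2"
end
EOF
$ vagrant up         # create and boot the VM, with VirtualBox as the provider
$ vagrant ssh        # log in to the running VM
$ vagrant destroy    # tear the environment down when you're done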

It’s common to hear the phrase “but it works on my server” when things go wrong in production. With Vagrant, your development environments can easily mimic production environments, so you know things will work the same on either environment. You can also modify the configuration for one project without adversely impacting other projects in your environment.

Since Vagrant integrates with various hypervisors and cloud providers, you can use it to provision both an on-premises virtual infrastructure and a full-blown cloud infrastructure.

In Vagrant’s terminology, you use two types of entities - providers and provisioners - to configure a development environment:

  • Providers are the solutions such as VirtualBox and VMware for virtual machines and Amazon AWS for cloud environments.

  • Provisioners are tools that help manage the development environment configurations. Chef and Puppet are two examples of Vagrant provisioners – both tools help you automate application configuration on Vagrant boxes.

Server (hardware) Virtualization

Up until the past few years, all applications ran directly on an operating system, with each server running a single OS. Application developers and vendors had to create applications separately for each OS platform. Hardware virtualization provides a solution to this problem by letting a single server run multiple operating systems or multiple instances of the same OS. This of course lets a single server support multiple virtual machines, each of which appears as a specific operating system, and even as a hardware platform.

Hardware virtualization separates hardware from a single operating system, by allowing multiple OS instances to run simultaneously on the same server. Hardware virtualization simulates physical systems, and is commonly used to increase system density and hike the system utilization factor. Multiple virtual machines can share the resources of a single physical server, thus making fuller use of the resources you’ve paid for.

Types of Hardware Virtualization

You implement hardware virtualization through a hypervisor, which is software that sits on the host server and lets you create and manage virtual machines on a host.

The function of the hypervisor is to virtualize all the underlying host resources such as the processors, RAM, storage and networks and apportion them among the virtual machines running on top of the hypervisor.

The hypervisor is the broker between the host system and the virtualized guest systems. The hypervisor contains a virtual machine manager (VMM) that’s responsible for managing the virtual machines running on the server.

There are two main types of hypervisors, Type 1 and Type 2, based on where the hypervisor runs:

  • Type 1 (bare-metal) hypervisors run directly on the host’s hardware, with no underlying operating system; VMware ESXi, Microsoft Hyper-V, and KVM are common examples.

  • Type 2 (hosted) hypervisors run as an application on top of a conventional operating system; Oracle VirtualBox and VMware Workstation are common examples.

The hypervisor abstracts the underlying physical hardware from the guest operating systems that run on the virtual machines. Each of the applications running on the virtual machines sees only the resources (CPU, storage, memory, and network) provided by its VM instead of the entire resources of the underlying physical server.
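
On a Linux host, you can quickly check whether the CPU supports hardware virtualization and whether the KVM hypervisor modules are loaded (a quick sketch; a non-zero count from the first command means the Intel VT-x or AMD-V extensions are present):

$ egrep -c '(vmx|svm)' /proc/cpuinfo    # count the CPU flags for Intel VT-x or AMD-V
$ lsmod | grep kvm                      # check that the kvm and kvm_intel/kvm_amd modules are loaded
$ sudo virsh list --all                 # list the virtual machines known to libvirt, if it's installed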

Virtualization took a long time to evolve. IBM led the way in the 1970s with systems that let programs use a portion of the system resources. Virtualization entered the mainstream in the early 2000s, when VMs started becoming available on popular x86 servers. Organizations started realizing that most of their servers were underutilized, often using less than 10 percent of their processing capacity on average. On top of this, the power and cooling expenses of running data centers became a preoccupation for many CIOs, and virtualization came to be seen as a great solution both for increasing the usage of existing computing capacity and for reducing data center operating costs.

Benefits of virtualization

By adopting a virtualization strategy, you can:

  • Easily create new environments. All you need to do is make a copy of the server environment and store it as a baseline. You can then create the new environment with minimal effort.

  • Consolidate hardware

  • Standardize the hardware platform

While virtualization does introduce additional complexity into the management of the servers, it has tremendous potential to save money and enhance consolidation in data centers. Studies, for example, show that data centers on average use only 6-12% of the available electricity powering the servers to actually perform computations.

A virtualization layer can measure the utilization of the servers on which it runs. This enables the layer to add additional VMs to a server until a predefined utilization level (for example, 70%) is reached. This is much better than the current estimated utilization rate of non-virtualized servers in most data centers, which hovers around a paltry 4-7%. Server consolidation ratios of 7x (that is, the number of physical servers replaced by one physical server supporting multiple VMs) or even 10x are common.

Containerization – the new Virtualization

Container virtualization is newer than hardware virtualization and uses software called a virtualization container that runs on top of the host OS to provide an execution environment for applications. Containers take quite a different approach from that of regular virtualization, which mostly relies on hypervisors. The goal of container virtualization isn’t to emulate a physical server with the help of a hypervisor.

All containerized applications share a common OS kernel and this reduces the resource usage since you don’t have to run a separate OS for each application that runs on a host.

Virtualization systems such as those supported by VMware let you run a complete OS kernel and OS on top of a virtualization layer known as a hypervisor. Traditional virtualization provides strong isolation among the virtual machines running on a server, with each hosted kernel in its own memory space and with separate entry points into the host’s hardware.

Under traditional virtualization, hypervisors and guest operating systems have a heavy footprint in terms of their resource usage. Since containers execute in the same kernel as the host OS and share most of the host OS, their footprint is much smaller than that of hypervisors and guest operating systems under traditional virtualization. Thus, you can run a lot more containers on an OS when compared to the number of hypervisors and guest operating systems on the same OS.

Containers are being adopted at a fast clip since they are seen as a good solution for the problems involved in using normal operating systems, without the inefficiencies introduced by virtualization. New lightweight operating systems such as CoreOS have been designed from the ground up to support the running of containers.


Containers don’t have one main advantage provided by hardware virtualization such as that offered by VMware: the ability to support disparate operating systems. This means, for example, that you can’t run a Windows application inside a Linux container. Containers today are really limited to the Linux operating system only.

A key benefit of using Docker containers is that the applications are divorced from the underlying hardware and OS – the latter are abstracted away. You can scale an application not only within a private data center, but even across public cloud providers.

Amazon, Google, and Red Hat all offer support for the implementation of Docker containers in a public cloud, with the following services:

  • Amazon EC2 Container Service (Amazon ECS)

  • Google Container Engine

  • Red Hat OpenShift 3

Joyent Triton (SmartOS) and Microsoft Azure (Windows, of course) are ways in which cloud providers on non-Linux operating systems are supporting Docker.

Running multiple versions of an application on the same server often leads to conflicts when different applications require different system libraries or runtime languages. You can use workarounds to handle these conflicts, such as installing both new and old sets of libraries on the same server so different applications can use different libraries. However, this strategy only leads to higher application complexity when it comes to configuration. Dependency management is a lot easier when you simply isolate applications.

Linux containers provide the required isolation between applications, in order to prevent configuration and runtime dependency conflicts. Containers let you combine an application and all of its dependencies into one package, which you can make a versioned artifact.

Most of the benefits offered by containerization stem from two key concepts: Linux control groups and namespaces.


Linux control groups, also called cgroups, instruct the kernel as to how many resources (such as CPU and RAM) it should allocate to the various processes running on a server. Containers appear as processes to the OS, and cgroups try to allocate resources fairly among the containers.
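
To get a feel for the raw mechanism that container runtimes rely on, here is a bare-bones sketch using the cgroups v1 memory controller (it assumes the controller is mounted at /sys/fs/cgroup/memory, as it is on most systemd-based distributions; in practice the container runtime does this bookkeeping for you):

$ sudo mkdir /sys/fs/cgroup/memory/demo                                        # create a new memory cgroup
$ echo 209715200 | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes   # cap it at 200 MB
$ echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs                   # move the current shell into the cgroup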


Namespaces constitute the second important kernel feature that makes containerization possible. Namespaces provide boundaries between applications. There are different namespaces, such as Mount (file systems and mount points), Network (networking), PID (processes), and User (user IDs).

Namespaces provide isolation between applications by providing boundaries for the applications – when an application runs within the boundary of a Linux namespace, it’s deemed to be running inside a container.

The boundaries provided by namespaces let each application have its own network stack and file system. For example, the mount namespace eliminates dependency conflicts between applications by providing isolation at the file system level. Since each application sees a file system independent from the file systems of other applications, it can install conflict-free application dependencies. Similarly, network namespaces keep two applications from conflicting by providing each of these containers its own network stack – each container may bind to the same TCP port number, but you won’t see a conflict when that happens.
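
You can see namespaces at work directly with the unshare utility from util-linux (a sketch; the shell started here lives in its own PID, mount, and network namespaces):

$ sudo unshare --fork --pid --net --mount-proc bash   # start a shell in new PID, mount, and network namespaces
$ ps aux                                              # inside, only this shell and ps are visible
$ ip link                                             # inside, only a loopback interface exists; the host's interfaces are hidden
$ exit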

While you can use tools such as LXC to automate the creation of Linux containers, you are still left with several complications as regards the management of containers. These issues include the need to edit configuration files for controlling resources, and the satisfying of the prerequisites to run a container OS on a host system. Enter Docker.

Docker and Containerization

Docker is an application that offers a standard format in which to create and share containers. Docker didn’t materialize out of thin air: it extends the contributions made by LXC, cgroups, and namespaces to simplify the deployment and use of containers.

Docker lets you automate container creation and also provides an on-disk format that makes it easy to share containers between different hosts. Containers ship with dependencies using this Docker format. In a nutshell, Docker transforms your applications into self-contained portable units.
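
A quick sketch of that workflow, using the public nginx image from Docker Hub:

$ docker pull nginx:latest                     # fetch a prebuilt image from Docker Hub
$ docker run -d --name web -p 8080:80 nginx    # run it as a container, mapping host port 8080 to the container's port 80
$ docker ps                                    # list the running containers
$ curl http://localhost:8080/                  # the containerized web server answers
$ docker stop web && docker rm web             # stop and remove the container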

As mentioned earlier, it was Google that started developing cgroups for Linux and pioneered the use of containers for provisioning infrastructure. In 2008, the LXC project was started, combining the technology behind cgroups, kernel namespaces, and chroot, as a big step forward in the development of containers. However, using LXC to run containers required a lot of expert knowledge and tedious manual configuration.

It was left to Docker to complete the development of containers to the point where companies started adopting them as part of their normal environment, starting around 2012. Docker brought containers from the shadows of IT to the forefront.

Docker did two main things: it extended the LXC technology, and it wrapped that technology in user-friendly ways. In addition, it popularized the analogy of a shipping container, which, along with the catchy name “Docker,” helped it catch on so rapidly. This is how Docker became a practical solution for creating and distributing containers. Docker makes it easy to run containers by providing a user-friendly interface. In addition, Docker helps you avoid reinventing the wheel by providing numerous prebuilt public container images.

The Docker platform consists of the Docker Engine, which creates and runs containers, and Docker Hub, a cloud service that distributes containers. It’s common to deploy infrastructure on Docker containers on a CoreOS platform and use a pipeline tool such as GoCD to deploy it. A key source of Docker’s strength is the huge open source community around it that helps fix bugs and drive enhancements to the core technology.

Docker Container Orchestration and Distributed Schedulers

Managing and deploying a large number of containers isn’t a trivial concern. When you deploy containers on multiple hosts, you not only need to worry about the deployment of the containers, but you also need to concern yourself with the complexities of inter-container communication and the management of container state (running, stopped, failed, etc.). Figuring out where to start up failed servers or applications and determining the right number of containers to run are complex issues.

Containers decouple processes from the servers on which the containers run. While this offers you great freedom in assigning processes to the servers in your network, it also brings in more complexity to your operations, since you now need to:

  • Find where a certain process is running right now

  • Establish network connections among the containers and assign storage to these containers

  • Identify process failures or resource exhaustion on the containers

  • Determine where the newly started container processes should run

Container orchestration is the attempt to make container scheduling and management more manageable. Distributed schedulers are how you manage the complexity involved in running Docker at scale. You simply define a set of policies about how the applications should run and let the scheduler figure out where and how many instances of the app should run. If a server or app fails, the scheduler takes care of restarting them. All this means that your network effectively becomes a single host, due to the automatic starting and restarting of the apps and the servers by the distributed scheduler.

The bottom line is to run the application somewhere without concerning yourself with the details of how to get it running there. Zero-downtime deployments are possible by launching new application versions alongside the current version, and by gradually directing work to the new version.

There are several container orchestration and distributed scheduling tools available, as the following sections explain.


Fleet

Fleet from CoreOS is one of the earliest distributed scheduler tools. Fleet works with the systemd daemon on the servers in order to act as a distributed init system. Thus, it’s best used in operating systems with systemd, such as CoreOS and others. In addition, you need etcd for coordination services.


Kubernetes

Kubernetes, an open source tool initiated by Google, is fast becoming the leading container orchestration tool. Kubernetes helps manage containers across a set of servers. Kubernetes uses the concept of a pod, which represents a group of containers running on a host that share both the network and the storage system.

You can run multiple pods on a single server or across a cluster for high availability and efficient resource utilization. Unlike Fleet, Kubernetes demands far fewer requirements from the OS and therefore you can use it across more types of operating systems than Fleet.

Kubernetes contains features that help with deploying applications, scheduling, updating, scaling, and maintenance. You can define the desired state of your applications and use Kubernetes’ powerful “auto” features such as auto-placement, auto-restart, and auto-replication to maintain that state.
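
As a small sketch of that desired-state model (the exact kubectl syntax varies a little across Kubernetes versions, and the deployment name here is arbitrary):

$ kubectl create deployment web --image=nginx   # declare a deployment running the nginx image
$ kubectl scale deployment web --replicas=3     # desired state: three replicas
$ kubectl get pods -o wide                      # Kubernetes schedules the pods across the cluster
$ kubectl delete pod <one-of-the-pods>          # kill a pod, and the deployment replaces it automatically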

Kubernetes requires a separate network layer. You use a tool such as Flannel to provide this network overlay. Flannel provides an IP (Internet Protocol) over UDP (User Datagram Protocol) layer that overlays your actual network. Flannel requires etcd, a distributed key-value store, in order to provide its cluster-wide container networking.

In addition to supporting Docker daemons managed by Kubernetes, Flannel also works with other backends such as Rackspace, Azure and the Google Compute Engine.

CoreOS is a bare-bones operating system expressly built to deploy Kubernetes and etcd to manage containers. When you use a CoreOS cluster with Kubernetes, besides etcd and Flannel, you’d also need to use Fleet, which helps deploy applications on a CoreOS cluster.

I know that your head is probably spinning at this point, what with all my namedropping about the various tools. Not to worry - I explain how it all works in detail in Chapter 5, which deals with managing containers with Docker and Kubernetes.


Docker Swarm

Docker provides a native clustering tool named Swarm, which lets you deploy containers across a large pool of Docker hosts. Swarm presents a collection of Docker hosts as a single resource. While Swarm is lightweight and has fewer capabilities than either Kubernetes or Mesos, it’s adequate for many purposes. You can use a single Docker Swarm container to create and coordinate container deployment across a large Docker cluster.

Docker Swarm lets you deploy Docker containers across a pool of Docker hosts. Alternately, you can use the Centurion or Helios tools to make life easy when deploying across multiple hosts.
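
With Docker’s newer built-in swarm mode, the basic flow looks roughly like this (a sketch; the older standalone Swarm tool described above used a separate container and slightly different commands):

$ docker swarm init                                             # turn this Docker host into a swarm manager
$ docker swarm join-token worker                                # print the command other hosts run to join the pool
$ docker service create --name web --replicas 3 -p 80:80 nginx  # schedule three container replicas across the swarm
$ docker service ls                                             # view the services and their replica counts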

Cluster Management and Cluster Operating Systems

Distributed services use clusters of servers for both redundancy and scaling purposes. Clusters provide many benefits but also bring their own unique problems. Chief among these is the efficient allocation and scheduling of resources to the various services running in the cluster. Cluster operating systems are the answer to this problem, with Apache Mesos being the most well-known of these types of systems. Here is a short list of the most important cluster operating systems:

  • Mesos and Marathon: Mesos is a kernel/operating system for distributed clusters. It supports both .war files and Docker containers. Marathon is a scheduler that helps schedule and run jobs in a Mesos cluster. It also creates a private Platform-as-a-Service on top of the Mesos framework.

  • Fleet is for Docker containers

  • YARN (Yet Another Resource Negotiator) is the processing component of Apache Hadoop and together with the storage component of Hadoop - HDFS (Hadoop Distributed File System), forms the foundation of Hadoop 2.

Any cluster resource manager in a cluster should be able to support the goals of isolation, scalability, robustness, and extensibility.

Apache Mesos, which became a top-level Apache project in 2013, is becoming increasingly popular as a cluster manager that helps improve resource allocation by enabling the dynamic sharing of a cluster’s resources among different frameworks. Twitter and Airbnb are among the production users of Mesos.

Some people consider Mesos the gold standard for clustered containers, and well-known companies use it to support their large-scale deployments. You can also use Mesos for managing Docker environments. Mesos is much more powerful than Kubernetes, but it also requires you to make more decisions to implement it.

Mesos is a framework abstraction and lets you run different frameworks such as Hadoop as well as Docker applications (there are projects in place to let even Kubernetes run as a Mesos framework) on top of the same cluster of servers. Mesosphere’s Marathon framework and the Apache Aurora project are the two frequently used Mesos frameworks that support Docker well.

Apache Mesos is a mature technology used by well-known companies to support their large-scale deployments. Kubernetes and Apache Mesos provide similar features as what the public cloud vendors offer with their proprietary technology.

Mesos does for the data center what the normal operating system kernel does for a single server. It provides a unified view of, and easy access to, all the cluster’s resources. You can use Mesos as the centerpiece of your data center applications, and make use of its scalable two-level scheduler, which avoids the typical problems you experience with a monolithic scheduler. Chapter 10 discusses Mesos in detail.

Distributed operations and remote command execution tools

Distributed jobs run across multiple servers. System administrators, of course, use tools to monitor sets of servers, and use configuration tools to perform server configuration updates and related tasks. You can also use a CI (continuous integration) tool such as Jenkins to manage some general infrastructure management tasks. But the best way to run both ad-hoc and scheduled jobs across multiple servers is to use a remote command execution tool. All of the following can help you execute commands across multiple hosts:

  • Fabric

  • Capistrano

  • MCollective

MCollective is a great tool for parallel job execution that you can use to orchestrate changes across a set of servers in near real-time. MCollective works well with CM tools such as Puppet and Chef.

While CM tools are used to ensure complete consistency of configuration, a tool like MCollective orchestrates specific actions across systems much faster than CM tools.

There are simple parallel execution tools based on the Parallel Distributed Shell (pdsh), an open source parallel remote command execution utility, or you can even script one yourself. However, these tools are primitive because they loop through the systems in order, which leads to time drift issues (a time drift issue crops up when a server isn’t syncing to an NTP server). They also can’t deal with deviations in responses, and they make it hard to track fatal error messages. Finally, you can’t integrate these tools with your CM tools.
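
For reference, this is what that kind of ad-hoc parallel execution looks like with pdsh itself (assuming password-less SSH access to the target hosts; the host names are illustrative):

$ pdsh -w web[01-20] 'uptime'                  # run uptime on web01 through web20 in parallel
$ pdsh -w web[01-20] 'yum -y update openssl'   # push the same command to every host
$ pdsh -w web01,web02,db01 'df -h /var'        # or list the hosts explicitly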

A tool such as MCollective overcomes the drawbacks of the plain vanilla parallel execution tools and allows you to execute commands in parallel on thousands of servers belonging to different platforms. Chapter 7 shows how MCollective and Capistrano work.

Version Control Systems

Version control is the recording of changes to files so you can recall specific versions of the files at a later time. Although version control systems are mostly used by developers (and increasingly by system administrators), source code isn’t the only thing you can version – the strategy of keeping multiple versions of the same document in case you’ll need them later is applicable to any type of document.

Version control systems (VCSs) are nothing new – most of us are quite familiar with these systems, variously referred to as source control systems, revision control systems, or source code management systems. However, early version control systems such as CVS (Concurrent Versions System) have been supplanted over the past several years by far more powerful open source tools such as Git, Mercurial, and Subversion.

Version control systems offer several benefits:

  • Take a project (or some of its files) back to a previous state

  • Find out which user made the modifications that are giving your team a headache

  • Easily recover from mistakes and the loss of files

NOTE: Key practices such as continuous integration and automated deployments depend critically on the usage of a central distributed version control repository.

Application development teams have been using version control systems for several years now to track and maintain application code. However, it’s important to version your infrastructure, configurations, and databases, along with your application code. You must script all your source artifacts.

The most commonly used version control tools today are the following:

  • Git

  • Mercurial

  • SVN (subversion)

Git, Mercurial, and Subversion are all open source, whereas a tool such as Perforce is a paid option. There’s also a hosted Git option, through GitHub. Of the three tools, Git has become enormously popular, for several reasons.

When you store your configuration definitions in a version control system, you can also set things up so that any configuration changes you commit automatically trigger actions to test and validate the changes. This is how continuous integration is triggered for infrastructure changes, so the changes are automatically tested and validated. You therefore integrate a CI/CD tool such as Jenkins or TeamCity with your VCS. Since configuration definitions are inherently modular, you can run implementation actions on only those parts of the infrastructure code that you’ve changed.

It’s well known that you place source code under version control. What’s not so well understood is that in addition, the following should also be placed under version control:

  • Scripts: these include database scripts, build scripts, deployment scripts, etc.

  • Documentation: includes the requirements specifications for analysts

  • Configuration: libraries and configuration files for the applications

  • Testing information: Test scripts and test procedures

  • Development information: You must store all development environment related information, so all users can build and work with identical development environments

  • Whatever you need to recreate environments: You must also store everything that you need to recreate the application in production (as well as testing) environments for your applications.


You need to store everything you need to recreate both your environment and your applications in a version control system.

A good version control system should enable you to recover the exact system at any point in time. This means that you must version control the application software stack and the OS configuration, as well as the configuration of the rest of the infrastructure, such as firewall and DNS configuration.

As a system administrator, you need to ensure that you follow a similar strategy to the developers’ in terms of storing source code and configuration data. Just piggyback on the source control systems and store binary images of the application servers, base operating systems, virtual machines, and other such things under version control. You not only control image sprawl, but also get new environments up and running really fast. You can store complete environments as virtual images if you wish, for effortless (well, relatively speaking!) deployments.

Centralized Version Control Systems

A centralized version control system uses a single server to store all the versioned files. Clients check out files from this central repository. This was the standard way of doing version control for many years. Tools such as CVS, Subversion, and Perforce are all centralized version control systems. Centralized systems offer a big improvement over storing versions in local databases in multiple locations. They help you share the artifacts, and also help you control which users are allowed to perform actions on the artifacts stored in the central repository.

Distributed Version Control Systems

In a centralized version control system, only the central server maintains a complete copy of the code repository. The popular version control system Git, by contrast, is a distributed version control system, meaning that you make a local copy of the central repository instead of merely checking out individual files from it. You synchronize local commits to the central repository, and users will always pull the latest source version from it. Every checkout from the source amounts to a full backup of all the data. If a server dies, you can copy any of the repositories on the clients back to the server in order to restore it.

Distributed version control systems (DVCSs) make it easy for developers and teams to branch and merge work streams. A DVCS contains the complete history of a project. DVCSs make it easy to work offline: you commit your changes locally and push the changes to other team members when you’re ready. In a centralized VCS, you don’t need to push your changes from a local repository, but a DVCS requires you to do so. In addition, a DVCS requires that you reconcile updates from other repositories with your own local repository before you can push your changes.
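
A typical day-to-day DVCS cycle with Git looks something like this (the repository URL and branch name are illustrative):

$ git clone https://github.com/example/infra-config.git    # get a full local copy of the repository, history included
$ cd infra-config
$ git checkout -b fix-dns-config                            # branch locally for your change
$ git commit -am "Correct the DNS resolver configuration"   # commit locally, even while offline
$ git pull --rebase origin master                           # reconcile with changes other people have pushed
$ git push origin fix-dns-config                            # share the branch with the rest of the team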

You can select one of the many free online Git repository hosts, such as Bitbucket or GitHub, as the central location or repository for storing your code. All repositories use similar methods to obtain and store code.

GitHub serves as a great example of how a DVCS can accelerate the pace of development. Unlike in a traditional centralized repository, users can fork a repository and make their changes before asking the original repository owner(s) to pull those changes. In a traditional source code repository, the committers need to accept or reject changes from contributors, and sometimes it takes several weeks or even months for people to take these actions, slowing down the development efforts. A DVCS makes collaboration fast and fruitful, thus encouraging contributors to pursue their changes with much greater enthusiasm.

Chapter 7 discusses version control systems, including Git and GitHub.

Continuous Integration and Continuous Deployment

Continuous integration is the broad name given to the automating of the software build and deployment process.


The primary purpose behind a CI or CD system is to catch bugs when they’re young, that is, fairly early in the development and testing process.

There’s often some confusion between the similar sounding terms Continuous Integration (CI), and Continuous Delivery (CD). Here’s how the two terms differ:

  • Continuous integration, also sometimes referred to as continuous staging, involves the continuous building and acceptance testing of new builds. The new builds aren’t automatically deployed into production; they’re introduced into production manually, after approval through the acceptance testing process.

  • Continuous delivery is where new builds are automatically pushed into production after acceptance testing.

In general, organizations use CI when they first start out along the road of continuous testing and deployment. They need to develop an extremely high degree of confidence in their build strategies and pipelines before moving to a CD approach. In any case, since both CI and CD really use the same set of processes and tools, you often are OK regardless of what you call the deployment approach (CI or CD). CI is often a prerequisite for CD.


Sometimes there’s confusion between the terms Continuous Delivery and Continuous Deployment The first terms refers to the delivery of tested software and the second, the software’s deployment into a production environment.

A CI tool such as Jenkins or Travis CI lets you create efficient build pipelines that consist of the build steps and their associated unit tests. In order to be promoted to the staging phase, the source code, once built into a binary package, must pass all the unit tests you define in the test pipeline. The successfully compiled, error-free output is called a build artifact and is stored in a location that you specify.

There are two basic ways in which a CI system can find out about committed changes:

  • The CI system polls the VCS for changes

  • The VCS triggers the CI system to generate a new build after each commit of the source code, typically through a webhook (see the sketch following this list)
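As a rough illustration of the second, trigger-based approach, here's a minimal sketch of a webhook receiver written in Python with Flask. The /webhook route and the build.sh script are hypothetical placeholders; a real CI server such as Jenkins ships its own webhook endpoints:

from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def on_push():
    # The VCS POSTs a JSON payload describing the commit that was pushed.
    payload = request.get_json(silent=True) or {}
    print("received push for", payload.get("ref"))
    # Kick off the build pipeline; "./build.sh" stands in for your build steps.
    subprocess.Popen(["./build.sh"])
    return "build triggered", 202

if __name__ == "__main__":
    app.run(port=8080)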

Automating the acceptance testing process doesn't mean that you can go ahead and disband your QA team! Regardless of how sophisticated your CI tools are, there's always going to be a need for manual testing at some point or another; the tools we have can't really test everything. Especially when it comes to ascertaining whether your software satisfies the user requirements, manual verification by the QA team is quite important.

Continuous Application Deployment

Automating deployment speeds up the flow of software deployment. Many organizations have automated their software deployments in a way that enables them to introduce software changes several times every day.

Automated deployment reduces the software cycle times by removing the human component from deployments. The end result is fast and frequent, high quality deployments. Anyone with the appropriate permissions can deploy software as well as its environment by simply clicking a button.

It’s quite common for project members to delay acceptance testing until the end of the development process. Developers may be making frequent changes, checking them in, and running automated unit tests. However, in most cases there’s no end-to-end testing of the application until the very end of the project.

Unit tests are of limited value when they don’t run in a production-like environment; to be fully useful, you must combine unit tests with other testing, such as integration testing. Too often, project managers instead schedule an elaborate and time-consuming integration testing phase after all development concludes: developers merge their branches and try to get the application working so the testers can put the app through the paces of acceptance testing.

What if you could spend just a few minutes after adding new changes to see whether the entire app still works? Continuous integration (CI) makes this possible. Every time one of the developers commits a change, the complete application is built and a suite of automated tests is run against the updated application. If the change broke the app, the development team needs to fix it right away. The end result is that at any given time, the entire application functions as it’s designed to.

A continuous integration tool such as Jenkins integrates with a VCS such as Git, which lets code be automatically submitted to Jenkins for compilation and testing as soon as developers commit new code to the VCS. Later, the developers and others can check the stages of the automated build and test cycles using the job history and console output that Jenkins provides.

Continuous integration results in faster delivery of software, because your app is in a known, functioning state at all times. When a committed change breaks the app, you know immediately and can fix it immediately as well, without having to wait for a lengthy integration testing phase at the very end of development. It’s always cheaper in terms of effort and time to catch and fix bugs at an early stage.

Tools for Implementing Continuous Integration

A CI tool regularly polls the source control system and, if it detects any changes, checks out the files, builds the code, and runs it against a sandbox environment. It publishes the results on a dashboard so you can view them, and it can also send out email notifications.
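Here's a minimal sketch, in Python, of what the polling half of that loop might look like. The repository URL and the build_and_test.sh script are placeholders; a real CI server layers checkouts, sandboxing, dashboards, and notifications on top of this:

import subprocess, time

def remote_head(repo_url):
    # "git ls-remote <url> HEAD" prints the SHA of the remote HEAD commit.
    out = subprocess.check_output(["git", "ls-remote", repo_url, "HEAD"], text=True)
    return out.split()[0]

REPO = "https://example.com/team/app.git"   # placeholder repository URL
last_seen = None
while True:
    sha = remote_head(REPO)
    if sha != last_seen:                     # a new commit has appeared
        last_seen = sha
        subprocess.run(["./build_and_test.sh", sha])   # placeholder build step
    time.sleep(60)                           # poll once a minute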

Although CI is more a set of practices than a specific tool, tools do play a critical role in implementing CI, and there are several excellent open source tools such as the following:

  • CruiseControl: A simple, easy-to-use tool that enables you to define custom build processes that run continuously at a specified frequency, such as every minute, hour, or day.

  • Jenkins: The most popular CI tool out there today with a huge community of users and numerous plugins for anything you may want to do with respect to CI/CD.

In addition, you can check out the following commercial CI servers, some of which have free versions for small teams.

  • TeamCity (JetBrains): contains numerous out-of-the-box features to let you get started easily with CI

  • Go (ThoughtWorks): an open source tool that descends from one of the earliest CI servers, CruiseControl. Delivery pipelines are Go’s strong suit, and the tool excels at visualizing and configuring them.

  • Bamboo (Atlassian): Bamboo is a CI server that you can use to automate release management for applications, creating a continuous delivery (CD) pipeline.

In Chapter 8, which deals with continuous integration, I explain the concepts behind the popular CI tools Jenkins and Hudson.

Applications use both unit testing and acceptance testing on a regular basis to test software changes. Unit tests check the behavior of small portions of the application and don’t exercise the whole system, such as the database or the network. Acceptance tests, on the other hand, are serious business: they test the application to verify its full functionality as well as its availability and capacity requirements.
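As a tiny illustration of the first kind of test, here's a self-contained unit test written with Python's standard unittest module; the apply_discount function is a made-up stand-in for a small, isolated piece of application logic:

import unittest

def apply_discount(price, percent):
    # A small piece of application logic, tested in isolation.
    return round(price * (1 - percent / 100), 2)

class DiscountTests(unittest.TestCase):
    def test_ten_percent_discount(self):
        self.assertEqual(apply_discount(100.0, 10), 90.0)

    def test_zero_discount_leaves_price_unchanged(self):
        self.assertEqual(apply_discount(42.5, 0), 42.5)

if __name__ == "__main__":
    unittest.main()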

How did companies perform integration before they started adopting the new CI tools? Well, most teams used a nightly build mechanism to compile and integrate the code base every day. Over time, automated testing was added to the nightly build processes. Further along the road came rolling builds, where builds run continuously instead of as a scheduled nightly batch process. Continuous builds are just what they sound like: as each build completes, the next build starts, using the application version stored in the VCS.

Automatic Build Tools

CI and CD tools such as Jenkins and TeamCity integrate with various build tools to automate their processes. Let’s take a quick look at the popular build tools that are used in CI/CD.

  • Apache Ant: Ant is a build tool that’s somewhat similar to make, the classic build tool on Linux and UNIX systems.

  • Apache Maven: Maven is a popular build, deployment, and dependency management tool that emphasizes convention (the use of default settings) over configuration. Maven eliminates the drudgery of writing custom build scripts that do the same things for every project, such as compiling code, running tests, and packaging.

  • Gradle: Gradle is a build and deployment automation tool designed for use with Java based projects. It uses a Domain Specific Language (DSL) based on Groovy for writing build scripts.

Note that Ant, Maven and Gradle are all primarily tools for Java based projects. Chapter 8 shows how build tools such as Ant and Maven are used for automating development.

Log Management, Monitoring and Metrics

Traditional log management has involved scripting things to parse voluminous server logs to divine the root cause of system failures or performance issues.

However, the advent of the web means that you now have huge amounts of logs, which requires that you find ways to analyze them and glean useful information. Companies have lots of critical business information trapped in their web logs (such as who had access to what and when, for compliance and auditing purposes, or for analyzing user experience), but many organizations aren’t ready to mine these humongous troves of data.

Logs today mean much more than old-fashioned server and platform logs. Log management in the contemporary sense mostly refers to the management of logs of user actions on a company’s websites, including clickstreams (the path a web visitor takes through a website is called a clickstream; clickstream analysis is the collection, analysis, and reporting of aggregate data about the pages a website user visits). These logs yield significant insights into user behavior, especially if you can correlate them. Thus, it’s the business managers and product owners, rather than the system administrators, who are really the consumers of what the log management tools produce.


The term logs is defined quite broadly in the context of log management, and is not limited to server and web server logs. From the viewpoint of Logstash (an open source tool for collecting, parsing, and storing logs), for example, any data with a timestamp is a log.

A distributed system can produce humongous amounts of logs. You definitely need to use tools for collecting the logs as well as for visualizing the data.

Log management tools help you manage and profit from logs. Although tools such as Logstash were originally meant to aggregate, index, and search server log files, organizations increasingly use them for more powerful purposes. Logstash combined with Elasticsearch and Kibana (ELK, though note that the stack has been formally renamed the Elastic Stack by Elastic, the company that drives these tools) is a very popular open source solution for log analysis. Here’s what the three tools in the ELK stack do:

  • Elasticsearch: a real-time distributed search and analytics engine optimized for indexing, based on Lucene, an open source search library known for its powerful search capabilities

  • Logstash: a log collection, parsing, and storage framework used for shipping and cleaning logs

  • Kibana: a dashboard for viewing and analyzing log data
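To give a feel for how the pieces fit together, here's a minimal sketch that indexes a log event into Elasticsearch and searches it back, using Python's requests library against Elasticsearch's REST API. The node address, index name, and log fields are placeholders; in a full ELK deployment, Logstash would do the shipping and Kibana the viewing:

import datetime
import requests

ES = "http://localhost:9200"          # assumes a local Elasticsearch node

# Index a log event; any timestamped document counts as a "log" here.
event = {
    "@timestamp": datetime.datetime.utcnow().isoformat(),
    "host": "web01",
    "message": "GET /checkout 500",
}
requests.post(f"{ES}/logs/_doc?refresh=true", json=event).raise_for_status()

# Search the index for error responses using the _search endpoint.
query = {"query": {"match": {"message": "500"}}}
hits = requests.post(f"{ES}/logs/_search", json=query).json()["hits"]["hits"]
for hit in hits:
    print(hit["_source"]["host"], hit["_source"]["message"])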

Proactive Monitoring

You need two entirely different types of monitoring for your systems. The first type is more or less an extension of traditional server monitoring. Here, tools such as Nagios (a local installation) and New Relic Server (a hosted monitoring solution) give you visibility into key areas bearing on system performance, such as CPU and memory usage, so you can fix problems when they rear their ugly heads.

The second and more critical type of monitoring, especially for organizations running complex web applications, is application performance monitoring (APM). Tools such as New Relic APM enable code-level identification and remediation of performance issues.

The goal in both types of monitoring is to let everybody view the available application and server performance data – so they can make better decisions to improve performance or resolve issues.

System administrators regularly track system metrics such as OS and web server performance statistics. However, service metrics are very important too. Among other things, service metrics reveal how your customers are using your services, and which areas of a service can benefit from enhancements. Apache Hadoop has numerous built-in counters that help you understand how efficient your MapReduce code is. Similarly, Coda Hale’s Metrics library (a Java library that provides lightweight instrumentation for applications) provides counters, timers, and gauges for JVM-based applications. It also lets you send the metrics to Graphite or another aggregation and reporting system.
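For example, here's a minimal sketch of shipping a service metric to Graphite over its plaintext protocol, which accepts lines of the form "metric.path value unix-timestamp" on port 2003; the metric name and value are hypothetical:

import socket
import time

def send_metric(path, value, host="localhost", port=2003):
    # Graphite's plaintext protocol: "<metric.path> <value> <timestamp>\n"
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical service metric: orders processed in the last interval.
send_metric("shop.orders.processed", 42)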

Cloud Computing

In the past few years, especially over the past 5-6 years, there has been a growing outsourcing of hardware and platform services, through the adoption of cloud computing. The reasons for the headlong rush of companies to cloud based computing aren’t hard to figure out. When you move to the cloud, you don’t need to spend as much to get a new infrastructure in place and hire and train the teams to support it.

Applications such as ERM (Enterprise Risk Management), which are notoriously difficult to implement successfully, can be had at a moment’s notice by subscribing to a cloud based application. Enhanced performance, scalability, reliability, and speed of implementation are some of the main drivers of the move to cloud-based computing.


Platform-as-a-Service means that vendors can support the entire stack required for running mission critical applications, such as the web servers, databases, load balancers etc. You can monitor and manage the cloud based infrastructure through web based interfaces.

Cloud based providers such as Amazon Web Services (AWS), Microsoft Azure, and Rackspace Cloud have wowed both system administrators and developers with their ability to provide vast amounts of on-demand computing power at the mere push of a button. You can easily spin up and spin down server capacity as your demand fluctuates.

However, for many organizations, especially small to middle sized ones, it makes a lot of sense to outsource the entire task of running the infrastructure, instead of building a new data center, to a service such as Amazon Web Services (AWS). AWS (as well as Microsoft’s Azure and Rackspace Cloud) is a highly optimized and efficient cloud infrastructure, and if the cost analysis works for you, it may be the best way to move to a cloud-based environment. After all, if you go the route of something such as AWS, all you need is a valid credit card and you could be setting up a cloud environment and running a web application or a big data query in a matter of a few hours!
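To illustrate just how little it takes, here's a minimal sketch that launches a single EC2 instance with the boto3 Python SDK; the AMI ID and region are placeholders, and it assumes you already have valid AWS credentials configured:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one small instance; the AMI ID below is a placeholder.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("launched", instance_id)

# When demand drops, spinning capacity back down is just as easy:
# ec2.terminate_instances(InstanceIds=[instance_id])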

What Cloud Computing Is All About

Cloud computing means different things to different people. I therefore take the convenient short cut of using the National Institute of Standards and Technology (NIST) definition for cloud computing, even though it’s not something that everybody accepts without quibbling.

The NIST definition of cloud computing is no mere one-liner. Rather, it comprises three components: the essential characteristics of a cloud environment, the cloud deployment models, and the cloud service models. Following is a brief explanation of each of these components, as defined by NIST.


Not every web based service qualifies as cloud computing. According to the NIST definition, an offering or service must satisfy the following five requirements to be considered cloud computing.

  • On-demand self-service: users should be able to request and get access to the service offering without the intervention of administrators. User self-service relies on either a custom or an out-of-the-box user portal, and the backend systems must have the APIs to support the portals. Remember that sometimes you won’t be able to implement certain self-service features, due to compliance and regulatory requirements.

  • Broad network access: users should be able to access a cloud service easily, with a basic network connection and a thin client, or even no client at all. The services should also be accessible from different types of devices.

  • Resource pooling: unlike in traditional systems, cloud-based systems must pool their resources, so the resources which are unused by some customers can be utilized by other users. Resource pooling can be provided through virtualization, and it lowers costs and enhances the flexibility of the cloud providers.

  • Rapid elasticity: the environment must support the rapid shrinking and expansion of the computing resources through built-in automation and orchestration. Elasticity solves a major computational problem of traditional environments, which are forced to support a much higher computing capacity at all times, although the users may need the peak capacity only for occasional bursts (hence the term “burst capacity”) that last only for short periods of time.

  • Measured Service: A cloud service offering must be able to accurately measure usage, with metrics such as usage time, bandwidth and data usage. Without the ability to measure its services, a provider can’t charge the customers on a usage basis.


Although virtualization per se isn’t required in a cloud environment, most cloud based environments do rely on virtualization to lower the costs involved in ramping up capacity.

Although it’s really not a part of the NIST definition of what a cloud computing environment is, I think it’s fair to add resiliency and high availability as primary requirements of a cloud based computational model.


There are multiple cloud service deployment models, as explained here.

  • Public: the public deployment model is what’s popularly associated with the term cloud computing. In this model, an external cloud service provider provides, manages, and supports the entire cloud infrastructure. Customers are responsible for the applications that run on the cloud infrastructure. Access to a public cloud is usually through the internet.

  • Private: a private cloud is where an organization maintains and supports the infrastructure internally, along with the software and applications that service the end users. You access these clouds mostly through a LAN, WAN or a VPN.

  • Hybrid: this model really means that there’s more than one cloud model that’s used, with linkages among the various clouds.

  • Community: a community cloud is shared among members of a group of organizations, who wish to restrict access to the cloud.


There are three basic cloud service models:

  • Infrastructure as a Service (IaaS): this service provides infrastructure such as servers, VMs, storage, and the network. You basically get a “bare bones” managed cloud, on top of which you can create your own services. The provider spins up the servers for you on demand with a preconfigured setup, and it often takes just minutes to get new capacity configured per your specifications. Amazon AWS, Rackspace, Microsoft Azure, and Google Compute Engine are good examples of IaaS based systems. Note that you’ll still be responsible for the installation and management of the OS and the network, etc.

  • Platform as a Service (PaaS): this model offers both the infrastructure as well as the software that goes on the servers, such as the OS itself, databases, web servers etc. As such, PaaS lies somewhere in the middle of IaaS and SaaS. With a PaaS cloud service such as OpenShift (Red Hat’s PaaS system), you can make the platform spin up the environment along with the components needed for running an application, through a few commands. Administrators will still be responsible for maintaining the operating system and managing the network, as well as performing all the other system administration tasks.

  • Software as a Service (SaaS): this is the full Monty, as the services provide complete application, infrastructure, platform, and data services. This is the most common cloud service model. You just sign up for the service and that’s it! SaaS offers the least amount of customization of the three cloud service models. Salesforce and Gmail are good examples of the SaaS model.

Let me make sure to point out that cloud computing isn’t an unmixed blessing for everybody. Cloud computing comes with its own shortcomings, chief among them concerns such as the inability of service providers to offer true SLAs (service level agreements), data integration between private and public cloud based data, the inability to audit the provider’s systems or applications, the lack of customization, and inherent security issues due to the multi-tenancy model of public clouds.

Amazon Web Services (AWS) is a set of services that enable you to get on the cloud with minimal effort and pain. I discuss AWS in detail in Chapter 10.

Even when you aren’t using a cloud computing vendor such as AWS, Microsoft Azure, or Google App Engine, you’re likely to be virtualizing your on-premise infrastructure through private clouds using something like OpenStack.

OpenStack is Linux based software that provides an orchestration layer for building a cloud data center. It provides a provisioning portal that allows you to provision and manage servers, storage, and networks. OpenStack is a popular open source platform that enables you to build an IaaS cloud (Infrastructure as a Service). OpenStack is designed to meet the needs of both private and public cloud providers and can support small to very large sized environments.

The Rise of NoSQL Databases

Traditional infrastructures and applications mostly relied (and a lot of them still do) on relational databases such as Oracle, MySQL, PostgreSQL and Microsoft SQL Server for running both online transaction processing systems, as well as for data warehouses and data marts. In those environments, it was always useful for the systems administrator to know something about relational databases.

In large environments, there are dedicated database administrators (DBAs) who manage these databases (and these databases do require a lot of care and feeding!), so sysadmins usually got into the picture during installation time and when there was an intractable server, storage, or network related performance issue that was beyond the typical DBA’s skill set.

Two of the most common relational databases that are in use today on the internet are MySQL and PostgreSQL. In addition to these traditional databases, modern system administrators must also be comfortable working with NoSQL databases.

While relational databases are adept at handling simple tabular structures like spreadsheets, there’s other data for which the relational model proves inadequate. Data such as geospatial data, engineering and medical parts data, or molecular models involve multiple levels of nesting and complex data models. Hence, the standard row-and-column, two-dimensional structure doesn’t suit these types of data very well.

In cases such as these, NoSQL databases are a good option, since you can easily represent multi-level nesting and hierarchies with the JavaScript Object Notation (JSON) format used by many NoSQL databases.
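For instance, here's a minimal sketch, using Python and the pymongo driver, of storing and querying a nested parts document in MongoDB; the database, collection, and field names are all hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB
db = client["engineering"]

# A multi-level, nested document that would be awkward to flatten into rows.
db.parts.insert_one({
    "part_no": "A-100",
    "assembly": {
        "name": "rotor",
        "components": [
            {"name": "blade", "qty": 8},
            {"name": "hub", "qty": 1},
        ],
    },
})

# Dot notation reaches directly into the nested structure.
doc = db.parts.find_one({"assembly.components.name": "blade"})
print(doc["part_no"])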

NoSQL databases have become common over the past decade, owing to the need for handling document storage and fast clustered reading. Following are the hallmarks of NoSQL databases that set them apart from the traditional relational databases:

  • Abandoning of the relational model as a way to organize data

  • Less stress and focus on concurrency issues than traditional database systems

  • Geared to running well on distributed clusters (making them suitable for very large databases through techniques such as sharding)

  • Built-in replication capabilities

  • Schema-less data models

  • Ability to handle very large data sizes

  • An open-source model

There are several types of NoSQL databases:

  • Key-value stores: These are the simplest types of NoSQL databases in many ways. The database stores values as blobs and links them to the keys. A key-value store always uses primary-key access, thus providing fast access and easy scalability. Redis, Riak and Amazon DynamoDB are examples of key-value stores.

  • Document stores: The document databases store and retrieve documents in various formats such as XML and JSON. These databases store documents in the value part of the key-value store. MongoDB is a popular document database.

  • Column-family stores: Store data with keys mapped to values and values grouped into column families. A column family is a map of data. Cassandra, Apache Hbase and Amazon SimpleDB are good examples.

  • Graph databases: These databases store entities and the relationships between the entities. The entities are called nodes and the relations are known as edges. The nodes are organized by relationships that enable you to see the patterns among the nodes. Neo4J is a well-known graph database.

Today, it’s not uncommon for organizations, especially small and medium sized companies, to expect their system administrators to help manage the NoSQL databases, instead of looking to a dedicated DBA to manage them. For one thing, the new genre of NoSQL databases requires far fewer specialized database skills (such as SQL, relational data modeling, and so on).

Furthermore, companies use several types of databases for specialized purposes and most organizations don’t require a dedicated DBA for each of these databases. So, managing these databases, or at the least, worrying about their uptime and performance is increasingly falling into the lap of the system administrator.

Chapters 3 and 11 explain the key concepts behind NoSQL databases such as MongoDB and Cassandra.

System administrators have long used shell scripts, along with Python and Perl scripts, to manage even large infrastructures. So, what do modern configuration management (CM) tools such as Chef and Puppet offer that the old-fashioned scripting methods don’t? Following is a brief list of the benefits offered by the newer configuration tools.

  • It’s easier for a system administrator to learn a DSL than a full-fledged scripting language such as Python.

  • The tools make it easier for teams to understand the configuration and maintain it over time.

  • One team can configure the servers and other teams can apply the configuration without a problem.

  • Since you can easily reproduce servers, change management becomes easy – you can perform multiple minor changes frequently

  • All CM tools contain auditing capabilities. The tools help you keep up with work performed by groups of system administrators and developers; manual configuration simply can’t track and audit the changes made by the individual members of the various teams. Visibility is thus a key benefit of CM systems.

Let me illustrate the dramatic difference between the old approach and the configuration tools methodology. Following is a Chef recipe that shows how to change permissions on a directory:

directory '/var/www/repo' do
  mode '0755'
  owner 'www'
  group 'www'
end

If you were to do the same thing using a scripting language, your script would need to incorporate logic for several things, such as checking for the existence of the directory, and confirming that the owner and group information is correct, etc. With Chef, you don’t need to add this logic since it knows what to do in each situation.
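To see the contrast, here's a rough sketch of the checks such a hand-rolled script would need, written in Python; the path, owner, group, and mode mirror the recipe above:

import grp
import os
import pwd
import stat

path, owner, group, mode = "/var/www/repo", "www", "www", 0o755

# Create the directory only if it doesn't already exist.
if not os.path.isdir(path):
    os.makedirs(path)

# Fix ownership only if it's wrong.
st = os.stat(path)
uid = pwd.getpwnam(owner).pw_uid
gid = grp.getgrnam(group).gr_gid
if (st.st_uid, st.st_gid) != (uid, gid):
    os.chown(path, uid, gid)

# Fix permissions only if they're wrong.
if stat.S_IMODE(st.st_mode) != mode:
    os.chmod(path, mode)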

This doesn’t mean that these things cease to be important, or that you don’t need to think about them. Quite the contrary, in fact. The logic provided by the Chef provider ensures that, regardless of the operating system and its version, the directory is set to the correct permissions, and the work is performed only if necessary.

Modern Scripting Languages

Often system administrators wonder which programming languages they ought to learn. A good strategy is to become adept at one or two scripting languages, say Python or Ruby.

In addition, system administrators should learn new languages and frameworks because in today’s world you’ll be dealing with numerous open source tools. To work efficiently with these tools and adapt them to your environment and even make enhancements, you need to know how these tools are constructed.

There’s no one scripting language that can serve as a do-it-all language. Some languages are better at handling text and data, while others may be more effective at working with cloud vendor APIs through their specialized libraries.

Puppet and Chef both use the Ruby programming language, and therefore, if you’re planning to spend any serious time with these types of tools, it’s smart to start liking and learning Ruby, that is, if you don’t already love it. Ruby is fast becoming the default scripting language for modern system administrators; it has overtaken Perl and Python as the scripting language of the future.

Traditionally system administrators used a heavy dose of shell scripting with Bash, Awk and Sed, to perform routine Linux administration operations such as searching for files, removing old log files and managing users, etc. While all the traditional scripting skills continue to be useful, in the modern Linux administration world, a key reason for using a programming language is to create tools.

Chef cookbooks are generally written in Ruby (using the Chef DSL, which is itself Ruby). Underneath, Chef uses Erlang along with some Ruby. Puppet uses the Puppet DSL, which is implemented in Ruby. Ansible and SaltStack, on the other hand, are Python based. So, a good grounding in both Ruby and Python is very useful.

Note: New technologies such as Linux Containers, Docker, Packer, and etcd all use Go for their internal tooling.

Microservices, Service Registration and Service Discovery

Microservices are a fast-growing architectural style for distributed applications. A microservice architecture involves breaking an application up into tiny service apps, with each app performing a specialized task.

You build up complex applications by using microservices as your building blocks. You can deploy microservices independent of each other since the services are loosely coupled. Each microservice performs a single task that represents a small part of the business capability. Since you can scale the smaller blocks per your requirements, fast-growing sites find it extremely convenient to adopt a microservice based architecture. Microservices help deliver software faster and also adapt to changes using newer technologies.

Right now, many organizations use Node.js, a cross-platform runtime environment for developing server-side web applications, to create the tiny web services that comprise the heart of a microservice based architecture. Following are some of the common tools and frameworks used as part of creating and managing microservices.

  • Node.js

  • Zookeeper (or etcd or consul)

  • Serf

  • Skydns/skydock


Some folks use the two-pizza rule as a guideline for determining whether a service really qualifies as a microservice. The two-pizza rule states that if you can’t feed the team that’s building the microservice with two pizzas, the microservice is too big!

Microservices are just what they sound like – they’re small services that are geared towards a narrow functionality. Instead of writing one long monolithic piece of code, in a microservice based approach, you try to create independent services that work together to achieve specific objectives.
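As a toy illustration of such a narrowly scoped service, here's a minimal sketch of a price-lookup microservice written in Python with Flask; the route, port, and in-memory data are hypothetical stand-ins (many shops would build the same thing with Node.js, as noted earlier):

from flask import Flask, jsonify

app = Flask(__name__)

# In-memory "data store", purely for illustration.
PRICES = {"widget": 9.99, "gadget": 24.50}

@app.route("/prices/<item>")
def get_price(item):
    # One narrow responsibility: answer price lookups over a REST API.
    if item not in PRICES:
        return jsonify(error="unknown item"), 404
    return jsonify(item=item, price=PRICES[item])

if __name__ == "__main__":
    app.run(port=5001)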

If you’re finding that it’s taking a very long time and a large amount of effort to maintain code for an app, the app may very well be ready for breaking up into a set of smaller services.

As to how small a microservice ought to be, there’s no hard and fast rule. Some define a microservice in terms of the time it takes to write it; for example, one may say that any service you can write within two weeks qualifies as a microservice.


A PaaS such as Cloud Foundry is ideal for supporting microservice architectures, since it includes several types of databases, message brokers, data caches and other services that you need to manage.

The key thing is that the services are independent of each other and are completely autonomous, and therefore, changes in the individual services won’t affect the entire application, at least in theory. Each of the services collaborates with other services through an application programming interface (API). Microservices in general have the following features:

  • Loosely coupled architecture: microservices are deployable on their own without a need to coordinate the deployment of a service with other microservices

  • Bounded Context: any one microservice is unaware of the underlying implementation of the other microservices

  • Language Neutrality: all microservices don’t have to be written in the same programming language. You write each microservice in the language that’s best for it. Language neutral APIs such as REST are used for communications among the microservices

  • Scalability: you can scale up an application that’s bottlenecked without having to scale the entire application

You may be wondering about the difference between microservices and a service-oriented architecture (SOA). Both SOA and microservices are service architectures that deal with a distributed set of services communicating over the network. The focus of SOA is mostly on reusability and discovery, whereas microservices focus on replacing a monolithic application with an agile, incremental, and more effective approach to delivering functionality. I think microservices are really the next stage in the evolution of the principles behind SOA. SOA hasn’t always worked well in practice, for various reasons; microservices are a more practical and realistic approach to achieving the same goals.

Benefits of Microservices

Following is a brief list of the key benefits offered by microservice based architectures.

  • Heterogeneous Technologies: you don’t have to settle for a common, mediocre technology for the entire system. Microservices let you choose best-of-breed technologies for the various functions an application provides.

  • Speed: it’s much faster to rewrite or modify a tiny part of the application rather than make sure that the entire application is modified, tested and approved for production deployment.

  • Organization: Since separate teams work on different microservices, they avoid much of the friction of larger groups trying to work together.

  • Resilient systems: you can easily isolate the impact of the failure of small service components.

  • Scaling: you can scale only some services that actually need scaling and leave the other parts of the system alone.

  • Easy Application Modifications and Upgrades: with microservices, you can easily replace and remove older services since rewriting new ones is almost trivial.

Both service registration and service discovery play a key role in the management of microservices, so let me briefly explain these two concepts in the following sections.

Service Discovery

Communication among the microservice applications is a major issue when you use the microservice architecture; service registration and service discovery are the solutions to this issue. As the number of microservices proliferates, both you and the users need to know where to find the services. Service discovery is the way to keep track of microservices so you can monitor them and know where to find them. There are multiple approaches to discovering services, the simplest strategy being the use of the Domain Name System (DNS) to identify services. You can simply associate a service with the IP address of the host that runs the service. Of course, this means that you’d need to update the DNS entries when you deploy services. You could also use a different DNS server for different environments. However, in a highly dynamic microservice environment, hosts change often, so you’ll be stuck with having to frequently update the DNS entries.
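In code, the DNS strategy boils down to an ordinary name lookup; here's a one-line sketch in Python against a hypothetical internal service name:

import socket

# Resolve the (hypothetical) DNS name that fronts the "prices" service.
ip = socket.gethostbyname("prices.service.internal.example.com")
print(ip)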

Service Registration

Service registration is the identification and tracking of services by having them register themselves with a central registry that offers a lookup service to help you find them. There are several options for implementing service registration:

  • Zookeeper: A coordination service for distributed environments that offers a hierarchical namespace where you can store your service information.

  • etcd: A distributed key-value store that provides shared configuration and service discovery for clusters.

  • Consul: More sophisticated than Zookeeper when it comes to service discovery capabilities. It exposes an HTTP interface for registering and discovering services, and it also provides a DNS server (see the sketch following this list).
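Here's a minimal sketch of registering and then discovering a hypothetical "prices" service through Consul's HTTP API, using Python and requests; it assumes a Consul agent listening on localhost:8500, and the service name, port, and tag are placeholders:

import requests

CONSUL = "http://localhost:8500"   # assumes a local Consul agent

# Register this instance of the "prices" microservice with the local agent.
registration = {"Name": "prices", "Port": 5001, "Tags": ["v1"]}
requests.put(f"{CONSUL}/v1/agent/service/register", json=registration).raise_for_status()

# Discover where instances of the service are running.
for node in requests.get(f"{CONSUL}/v1/catalog/service/prices").json():
    print(node["Address"], node["ServicePort"])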

Chapter 5 discusses service discovery and service registration in more detail, in the context of using Docker containers.

Modern Infrastructure Security Concerns

While traditional security strategies such as Linux server hardening and network security are as important as ever, several other security concerns have come to the forefront in recent years.

Sysadmins need to learn how to secure containers (Docker), cloud environments, and big data systems. Web application security is also an important concern today, with so much of the work being handled by web applications.

Following are the key topics that I deal with in Chapter 15, which I devote to a discussion of security concerns that pertain to modern developments in IT.

  • DevOps Security

  • Infrastructure as Code and security

  • Container (Docker) security

  • Zero Trust Networking

  • Security in the cloud (how Amazon Web Services secures its infrastructure and your data)

  • Big Data (Hadoop) security

  • Web application security, including web application security testing and the OWASP security controls

Performance Tuning and Site Reliability Engineering

Linux administrators are familiar with the standard Linux performance tools that help monitor and tune performance. However, there are new performance concerns today, such as performance in virtualized and cloud environments.

Site reliability engineering involves performing work that historically was the domain of administrators working on operations teams, but doing the work with the help of engineers whose primary expertise is in coding. A key goal of SRE is the use of automation to cut down on human intervention to resolve issues.

Google’s SRE teams use software engineers to create systems that perform the work traditionally done by sysadmins. In the words of Google’s Benjamin Treynor Sloss, “SRE is what happens when you ask a software engineer to design an operations team.”

I discuss the following key areas in Chapter 16:

  • Learning from failure and Chaos engineering

  • Site reliability engineering (SRE) and high availability

  • Effective use of caching to scale

  • Understanding and Tuning Java Performance (the JVM as well as the Java Applications)

  • Monitoring Linux, Tuning the Kernel, and Tracing in a Linux System

  • Tuning the network
