Chapter 4. Container Isolation

This is the chapter in which you’ll find out how containers really work! This will be essential to understanding the extent to which containers are isolated from each other and from the host. You will be able to assess for yourself the strength of the security boundary that surrounds a container.

As you’ll know if you have ever run docker exec -it <container> bash, a container looks a lot like a virtual machine from the inside. If you have shell access to a container and run ps, you can see only the processes that are running inside it. The container has its own network stack, and it seems to have its own filesystem with a root directory that bears no relation to root on the host. You can run containers with limited resources, such as a restricted amount of memory or a fraction of the available CPUs. This all happens using the Linux features that we’re going to delve into in this chapter.

However much they might superficially resemble each other, it’s important to realize that containers aren’t virtual machines, and in Chapter 5 we’ll take a look at the differences between these two types of isolation. In my experience, really understanding and being able to contrast the two is absolutely key to grasping the extent to which traditional security measures can be effective in containers, and to identifying where container-specific tooling is necessary.

You’ll see how containers are built out of Linux constructs such as namespaces and chroot, along with cgroups, which were covered in Chapter 3. With an understanding of these constructs under your belt, you’ll have a feeling for how well protected your applications are when they run inside containers.

Although the general concepts of these constructs are fairly straightforward, the way they work together with other features of the Linux kernel can be complex. Container escape vulnerabilities (for example, CVE-2019-5736, a serious vulnerability discovered in both runc and LXC) have been based on subtleties in the way that namespaces, capabilities, and filesystems interact.

Linux Namespaces

If cgroups control the resources that a process can use, namespaces control what it can see. By putting a process in a namespace, you can restrict the resources that are visible to that process.

The origins of namespaces date back to the Plan 9 operating system. At the time, most operating systems had a single “name space” of files. Unix systems allowed the mounting of filesystems, but they would all be mounted into the same system-wide view of all filenames. In Plan 9, each process was part of a process group that had its own “name space” abstraction, the hierarchy of files (and file-like objects) that this group of processes could see. Each process group could mount its own set of filesystems without those mounts being visible to other process groups.

The first namespace was introduced to the Linux kernel in version 2.4.19 back in 2002. This was the mount namespace, and it provided similar functionality to that in Plan 9. Nowadays there are several different kinds of namespace supported by Linux:

  • Unix Timesharing System (UTS)—this sounds complicated, but to all intents and purposes this namespace is really just about the hostname and domain names for the system that a process is aware of.

  • Process IDs

  • Mount points

  • Network

  • User and group IDs

  • Inter-process communications (IPC)

  • Control groups (cgroups)

It’s possible that more resources will be namespaced in future revisions of the Linux kernel. For example, there have been discussions about having a namespace for time.

A process is always in exactly one namespace of each type. When you start a Linux system it has a single namespace of each type, but as you’ll see, you can create additional namespaces and assign processes into them. You can easily see the namespaces on your machine using the lsns command:

vagrant@myhost:~$ lsns
        NS TYPE   NPROCS   PID USER    COMMAND
4026531835 cgroup      3 28459 vagrant /lib/systemd/systemd --user
4026531836 pid         3 28459 vagrant /lib/systemd/systemd --user
4026531837 user        3 28459 vagrant /lib/systemd/systemd --user
4026531838 uts         3 28459 vagrant /lib/systemd/systemd --user
4026531839 ipc         3 28459 vagrant /lib/systemd/systemd --user
4026531840 mnt         3 28459 vagrant /lib/systemd/systemd --user
4026531992 net         3 28459 vagrant /lib/systemd/systemd --user

This looks nice and neat, and there is one namespace for each of the types I mentioned previously. Sadly, this is an incomplete picture! The man page for lsns tells us that it “reads information directly from the /proc filesystem and for non-root users it may return incomplete information.” Let’s see what you get when you run as root:

vagrant@myhost:~$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026531835 cgroup     93     1 root            /sbin/init
4026531836 pid        93     1 root            /sbin/init
4026531837 user       93     1 root            /sbin/init
4026531838 uts        93     1 root            /sbin/init
4026531839 ipc        93     1 root            /sbin/init
4026531840 mnt        89     1 root            /sbin/init
4026531860 mnt         1    15 root            kdevtmpfs
4026531992 net        93     1 root            /sbin/init
4026532170 mnt         1 14040 root            /lib/systemd/systemd-udevd
4026532171 mnt         1   451 systemd-network /lib/systemd/systemd-networkd
4026532190 mnt         1   617 systemd-resolve /lib/systemd/systemd-resolved

The root user can see some additional mount namespaces, and there are a lot more processes visible to root than were visible to the non-root user. The point to note is that when you use lsns, you should run it as root (or use sudo) to get the complete picture.
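
You can also check which namespaces an individual process belongs to by looking at the symbolic links under /proc/<pid>/ns; the numbers in the link targets correspond to the namespace IDs in the lsns output. Here is a sketch of what this looks like for your current shell (the inode numbers, dates, and exact set of entries will vary with your machine and kernel version):

vagrant@myhost:~$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 net -> net:[4026531992]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 pid -> pid:[4026531836]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 user -> user:[4026531837]
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:35 uts -> uts:[4026531838]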

Let’s explore how you can use namespaces to create something that behaves like what we call a “container.”

Note

The examples in this chapter use Linux shell commands to create a container. If you would like to try creating a container using the Go programming language, you will find instructions at https://github.com/lizrice/containers-from-scratch.

Isolating the Hostname

Let’s start with the namespace for the Unix Timesharing System (UTS). As mentioned previously, this covers the hostname and domain names. By putting a process in its own UTS namespace, you can change the hostname for this process independently of the hostname of the machine or virtual machine on which it’s running.

If you open a terminal on Linux, you can see the hostname:

vagrant@myhost:~$ hostname
myhost

Most (perhaps all?) container systems give each container a random ID. By default this ID is used as the hostname. You can see this by running a container and getting shell access. For example, in Docker you could do the following:

vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ hostname
cdf75e7a6c50

Incidentally, you can see in this example that even if you give the container a name in Docker (here I specified --name hello), that name isn’t used for the hostname of the container.

The container can have its own hostname because Docker created it with its own UTS namespace. You can explore the same thing by using the unshare command to create a process that has a UTS namespace of its own.

As it’s described on the man page (seen by running man unshare), unshare lets you “run a program with some namespaces unshared from the parent.” Let’s dig a little deeper into that description. When you “run a program,” the kernel creates a new process and executes the program in it. This is done from the context of a running process—the parent—and the new process will be referred to as the child. The word “unshare” means that, rather than sharing namespaces of its parent, the child is going to be given its own.

Let’s give it a try. You need to have root privileges to do this, hence the sudo at the start of the line:

vagrant@myhost:~$ sudo unshare --uts sh
$ hostname
myhost
$ hostname experiment
$ hostname
experiment
$ exit
vagrant@myhost:~$ hostname
myhost

This runs a sh shell in a new process that has a new UTS namespace. Any programs you run inside the shell will inherit its namespaces. When you run the hostname command, it executes in the new UTS namespace that has been isolated from that of the host machine.

If you were to open another terminal window to the same host before the exit, you could confirm that the hostname hasn’t changed for the whole (virtual) machine. You can change the hostname on the host without affecting the hostname that the namespaced process is aware of, and vice versa.
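
If you want further evidence that a namespace is doing the work here, you can compare the UTS namespace of your host shell with that of the unshared sh process from that second terminal. The following is a sketch; the inode numbers are illustrative, and you would substitute the actual process ID of the sh process (which you can find with ps):

vagrant@myhost:~$ readlink /proc/$$/ns/uts
uts:[4026531838]
vagrant@myhost:~$ sudo readlink /proc/<pid of sh>/ns/uts
uts:[4026532286]

The two links point to different UTS namespaces, which is why the two shells can hold different hostnames at the same time.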

This is a key component of the way containers work. Namespaces give them a set of resources (in this case the hostname) that are independent of the host machine, and of other containers. But we are still talking about a process that is being run by the same Linux kernel. This has security implications that I’ll discuss later in the chapter. For now, let’s look at another example of a namespace by seeing how you can give a container its own view of running processes.

Isolating Process IDs

If you run the ps command inside a Docker container, you can see only the processes running inside that container and none of the processes running on the host:

vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:41 pts/0    00:00:00 bash
root        10     1  0 18:42 pts/0    00:00:00 ps -eaf
root@cdf75e7a6c50:/$ exit
vagrant@myhost:~$

This is achieved with the process ID namespace, which restricts the set of process IDs that are visible. Try running unshare again, but this time specifying that you want a new PID namespace with the --pid flag:

vagrant@myhost:~$ sudo unshare --pid sh
$ whoami
root
$ whoami
sh: 2: Cannot fork
$ whoami
sh: 3: Cannot fork
$ ls
sh: 4: Cannot fork
$ exit
vagrant@myhost:~$

This doesn’t seem very successful—it’s not possible to run any commands after the first whoami! But there are some interesting artifacts in this output.

The first process under sh seems to have worked OK, but every command after that fails due to an inability to fork. The error is output in the form <command>: <process ID>: <message>, and you can see that the process IDs are incrementing each time. Given the sequence, it would be reasonable to assume that the first whoami ran as process ID 1. That is a clue that the PID namespace is working in some fashion, in that the process ID numbering has restarted. But it’s pretty much useless if you can’t run more than one process!

There are clues to what the problem is in the description of the --fork flag in the man page for unshare: “Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new pid namespace.”

You can explore this by running ps to view the process hierarchy from a second terminal window:

vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30475 pts/0    S      0:00  \_ sudo unshare --pid sh
30476 pts/0    S      0:00      \_ sh

The sh process is not a child of unshare; it’s a child of the sudo process.

Now try the same thing with the --fork parameter:

vagrant@myhost:~$ sudo unshare --pid --fork sh
$ whoami
root
$ whoami
root

This is progress, in that you can now run more than one command before running into the “Cannot fork” error. If you look at the process hierarchy again from a second terminal, you’ll see an important difference:

vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30470 pts/0    S      0:00  \_ sudo unshare --pid --fork sh
30471 pts/0    S      0:00      \_ unshare --pid --fork sh
30472 pts/0    S      0:00          \_ sh
...

With the --fork parameter, the sh shell is running as a child of the unshare process, and you can successfully run as many different child commands as you choose within this shell.

Given that the shell is within its own process ID namespace, the results of running ps inside it might be surprising:

vagrant@myhost:~$ sudo unshare --pid --fork sh
$ ps
  PID TTY          TIME CMD
14511 pts/0    00:00:00 sudo
14512 pts/0    00:00:00 unshare
14513 pts/0    00:00:00 sh
14515 pts/0    00:00:00 ps
$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Mar27 ?        00:00:02 /sbin/init
root         2     0  0 Mar27 ?        00:00:00 [kthreadd]
root         3     2  0 Mar27 ?        00:00:00 [ksoftirqd/0]
root         5     2  0 Mar27 ?        00:00:00 [kworker/0:0H]
...many more lines of output about processes...
$ exit
vagrant@myhost:~$

As you can see, ps is still showing all the processes on the whole host, despite running inside a new process ID namespace. If you want the ps behavior that you would see in a Docker container, it’s not sufficient just to use a new process ID namespace, and the reason for this is included in the man page for ps: “This ps works by reading the virtual files in /proc.”

Let’s take a look at the /proc directory to see what virtual files this is referring to. Your system will look similar, but not exactly the same, as it will be running a different set of processes:

vagrant@myhost:~$ ls /proc
1      14553  292    467        cmdline      modules
10     14585  3      5          consoles     mounts
1009   14586  30087  53         cpuinfo      mpt
1010   14664  30108  538        crypto       mtrr
1015   14725  30120  54         devices      net
1016   14749  30221  55         diskstats    pagetypeinfo
1017   15     30224  56         dma          partitions
1030   156    30256  57         driver       sched_debug
1034   157    30257  58         execdomains  schedstat
1037   158    30283  59         fb           scsi
1044   159    313    60         filesystems  self
1053   16     314    61         fs           slabinfo
1063   160    315    62         interrupts   softirqs
1076   161    34     63         iomem        stat
1082   17     35     64         ioports      swaps
11     18     3509   65         irq          sys
1104   19     3512   66         kallsyms     sysrq-trigger
1111   2      36     7          kcore        sysvipc
1175   20     37     72         keys         thread-self
1194   21     378    8          key-users    timer_list
12     22     385    85         kmsg         timer_stats
1207   23     392    86         kpagecgroup  tty
1211   24     399    894        kpagecount   uptime
1215   25     401    9          kpageflags   version
12426  26     403    966        loadavg      version_signature
125    263    407    acpi       locks        vmallocinfo
13     27     409    buddyinfo  mdstat       vmstat
14046  28     412    bus        meminfo      zoneinfo
14087  29     427    cgroups    misc

Every numbered directory in /proc corresponds to a process ID, and there is a lot of interesting information about a process inside its directory. For example, /proc/<pid>/exe is a symbolic link to the executable that’s being run inside this particular process, as you can see in the following example:

vagrant@myhost:~$ ps
  PID TTY          TIME CMD
28441 pts/1    00:00:00 bash
28558 pts/1    00:00:00 ps
vagrant@myhost:~$ ls /proc/28441
attr             fdinfo      numa_maps      smaps
autogroup        gid_map     oom_adj        smaps_rollup
auxv             io          oom_score      stack
cgroup           limits      oom_score_adj  stat
clear_refs       loginuid    pagemap        statm
cmdline          map_files   patch_state    status
comm             maps        personality    syscall
coredump_filter  mem         projid_map     task
cpuset           mountinfo   root           timers
cwd              mounts      sched          timerslack_ns
environ          mountstats  schedstat      uid_map
exe              net         sessionid      wchan
fd               ns          setgroups
vagrant@myhost:~$ ls -l /proc/28441/exe
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:32 /proc/28441/exe -> /bin/bash

Irrespective of the process ID namespace it’s running in, ps is going to look in /proc for information about running processes. In order to have ps return only the information about the processes inside the new namespace, there needs to be a separate copy of the /proc directory, where the kernel can write information about the namespaced processes. Given that /proc is a directory directly under root, this means changing the root directory.
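
Before we do that, you can convince yourself that ps really is just reading /proc by comparing the number of numbered directories in /proc with the number of processes that ps reports. This is a sketch; the counts on your machine will differ (and may be off by one or two, since processes start and stop between the two commands):

vagrant@myhost:~$ ls -d /proc/[0-9]* | wc -l
93
vagrant@myhost:~$ ps -e --no-headers | wc -l
93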

Changing the Root Directory

From within a container, you don’t see the host’s entire filesystem; instead, you see a subset, because the root directory gets changed as the container is created.

You can change the root directory in Linux with the chroot command. This effectively moves the root directory for the current process to point to some other location within the filesystem. Once you have done a chroot command, you lose access to anything that was higher in the file hierarchy than your current root directory, since there is no way to go any higher than root within the filesystem, as illustrated in Figure 4-1.

The description in chroot’s man page reads as follows: “Run COMMAND with root directory set to NEWROOT. […] If no command is given, run ${SHELL} -i (default: /bin/sh -i).”

Figure 4-1. Changing root so a process only sees a subset of the filesystem

From this you can see that chroot doesn’t just change the directory, but also runs a command, falling back to running a shell if you don’t specify a different command.

Create a new directory and try to chroot into it:

vagrant@myhost:~$ mkdir new_root
vagrant@myhost:~$ sudo chroot new_root
chroot: failed to run command ‘/bin/bash’: No such file or directory
vagrant@myhost:~$ sudo chroot new_root ls
chroot: failed to run command ‘ls’: No such file or directory

This doesn’t work! The problem is that once you are inside the new root directory, there is no bin directory inside this root, so it’s impossible to run the /bin/bash shell. Similarly, if you try to run the ls command, it’s not there. You’ll need the files for any commands you want to run to be available within the new root. This is exactly what happens in a “real” container: the container is instantiated from a container image, which encapsulates the filesystem that the container sees. If an executable isn’t present within that filesystem, the container won’t be able to find and run it.

Why not try running Alpine Linux within your container? Alpine is a fairly minimal Linux distribution designed for containers. You’ll need to start by downloading the filesystem:

vagrant@myhost:~$ mkdir alpine
vagrant@myhost:~$ cd alpine
vagrant@myhost:~/alpine$ curl -o alpine.tar.gz http://dl-cdn.alpinelinux.org/
alpine/v3.10/releases/x86_64/alpine-minirootfs-3.10.0-x86_64.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2647k  100 2647k    0     0  16.6M      0 --:--:-- --:--:-- --:--:-- 16.6M
vagrant@myhost:~/alpine$ tar xvf alpine.tar.gz

At this point you have a copy of the Alpine filesystem inside the alpine directory you created. Remove the compressed version and move back to the parent directory:

vagrant@myhost:~/alpine$ rm alpine.tar.gz
vagrant@myhost:~/alpine$ cd ..

You can explore the contents of the filesystem with ls alpine to see that it looks like the root of a Linux filesystem with directories such as bin, lib, var, tmp, and so on.

Now that you have the Alpine distribution unpacked, you can use chroot to move into the alpine directory, provided you supply a command that exists within that directory’s hierarchy.

It’s slightly more subtle than that, because the executable has to be in the new process’s path. This process inherits the parent’s environment, including the PATH environment variable. The bin directory within alpine has become /bin for the new process, and assuming that your regular path includes /bin, you can pick up the ls executable from that directory without specifying its path explicitly:

vagrant@myhost:~$ sudo chroot alpine ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
vagrant@myhost:~$

Notice that it is only the child process (in this example, the process that ran ls) that gets the new root directory. When that process finishes, control returns to the parent process. If you run a shell as the child process, it won’t complete immediately, so that makes it easier to see the effects of changing the root directory:

vagrant@myhost:~$ sudo chroot alpine sh
/ $ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
/ $ whoami
root
/ $ exit
vagrant@myhost:~$

If you try to run the bash shell, it won’t work. This is because the Alpine distribution doesn’t include it, so it’s not present inside the new root directory. If you tried the same thing with the filesystem of a distribution like Ubuntu, which does include bash, it would work.

To summarize, chroot literally “changes the root” for a process. After changing the root, the process (and its children) will be able to access only the files and directories that are lower in the hierarchy than the new root directory.

Note

In addition to chroot, there is a system call called pivot_root. For the purposes of this chapter, whether chroot or pivot_root is used is an implementation detail; the key point is that a container needs to have its own root directory. I have used chroot in these examples because it is slightly simpler and more familiar to many people.

There are security advantages to using pivot_root over chroot, so in practice you should find the former if you look at the source code of a container runtime implementation. The main difference is that pivot_root takes advantage of the mount namespace; the old root is no longer mounted and is therefore no longer accessible within that mount namespace. The chroot system call doesn’t take this approach, leaving the old root accessible via mount points.
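
If you would like to see pivot_root in action, here is a rough sketch using the alpine directory from the earlier examples. This isn't exactly what a production runtime does, but it illustrates the idea. It needs its own mount namespace, and pivot_root insists that the new root is a mount point, hence the bind mount of the directory over itself:

vagrant@myhost:~$ sudo unshare --mount sh
$ mount --bind alpine alpine
$ cd alpine
$ mkdir -p old_root
$ pivot_root . old_root
$ umount -l /old_root

After the pivot, the old root is reachable only at /old_root within this mount namespace, and the lazy unmount detaches even that, so the host's filesystem can no longer be reached from inside.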

You have now seen how a container can be given its own root filesystem. I’ll discuss this further in Chapter 6, but right now let’s see how having its own root filesystem allows the kernel to show a container just a restricted view of namespaced resources.

Combine Namespacing and Changing the Root

So far you have seen namespacing and changing the root as two separate things, but you can combine the two by running chroot in a new namespace:

me@myhost:~$ sudo unshare --pid --fork chroot alpine sh
/ $ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr

If you recall from earlier in this chapter (see “Isolating Process IDs”), giving the container its own root directory allows it to create a /proc directory for the container that’s independent of /proc on the host. For this to be populated with process information, you will need to mount it as a pseudofilesystem of type proc. With the combination of a process ID namespace and an independent /proc directory, ps will now show just the processes that are inside the process ID namespace:

/ $ mount -t proc proc proc
/ $ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    6 root      0:00 ps
/ $ exit
vagrant@myhost:~$

Success! It has been more complex than isolating the container’s hostname, but through the combination of creating a process ID namespace, changing the root directory, and mounting a pseudofilesystem to handle process information, you can limit a container so that it has a view only of its own processes.

There are more namespaces left to explore. Let’s see the mount namespace next.

Mount Namespace

Typically you don’t want a container to have all the same filesystem mounts as its host. Giving the container its own mount namespace achieves this separation.

Here’s an example that creates a simple bind mount for a process with its own mount namespace:

vagrant@myhost:~$ sudo unshare --mount sh
$ mkdir source
$ touch source/HELLO
$ ls source
HELLO
$ mkdir target
$ ls target
$ mount --bind source target
$ ls target
HELLO

Once the bind mount is in place, the contents of the source directory are also available in target. If you look at all the mounts from within this process, there will probably be a lot of them, but the following command finds the target you created if you followed the preceding example:

$ findmnt target
TARGET    SOURCE                FSTYPE OPTIONS
/home/vagrant/target
          /dev/mapper/vagrant--vg-root[/home/vagrant/source]
                                ext4   rw,relatime,errors=remount-ro,data=ordered

From the host’s perspective, this isn’t visible, which you can prove by running the same command from another terminal window and confirming that it doesn’t return anything.

Try running findmnt from within the mount namespace again, but this time without any parameters, and you will get a long list. You might be thinking that it seems wrong for a container to be able to see all the mounts on the host. This is a very similar situation to what you saw with the process ID namespace: the kernel uses the /proc/<PID>/mounts directory to communicate information about mount points for each process. If you create a process with its own mount namespace but it is using the host’s /proc directory, you’ll find that its /proc/<PID>/mounts file includes all the preexisting host mounts. (You can simply cat this file to get a list of mounts.)
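
For example, still inside that mount namespace, something like the following (output abbreviated and approximate) shows a mounts file full of entries inherited from the host, plus the bind mount created above:

$ cat /proc/$$/mounts | wc -l
96
$ grep target /proc/$$/mounts
/dev/mapper/vagrant--vg-root /home/vagrant/target ext4 rw,relatime,errors=remount-ro,data=ordered 0 0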

To get a fully isolated set of mounts for the containerized process, you will need to combine creating a new mount namespace with a new root filesystem and a new proc mount, like this:

vagrant@myhost:~$ sudo unshare --mount chroot alpine sh
/ $ mount -t proc proc proc
/ $ mount
proc on /proc type proc (rw,relatime)
/ $ mkdir source
/ $ touch source/HELLO
/ $ mkdir target
/ $ mount --bind source target
/ $ mount
proc on /proc type proc (rw,relatime)
/dev/sda1 on /target type ext4 (rw,relatime,data=ordered)

Alpine Linux doesn’t come with the findmnt command, so this example uses mount with no parameters to generate the list of mounts. (If you are cynical about this change, try the earlier example with mount instead of findmnt to check that you get the same results.)

You may be familiar with the concept of mounting host directories into a container using docker run -v <host directory>:<container directory> .... To achieve this, after the root filesystem has been put in place for the container, the target container directory is created and then the source host directory gets bind mounted into that target. Because each container has its own mount namespace, host directories mounted like this are not visible from other containers.
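
If you would like to see that there is no extra magic involved, here is a rough approximation of what -v does, built from commands you have already used in this chapter. It assumes that the alpine directory and the source directory containing the HELLO file from the earlier examples are still in place:

vagrant@myhost:~$ sudo unshare --mount sh
$ mkdir -p alpine/mydata
$ mount --bind /home/vagrant/source alpine/mydata
$ chroot alpine sh
/ $ ls /mydata
HELLO

Because the bind mount was made inside a new mount namespace, it isn’t visible from the host’s mount namespace or from any other “container” created the same way.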

Note

If you create a mount that is visible to the host, it won’t automatically get cleaned up when your “container” process terminates. You will need to destroy it using umount. This also applies to the /proc pseudofilesystems. They won’t do any particular harm, but if you like to keep things tidy, you can remove them with umount proc. The system won’t let you unmount the final /proc used by the host.

Network Namespace

The network namespace allows a container to have its own view of network interfaces and routing tables. When you create a process with its own network namespace, you can see it with lsns:

vagrant@myhost:~$ sudo lsns -t net
        NS TYPE NPROCS PID USER    NETNSID NSFS COMMAND
4026531992 net      93   1 root unassigned      /sbin/init
vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      92     1 root unassigned      /sbin/init
4026532192 net       2 28586 root unassigned      bash

Note

You might come across the ip netns command, but that is not much use to us here. Using unshare --net creates an anonymous network namespace, and anonymous namespaces don’t appear in the output from ip netns list.

When you put a process into its own network namespace, it starts with just the loopback interface:

vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

With nothing but a loopback interface, your container won’t be able to communicate. To give it a path to the outside world, you create a virtual Ethernet interface—or more strictly, a pair of virtual Ethernet interfaces. These act as if they were the two ends of a metaphorical cable connecting your container namespace to the default network namespace.

In a second terminal window, as root, you can create a virtual Ethernet pair by specifying the anonymous namespaces associated with their process IDs, like this:

root@myhost:~$ ip link add ve1 netns 28586 type veth peer name ve2 netns 1
  • ip link add indicates that you want to add a link.

  • ve1 is the name of one “end” of the virtual Ethernet “cable.”

  • netns 28586 says that this end is “plugged in” to the network namespace associated with process ID 28586 (which is shown in the output from lsns -t net in the example at the start of this section).

  • type veth shows that this is a virtual Ethernet pair.

  • peer name ve2 gives the name of the other end of the “cable.”

  • netns 1 specifies that this second end is “plugged in” to the network namespace associated with process ID 1.

The ve1 virtual Ethernet interface is now visible from inside the “container” process:

root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group ...
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0

The link is in “DOWN” state and needs to be brought up before it’s any use. Both ends of the connection need to be brought up.

Bring up the ve2 end on the host:

root@myhost:~$ ip link set ve2 up

And once you bring up the ve1 end in the container, the link should move to “UP” state:

root@myhost:~$ ip link set ve1 up
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP ...
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::788a:3fff:feba:612c/64 scope link
       valid_lft forever preferred_lft forever

To send IP traffic, there needs to be an IP address associated with the interface. In the container:

root@myhost:~$ ip addr add 192.168.1.100/24 dev ve1

And on the host:

root@myhost:~$ ip addr add 192.168.1.200/24 dev ve2

This will also have the effect of adding an IP route into the routing table in the container:

root@myhost:~$ ip route
192.168.1.0/24 dev ve1 proto kernel scope link src 192.168.1.100

As mentioned at the start of this section, the network namespace isolates both the interfaces and the routing table, so this routing information is independent of the IP routing table on the host. At this point the container can send traffic only to 192.168.1.0/24 addresses. You can test this with a ping from within the container to the remote end:

root@myhost:~$ ping 192.168.1.200
PING 192.168.1.200 (192.168.1.200) 56(84) bytes of data.
64 bytes from 192.168.1.200: icmp_seq=1 ttl=64 time=0.355 ms
64 bytes from 192.168.1.200: icmp_seq=2 ttl=64 time=0.035 ms
^C

We will dig further into networking and container network security in Chapter 10.

User Namespace

The user namespace allows processes to have their own view of user and group IDs. Much like process IDs, the users and groups still exist on the host, but they can have different IDs. The main benefit of this is that you can map the root ID of 0 within a container to some other non-root identity on the host. This is a huge advantage from a security perspective, since it allows software to run as root inside a container, but an attacker who escapes from the container to the host will have a non-root, unprivileged identity. As you’ll see in Chapter 9, it’s not hard to misconfigure a container to make it easy to escape to the host. With user namespaces, you’re not just one false move away from host takeover.

Note

As of this writing, user namespaces are not in particularly common use yet. This feature is not turned on by default in Docker (see “User Namespace Restrictions in Docker”), and it is not supported at all in Kubernetes, though it has been under discussion.

Generally speaking, you need to be root to create new namespaces, which is why the Docker daemon runs as root, but the user namespace is an exception:

vagrant@myhost:~$ unshare --user bash
nobody@myhost:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@myhost:~$ echo $$
31196

Inside the new user namespace the user has the nobody ID. You need to put in place a mapping between user IDs inside and outside the namespace, as shown in Figure 4-2.

Figure 4-2. Mapping a non-root user on the host to root in a container

This mapping exists in /proc/<pid>/uid_map, which you can edit as root (on the host). There are three fields in this file:

  • The lowest ID to map from the child process’s perspective

  • The lowest corresponding ID that this should map to on the host

  • The number of IDs to be mapped

As an example, on my machine, the vagrant user has ID 1000. In order to have vagrant get assigned the root ID of 0 inside the child process, the first two fields are 0 and 1000. The last field can be 1 if you want to map only one ID (which may well be the case if you want only one user inside the container). Here’s the command I used to set up that mapping:

vagrant@myhost:~$ sudo echo '0 1000 1' > /proc/31196/uid_map

Immediately, inside its user namespace, the process has taken on the root identity. Don’t be put off by the fact that the bash prompt still says “nobody”; this doesn’t get updated unless you rerun the scripts that get run when you start a new shell (e.g., ~/.bash_profile):

nobody@myhost:~$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)

A similar mapping process is used to map the group(s) used inside the child process.
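
The group mapping lives in /proc/<pid>/gid_map and takes the same three fields. Here is a sketch using the same process ID as before; depending on your kernel settings, you may first have to write "deny" to the process's setgroups file before the gid_map can be written:

vagrant@myhost:~$ sudo sh -c 'echo deny > /proc/31196/setgroups'
vagrant@myhost:~$ sudo sh -c 'echo "0 1000 1" > /proc/31196/gid_map'
nobody@myhost:~$ id
uid=0(root) gid=0(root) groups=0(root)

The id output here is illustrative; the point is that group ID 0 inside the namespace now maps to the vagrant user's primary group on the host.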

This process is now running with a large set of capabilities:

nobody@myhost:~$ capsh --print | grep Current
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,
cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,
cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,
cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,
cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
cap_wake_alarm,cap_block_suspend,cap_audit_read+ep

As you saw in Chapter 2, capabilities grant the process various permissions. When you create a new user namespace, the kernel gives the process all these capabilities so that the pseudo root user inside the namespace is allowed to create other namespaces, set up networking, and so on, fulfilling everything else required to make it a real container.

In fact, if you simultaneously create a process with several new namespaces, the user namespace will be created first so that you have the full capability set that permits you to create other namespaces:

vagrant@myhost:~$ unshare --uts bash
unshare: unshare failed: Operation not permitted
vagrant@myhost:~$ unshare --uts --user bash
nobody@myhost:~$

User namespaces allow an unprivileged user to effectively become root within the containerized process. This allows a normal user to run containers using a concept called rootless containers, which we will cover in Chapter 9.

The general consensus is that user namespaces are a security benefit because fewer containers need to run as “real” root (that is, root from the host’s perspective). However, there have been a few vulnerabilities (for example, CVE-2018-18955) directly related to privileges being incorrectly transformed while transitioning to or from a user namespace. The Linux kernel is a complex piece of software, and you should expect that people will find problems in it from time to time.

User Namespace Restrictions in Docker

You can enable the use of user namespaces in Docker, but it’s not turned on by default because it is incompatible with a few things that Docker users might want to do.
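
If you do want to experiment with it, the relevant setting is userns-remap in the Docker daemon configuration, after which the daemon needs a restart. A minimal sketch:

vagrant@myhost:~$ cat /etc/docker/daemon.json
{
  "userns-remap": "default"
}
vagrant@myhost:~$ sudo systemctl restart docker

With the value "default", Docker sets up a user called dockremap and maps root inside containers to that user; you can also specify a user and group of your own choosing. The remapping applies to every container that the daemon runs unless it is explicitly disabled for an individual container.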

The following will also affect you if you use user namespaces with other container runtimes:

  • User namespaces are incompatible with sharing a process ID or network namespace with the host.

  • Even if the process is running as root inside the container, it doesn’t really have full root privileges. It doesn’t, for example, have CAP_NET_BIND_SERVICE, so it can’t bind to a low-numbered port. (See Chapter 2 for more information about Linux capabilities.)

  • When the containerized process interacts with a file, it will need appropriate permissions (for example, write access in order to modify the file). If the file is mounted from the host, it is the effective user ID on the host that matters.

    This is a good thing in terms of protecting the host files from unauthorized access from within a container, but it can be confusing if, say, what appears to be root inside the container is not permitted to modify a file.

Inter-process Communications Namespace

In Linux it’s possible to communicate between different processes by giving them access to a shared range of memory, or by using a shared message queue. The two processes need to be members of the same inter-process communications (IPC) namespace for them to have access to the same set of identifiers for these mechanisms.

Generally speaking, you don’t want your containers to be able to access one another’s shared memory, so they are given their own IPC namespaces.

You can see this in action by creating a shared memory block and then viewing the current IPC status with ipcs:

$ ipcmk -M 1000
Shared memory id: 98307
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0xad291bee 98307      ubuntu     644        1000       0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root       600        1

In this example, the newly created shared memory block (with its ID in the shmid column) appears as the last item in the “Shared Memory Segments” block. There are also some preexisting IPC objects that had previously been created by root.

A process with its own IPC namespace does not see any of these IPC objects:

$ sudo unshare --ipc sh
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

Cgroup Namespace

The last of the namespaces (at least, at the time of writing this book) is the cgroup namespace. This is a little bit like a chroot for the cgroup filesystem; it stops a process from seeing the cgroup configuration higher up in the hierarchy of cgroup directories than its own cgroup.

Note

Most namespaces were added by Linux kernel version 3.8, but the cgroup namespace was added later in version 4.6. If you’re using a relatively old distribution of Linux (such as Ubuntu 16.04), you won’t have support for this feature. You can check the kernel version on your Linux host by running uname -r.
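
For example (the exact version string will of course vary from one machine to another):

vagrant@myhost:~$ uname -r
4.18.0-25-generic

Anything at 4.6 or above includes cgroup namespace support.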

You can see the cgroup namespace in action by comparing the contents of /proc/self/cgroup outside and then inside a cgroup namespace:

vagrant@myhost:~$ cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/user.slice/user-1000.slice/session-51.scope
7:pids:/user.slice/user-1000.slice/session-51.scope
6:freezer:/
5:devices:/user.slice
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-51.scope
0::/user.slice/user-1000.slice/session-51.scope
vagrant@myhost:~$
vagrant@myhost:~$ sudo unshare --cgroup bash
root@myhost:~# cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/
7:pids:/
6:freezer:/
5:devices:/
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/
0::/

You have now explored all the different types of namespace and have seen how they are used along with chroot to isolate a process’s view of its surroundings. Combine this with what you learned about cgroups in the previous chapter, and you should have a good understanding of everything that’s needed to make what we call a “container.”

Before moving on to the next chapter, it’s worth taking a look at a container from the perspective of the host it’s running on.

Container Processes from the Host Perspective

Although they are called containers, it might be more accurate to use the term “containerized processes.” A container is still a Linux process running on the host machine, but it has a limited view of that host machine, and it has access to only a subtree of the filesystem and perhaps to a limited set of resources restricted by cgroups. Because it’s really just a process, it exists within the context of the host operating system, and it shares the host’s kernel as shown in Figure 4-3.

Figure 4-3. Containers share the host’s kernel

You’ll see how this compares to virtual machines in the next chapter, but before that, let’s examine in more detail the extent to which a containerized process is isolated from the host, and from other containerized processes on that host, by trying some experiments on a Docker container. Start a container process based on Ubuntu (or your favorite Linux distribution) and run a shell in it, and then run a long sleep in it as follows:

$ docker run --rm -it ubuntu bash
root@1551d24a $ sleep 1000

This example runs the sleep command for 1,000 seconds, but note that the sleep command is running as a process inside the container. When you press Enter at the end of the sleep command, this triggers Linux to clone a new process with a new process ID and to run the sleep executable within that process.

You can put the sleep process into the background (Ctrl-Z to pause the process, and bg %1 to background it). Now run ps inside the container to see the same process from the container’s perspective:

me@myhost:~$ docker run --rm -it ubuntu bash
root@ab6ea36fce8e:/$ sleep 1000
^Z
[1]+  Stopped                 sleep 1000
root@ab6ea36fce8e:/$ bg %1
[1]+ sleep 1000 &
root@ab6ea36fce8e:/$ ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
   10 pts/0    00:00:00 sleep
   11 pts/0    00:00:00 ps
root@ab6ea36fce8e:/$

While that sleep command is still running, open a second terminal into the same host and look at the same sleep process from the host’s perspective:

me@myhost:~$ ps -C sleep
  PID TTY          TIME CMD
30591 pts/0    00:00:00 sleep

The -C sleep parameter specifies that we are interested only in processes running the sleep executable.

The container has its own process ID namespace, so it makes sense that its processes would have low numbers, and that is indeed what you see when running ps in the container. From the host’s perspective, however, the sleep process has a different, high-numbered process ID. In the preceding example, there is just one process, and it has ID 30591 on the host and 10 in the container. (The actual number will vary according to what else is and has been running on the same machine, but it’s likely to be a much higher number.)

To get a good understanding of containers and the level of isolation they provide, it’s really key to get to grips with the fact that although there are two different process IDs, they both refer to the same process. It’s just that from the host’s perspective it has a higher process ID number.
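
On reasonably recent kernels you can see both numbers at once from the host, because a process's status file lists its PID in each of the PID namespaces it belongs to. Here is a sketch using the example process IDs from above (yours will differ):

me@myhost:~$ grep NSpid /proc/30591/status
NSpid:  30591   10

The first number is the PID as the host sees it; the second is the PID inside the container's process ID namespace.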

The fact that container processes are visible from the host is one of the fundamental differences between containers and virtual machines. An attacker who gets access to the host can observe and affect all the containers running on that host, especially if they have root access. And as you’ll see in Chapter 9, there are some remarkably easy ways you can inadvertently make it possible for an attacker to move from a compromised container onto the host.

Container Host Machines

As you have seen, containers and their host share a kernel, and this has some consequences for what are considered best practices relating to the host machines for containers. If a host gets compromised, all the containers on that host are potential victims, especially if the attacker gains root or otherwise elevated privileges (such as being a member of the docker group that can administer containers where Docker is used as the runtime).

It’s highly recommended to run container applications on dedicated host machines (whether they be VMs or bare metal), and the reasons mostly relate to security:

  • Using an orchestrator to run containers means that humans need little or no access to the hosts. If you don’t run any other applications, you will need a very small set of user identities on the host machines. These will be easier to manage, and attempts to log in as an unauthorized user will be easier to spot.

  • You can use any Linux distribution as the host OS for running Linux containers, but there are several “Thin OS” distros specifically designed for running containers. These reduce the host attack surface by including only the components required to run containers. Examples include RancherOS, Red Hat’s Fedora CoreOS, and VMware’s Photon OS. With fewer components included in the host machine, there is a smaller chance of vulnerabilities (see Chapter 7) in those components.

  • All the host machines in a cluster can share the same configuration, with no application-specific requirements. This makes it easy to automate the provisioning of host machines, and it means you can treat host machines as immutable. If a host machine needs an upgrade, you don’t patch it; instead, you remove it from the cluster and replace it with a freshly installed machine. Treating hosts as immutable makes intrusions easier to detect.

I’ll come back to the advantages of immutability in Chapter 6.

Using a Thin OS reduces the set of configuration options but doesn’t eliminate them completely. For example, you will have a container runtime (perhaps Docker) plus orchestrator code (perhaps the Kubernetes kubelet) running on every host. These components have numerous settings, some of which affect security. The Center for Internet Security (CIS) publishes benchmarks for best practices for configuring and running various software components, including Docker, Kubernetes, and Linux.

In an enterprise environment, look for a container security solution that also protects the hosts by reporting on vulnerabilities and worrisome configuration settings. You will also want logs and alerts for logins and login attempts at the host level.

Summary

Congratulations! Since you’ve reached the end of this chapter, you should now know what a container really is. You’ve seen the three essential Linux kernel mechanisms that are used to limit a process’s access to host resources:

  • Namespaces limit what the container process can see—for example, by giving the container an isolated set of process IDs.

  • Changing the root limits the set of files and directories that the container can see.

  • Cgroups control the resources the container can access.

As you saw in Chapter 1, isolating one workload from another is an important aspect of container security. You now should be fully aware that all the containers on a given host (whether it is a virtual machine or a bare-metal server) share the same kernel. Of course, the same is true in a multiuser system where different users can log in to the same machine and run applications directly. However, in a multiuser system, the administrators are likely to limit the permissions given to each user; they certainly won’t give them all root privileges. With containers—at least at the time of writing—they all run as root by default and are relying on the boundary provided by namespaces, changed root directories, and cgroups to prevent one container from interfering with another.

Note

Now that you know how containers work, you might want to explore Jess Frazelle’s contained.af site to see just how effective they are. Will you be the person who breaks the containment?

In Chapter 8 we’ll explore options for strengthening the security boundary around each container, but next let’s delve into how virtual machines work. This will allow you to consider the relative strengths of the isolation between containers and between VMs, especially through the lens of security.
