Chapter 1. The Lack of Visibility
Kubernetes has become the de facto cloud operating system, and every day more critical applications are containerized and shifted to the cloud native landscape. This makes Kubernetes a rich target for both passive and targeted attackers. Out of the box, Kubernetes provides no default security configuration and no observability to discern whether your pods or your cluster have been attacked or compromised.
Understanding your security posture isn’t just about applying security configuration and hoping for the best. Hope isn’t a strategy. Just like the site reliability engineering (SRE) principle of service level objectives (SLOs) that “[identify] an objective metric to represent the property of a system,”1 security observability provides us with a historical and current metric to represent the objective security properties of a system. Security observability allows us to “assess our current [security] and track improvements or degradations over time.”2
With security observability, we can quickly answer:
- How many pods are running with privileged Linux capabilities in my environment?
- Have any workloads in my environment made a connection to “known-bad.actorz.com”?
- Show me all local privilege escalation techniques detected in the last 30 days.
- Have any workloads other than Fluentd used S3 credentials?
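As a quick illustration of the first question, the statically declared half of the answer can be pulled from the Kubernetes API with a short sketch, assuming kubectl access to the cluster and jq installed:

```
# List pods that declare privileged containers (static configuration only).
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .securityContext.privileged == true))
      | "\(.metadata.namespace)/\(.metadata.name)"'
```

Note that this reads only declared configuration. The runtime questions, such as which capabilities a pod actually exercised or which connections it actually made, require the kind of kernel-level telemetry this report focuses on.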
Achieving observability in a cloud native environment can be complicated. It often requires changes to applications or the management of yet another complex distributed system. However, eBPF provides a lightweight methodology to collect security observability natively in the kernel, without any changes to applications.
What Should We Monitor?
Kubernetes is constructed of several independent microservices that run the control plane (API server, controller manager, scheduler) and worker node components (kubelet, kube-proxy, container runtime). A slew of additional components round out a typical cloud native deployment, including continuous integration/continuous delivery (CI/CD), storage subsystems, container registries, observability tooling (including eBPF), and many more.
Most of the systems that make up the CNCF landscape, including Kubernetes, are not secure by default.3 Each component requires intentional hardening to meet your goals of a least-privilege configuration and defending against a motivated adversary. So, which components should we focus our security observability efforts on? “The greatest attack surface of a Kubernetes cluster is its network interfaces and public-facing pods.”4 For example, an internet-exposed pod that handles untrusted input is a much more likely attack vector than a control plane component on a private network with a hardened RBAC (role-based access control) configuration.
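As a rough first pass over that attack surface, we can enumerate the services that expose pods beyond the cluster. This is a hedged sketch that shows declared exposure only, not actual reachability:

```
# Services that route external traffic to pods (declared exposure only).
kubectl get services --all-namespaces -o wide \
  | grep -E 'LoadBalancer|NodePort'
```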
While container images are immutable, containers and pods are standard Linux processes that can have access to a set of binaries, package managers, interpreters, runtimes, etc. Pods can install packages, download tools, make internet connections, and cause all sorts of havoc in a Kubernetes environment, all without logging any of that behavior by default. There’s also the challenge of applying a least-privilege configuration for our workloads, by providing only the capabilities a container requires. Security observability monitors containers and can quickly identify and record all the capabilities a container requires—and nothing more. This means we should start by applying our security observability to pods.
Most organizations that predate the cloud native era have existing security/detection tooling for their environments. So, why not just rely on those tools for cloud native security observability? Most legacy security tools don’t understand kernel namespaces and so can’t identify containerized processes. Existing network logs and firewalls are suboptimal for observability because pod IP addresses are ephemeral: as pods come and go, an IP address may have been reused by an entirely different app by the time you investigate. eBPF security observability natively understands container attributes and provides process and network visibility closer to the pods we’re monitoring, so we can capture events pre-NAT (network address translation), retain the real IP of the pod, and attribute each action to the container or pod that initiated it.
High-Fidelity Observability
When investigating a threat, the closer to the event the data is collected, the higher fidelity the data provides. A compromised pod that escalates its privileges and laterally moves through the network won’t show up in our Kubernetes audit logs. If the pods are on the same host, the lateral movement won’t even show up in our network logs. If our greatest attack surface is pods, we’ll want our security observability as close to pods as possible. The “further out” we place our observability, the less critical security context we’re afforded. For example, firewall or network intrusion detection logs generally map to the source IP address of the node that the offending pod resides on, because packet encapsulation renders the identity of the original source meaningless.
The same lateral movement event can be measured at the virtual ethernet (veth) interface of the pod or at the physical network interface of the node. Measuring at the pod’s veth interface captures the pre-NAT pod IP address and, with the help of eBPF, lets us retrieve Kubernetes labels, namespaces, pod names, and more. We are improving our event fidelity.
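To make the two vantage points concrete, consider capturing the same traffic at both. The veth device name below (veth1234abcd) is hypothetical; real names vary per pod:

```
# At the pod's veth: flows still carry the pre-NAT pod IP as the source.
sudo tcpdump -ni veth1234abcd tcp port 80

# At the node's physical NIC: the same flows appear post-NAT or
# encapsulated, with the node's IP as the source.
sudo tcpdump -ni eth0 tcp port 80
```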
But eBPF can get us even closer to pods: it operates in-kernel, where process requests are captured. At the socket layer (shown in Figure 1-1), we can assert a more meaningful identity for lateral movement than a network packet provides, including the process that invoked the connection, its arguments, and the capabilities it’s running with. We can even collect process events that never create a packet at all.
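As a minimal sketch of socket-layer identity (not the tooling used later in this report), a bpftrace one-liner can attach a kprobe to the kernel's tcp_connect function and name the process behind every outbound TCP connection. It assumes root privileges and a kernel with kprobes enabled:

```
# Print the process name and PID behind each outbound TCP connection.
sudo bpftrace -e 'kprobe:tcp_connect {
  printf("tcp_connect by %s (pid %d)\n", comm, pid);
}'
```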
This paradigm isn’t unique to eBPF. The security community has been moving away from network-centric security and toward a future where we can monitor and make enforcement decisions based on process behavior instead of a packet header. After all, when’s the last time anyone discovered a sophisticated attack from a packet capture (PCAP)?
A Kubernetes Attack
Let’s consider a hypothetical attack scenario in Kubernetes. (You don’t need to understand details of this attack now, but by the end of this report you’ll understand common attack patterns and how you can take advantage of simple tools to detect sophisticated attacks.)
Imagine you run a multitenant Kubernetes cluster that hosts both public-facing and internal applications. One of your tenants runs an internet-facing application on an insecure version of Apache Struts that bundles a version of Log4j vulnerable to remote exploitation.5 A threat actor submits a customized Java string to an input form of the web app, which causes the app to fetch a malicious Java class that is executed by Log4j. The Java class exploits a remote code execution (RCE) vulnerability that opens a reverse shell connection to a suspicious domain where an attacker is listening.6
The attacker makes a connection into the Apache Struts container and explores the system. The workload wasn’t restricted by the container runtime and has overly permissive Linux capabilities, which enable the attacker to mount the /etc/kubernetes/manifests directory from the host into the container. The attacker then drops a privileged pod manifest in kubelet’s manifest directory. The attacker now has a high-availability, kubelet-managed backdoor into the cluster that supersedes any IAM (identity and access management) or RBAC policies.
None of this is logged or detected, which allows the attacker to maintain a persistent foothold in your cluster, indefinitely and invisibly. By default, Kubernetes provides neither security hardening for workloads nor built-in observability. It is up to the cluster operator to decide what tools to use to understand how the cluster and its apps are behaving, and whether anything malicious is happening.
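For a taste of the visibility that was missing, here is a hedged bpftrace sketch that flags any process opening a file under kubelet's static manifest directory. It assumes root privileges and a bpftrace version with strncmp; purpose-built tools handle this far more robustly:

```
# Alert on opens under kubelet's static pod manifest directory.
# The length argument (25) matches "/etc/kubernetes/manifests".
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
  /strncmp(str(args->filename), "/etc/kubernetes/manifests", 25) == 0/ {
    printf("%s (pid %d) opened %s\n", comm, pid, str(args->filename));
}'
```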
In this report, we’ll discuss how eBPF can detect attacks, even if they’re invisible to Kubernetes, and how we can use the detected events to build out a security policy to stop them in their tracks.
What Is eBPF?
eBPF is an emerging technology that enables event-driven custom code to run natively in an operating system kernel. This has spawned a new era of network, observability, and security platforms. eBPF extends kernel functionality, without requiring changes to applications or the kernel, to observe and enforce runtime security policy. eBPF’s origins lie in BPF, a kernel technology originally developed for packet filtering, most famously in the inimitable tcpdump packet-capture utility.
The “extended” version of BPF (eBPF) began with an initial patch set of five thousand lines of code, followed by features that slowly trickled into the Linux kernel, providing the ability to trace low-level kernel subsystems and drawing inspiration from the superlative DTrace utility. While eBPF is a Linux (and soon, Windows) technology, the omnipresent Kubernetes distributed system has been uniquely positioned to drive the development of eBPF as a container technology.
eBPF’s unique vantage point in the kernel gives Kubernetes teams the power of security observability by understanding all process events, system calls, and networking operations in a Kubernetes cluster. eBPF’s flexibility also enables runtime security enforcement for process events, system calls, and networking operations for all pods, containers, and processes, allowing us to write customizable logic to instrument the kernel on any kernel event.
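For a small taste of that flexibility, the following one-liner logs every process execution on a node, containerized or not. It is a sketch, assuming root privileges and the bpftrace frontend rather than a hand-rolled eBPF program:

```
# Log every execve() on the node: the calling process and the new binary.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
  printf("%s (pid %d) exec %s\n", comm, pid, str(args->filename));
}'
```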
We will walk you through in detail what eBPF is, how you can use eBPF programs, and why they are vital in cloud native security (Chapter 2). But first we need to understand the basic container technology concepts.
Brief Guide to Container Security
Containers are Linux processes that run in the context of Linux namespaces, cgroups, and capabilities. Google added the first patch to the kernel in 2007, fittingly defining containers as process containers. This name provides a good insight into container technology, because containers are standard Linux processes with some isolated resources like networking and filesystems.
Containers are created and managed in the OS by low-level container runtimes, which are responsible for starting processes, creating cgroups (discussed later), putting processes into their own namespaces (also discussed later) using the unshare system call, and performing any cleanup when the container exits. These are the basic primitives of creating containers; more full-featured, low-level container runtimes like runC add further features.7
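We can approximate what a low-level runtime does with the unshare(1) wrapper around that system call. This minimal sketch (requires root) starts a shell in fresh PID and mount namespaces with its own /proc:

```
# Run a shell in new PID and mount namespaces; ps sees only what's inside.
sudo unshare --fork --pid --mount --mount-proc sh -c 'ps aux'
```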
With this broad description out of the way, let’s dive into the implementation details to highlight the security features and challenges of containers.
Kernel Namespaces
A process in Linux is an executable program (such as /bin/grep) run in memory by the kernel. A process gets a process ID, or PID (seen when you run ps xao pid,comm), its own memory address space (seen with pmap -d $PID), and file descriptors used to open, read, and write to files (seen with lsof -p $PID). Processes run as users with their permissions, either root (UID 0) or nonroot.
Containers use Linux namespaces to isolate these resources, creating the illusion that a container is the only container accessing resources on a system. Namespaces create an isolated view for various resources:
- PID namespace: This namespace masks process IDs so the container only sees the processes running inside it, not processes running in other containers or on the Kubernetes node.
- Mount namespace: This namespace unpacks the tarball of a container image (called a base image) on the node and chroots the directory for the container.8
- Network namespace: This namespace configures network interfaces and routing tables for containers to send and receive traffic. In Kubernetes, it can be disabled with hostNetwork, which gives a container direct access to services listening on localhost on the node and circumvents network policy.9
- IPC namespace: The IPC (inter-process communication) namespace facilitates shared memory between containers, including multiple containers running in a Kubernetes pod.
- UTS namespace: This namespace configures the hostname of a container.
- User namespace: This namespace separates root (UID 0) in a container from root (UID 0) on the node. Note that Kubernetes does not support the user namespace;10 running a container as root can lead to root on the node in the event of a container breakout. We can mitigate some of this risk by dropping capabilities (discussed later) and using seccomp to block system calls in the container, but it’s critical to run your containers as a nonroot user.
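Namespace membership is visible directly in /proc: each namespace is an inode, and two processes share a namespace exactly when the inode numbers match. For example (inode numbers will vary, and <container-pid> is a placeholder for a containerized process ID):

```
# List the namespaces the current shell belongs to.
ls -l /proc/$$/ns
# lrwxrwxrwx ... net -> 'net:[4026531992]'

# Compare against a containerized process; differing inodes mean
# different namespaces.
sudo ls -l /proc/<container-pid>/ns
```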
Cgroups
Cgroups can limit the node’s CPU and memory resources that a container can consume. From a security perspective, this prevents a “noisy neighbor” or DoS (denial of service) attack in which one container consumes all the hardware resources on a node. Containers that exceed their CPU limit are throttled by cgroups, whereas exceeding a memory limit causes an out-of-memory kill (OOM kill) event.
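To make that concrete, here is a hedged sketch of imposing a memory cap by hand, assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup. Container runtimes do the equivalent for every container:

```
# Create a cgroup, cap its memory at 100 MiB, and join it.
sudo mkdir /sys/fs/cgroup/demo
echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max
echo $$   | sudo tee /sys/fs/cgroup/demo/cgroup.procs
# Anything this shell starts that exceeds 100 MiB is now OOM-killed.
```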
Attack Points for Container Escapes
Attackers have targeted some nonnamespaced resources because they can provide a malicious container direct access to node resources. These resources include kernel modules, /dev, /sys/, /proc/sys/, sysctl settings, and more.
In addition to namespaces, containers utilize a mechanism called Linux capabilities to provide a more granular set of credentials to containers.
Linux Capabilities
In the old world, processes either ran as root (UID 0) or as a standard user (UID != 0). This system was binary: either a process was root and could do (almost) anything, or it was a normal user constrained to its own resources. But sometimes an unprivileged process needs a privileged capability, such as ping sending raw packets, without being granted full root permissions. To solve this, the kernel introduced capabilities, which grant unprivileged processes more granular security privileges, such as CAP_NET_RAW to enable ping to send raw packets.
Capabilities can be set on a file or a process. To observe the capabilities that a running process has, we can inspect the kernel’s virtual filesystem, /proc:

```
grep -E 'Cap|Priv' /proc/$(pgrep ping)/status
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
```
We can then use the capsh binary to decode these values into human-readable capabilities:

```
capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override...cap_net_raw...cap_sys_admin...
```
We can see the CAP_NET_RAW capability here, as well as a slew of other capabilities, because the root user can make any kernel function call.
There are several capability sets a process or file can be granted (effective, permitted, inheritable, ambient), but we’ll just cover effective. The effective capability set indicates what capabilities are active in a process. For example, when a process attempts to perform a privileged operation, the kernel will check for the appropriate capability bit in the effective set of the process.
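To see the effective set in action, one hedged sketch inspects the file capabilities on the ping binary with getcap (on distributions that ship ping with file capabilities rather than setuid root) and then drops CAP_NET_RAW. Note that modern ping may fall back to unprivileged ICMP datagram sockets, in which case the second command still succeeds:

```
# File capabilities on ping (distribution-dependent; output may be empty).
getcap "$(command -v ping)"
# e.g., /usr/bin/ping cap_net_raw=ep   ('e' marks the effective set)

# Drop CAP_NET_RAW from a root shell and retry (requires root).
sudo capsh --drop=cap_net_raw -- -c 'ping -c 1 127.0.0.1'
```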
This chapter has covered the very basics of container security; however, the authors highly recommend supplementing your reading with Liz Rice’s Container Security (O’Reilly).11 Now we can turn to how eBPF can illuminate security issues in Kubernetes, a distributed system that is responsible for running production containers.
1 SLOs are covered in more detail in Site Reliability Engineering by Betsy Beyer et al. (O’Reilly), which is free to read.
2 Beyer et al., Site Reliability Engineering.
3 The wonderful CNCF Technical Security Group has been working on secure defaults guidelines for CNCF projects.
4 Andrew Martin and Michael Hausenblas, Hacking Kubernetes (O’Reilly).
5 The Log4j vulnerability is due to Log4j parsing logs and attempting to resolve the data and variables in its input. The JNDI lookup allows variables to be fetched and resolved over a network, including to arbitrary entities on the internet. More details are in the CVE.
6 Suspicious domains can include a domain generation algorithm.
7 runC is currently the most widely used low-level container runtime. It’s responsible for “spawning and running containers on Linux according to the OCI specification.”
8 Container runtimes can block the CAP_SYS_CHROOT capability by default, and pivot_root is used due to security issues with accessible mounts.
9 Network policy allows you to specify the allowed connections a pod can make. It’s basically a firewall for containers. Several CNIs such as Cilium provide custom resource definitions (CRDs) for network policy to extend functionality to provide a layer 7 firewall, cluster-wide policies, and more.
10 There is an alpha (as of Kubernetes 1.22) project to run Kubernetes Node components in the user namespace.
11 This is required reading for anyone responsible for securing a cloud native environment.