Chapter 4. Tracing with BPF

In software engineering, tracing is a method to collect data for profiling and debugging. The objective is to provide useful information at runtime for future analysis. The main advantage of using BPF for tracing is that you can access almost any piece of information from the Linux kernel and your applications. BPF adds a minimal amount of overhead to the system’s performance and latency in comparison with other tracing technologies, and it doesn’t require developers to modify their applications for the sole purpose of gathering data from them.

The Linux kernel provides several instrumentation capabilities that can be used in conjunction with BPF. In this chapter we talk about these different capabilities. We show you how the kernel exposes those capabilities in your operating system so that you know how to find the information available to your BPF programs.

Tracing’s end goal is to provide you with a deep understanding of any system by taking all of the available data and presenting it to you in a useful way. We’re going to talk about a few different data representations and how you can use them in different scenarios.

Beginning in this chapter, we’re going to use a powerful toolkit to write BPF programs, the BPF Compiler Collection (BCC). BCC is a set of components that makes building BPF programs more predictable. Even if you master Clang and LLVM, you probably won’t want to spend more time than necessary building the same utilities and ensuring that the BPF verifier doesn’t reject your programs. BCC provides reusable components for common structures, like Perf event maps, and integration with the LLVM backend to provide better debugging options. On top of that, BCC includes bindings for several programming languages; we’re going to use Python in our examples. These bindings allow you to write the user-space part of your BPF programs in a high-level language, which results in more useful programs. We also use BCC in the following chapters to make our examples more concise.

The first step to be able to trace programs in the Linux kernel is to identify the extension points that it provides for you to attach BPF programs. Those extension points are commonly called probes.

Probes

One of the definitions in the English dictionary for the word probe is as follows:

An unmanned exploratory spacecraft designed to transmit information about its environment.

This definition evokes memories of sci-fi movies and epic NASA missions in our minds, and probably in yours too. When we talk about tracing probes, we can use a very similar definition.

Tracing probes are exploratory programs designed to transmit information about the environment in which they are executed.

They collect data in your system and make it available for you to explore and analyze. Traditionally, using probes in Linux involved writing programs that were compiled into kernel modules, which could cause catastrophic problems in production systems. Over the years, they evolved to be safer to execute but still cumbersome to write and test. Tools like SystemTap established new protocols to write probes and paved the way to get much richer information from the Linux kernel and all programs running in user-space.

BPF piggybacks on tracing probes to collect information for debugging and analysis. The safety guarantees of BPF programs make them more compelling to use than tools that still rely on recompiling the kernel. Recompiling the kernel to include external modules introduces the risk of crashes due to misbehaving code. The BPF verifier eliminates this risk by analyzing the program before it’s loaded into the kernel. The BPF developers took advantage of the existing probe definitions and modified the kernel to execute BPF programs, rather than kernel modules, when code execution reaches one of those definitions.

Understanding the different types of probes that you can define is fundamental to exploring what’s happening within your system. In this section, we describe the different types of probe definitions, how to discover them in your system, and how to attach BPF programs to them.

In this chapter, we cover four different types of probes:

Kernel probes

These give you dynamic access to internal components in the kernel.

Tracepoints

These provide static access to internal components in the kernel.

User-space probes

These give you dynamic access to programs running in user-space.

User statically defined tracepoints

These allow static access to programs running in user-space.

Let’s begin with kernel probes.

Kernel Probes

Kernel probes allow you to set dynamic flags, or breaks, in almost any kernel instruction with a minimum of overhead. When the kernel reaches one of these flags, it executes the code attached to the probe, and then resumes its usual routine. Kernel probes can give you information about anything happening in your system, such as files opened in your system and binaries being executed. One important thing to keep in mind about kernel probes is that they don’t have a stable application binary interface (ABI), which means that they might change between kernel versions. The same code might stop working if you try to attach the same probe to two systems with two different kernel versions.

Kernel probes are divided into two categories: kprobes and kretprobes. Their use depends on where in the execution cycle you can insert your BPF program. This section guides you on how to use each one of them to attach BPF programs to those probes and extract information from the kernel.

Kprobes

Kprobes allow you to insert BPF programs before any kernel instruction is executed. You need to know the function signature that you want to break into, and as we mentioned earlier, this is not a stable ABI, so you’ll want to be careful setting these probes if you’re going to run the same program in different kernel versions. When the kernel execution arrives at the instruction where you set your probe, it sidesteps into your code, runs your BPF program, and then returns execution to the original instruction.

To show you how to use kprobes, we’re going to write a BPF program that prints the name of any binary that’s executed in your system. We’re going to use the Python frontend for the BCC tools in this example, but you can write it with any other BPF tooling:

from bcc import BPF

bpf_source = """
int do_sys_execve(struct pt_regs *ctx, void *filename, void *argv, void *envp) { 1
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("executing program: %s", comm);
  return 0;
}
"""

bpf = BPF(text = bpf_source)	2
execve_function = bpf.get_syscall_fnname("execve")		3
bpf.attach_kprobe(event = execve_function, fn_name = "do_sys_execve")	4
bpf.trace_print()
1

Our BPF program starts. The helper bpf_get_current_comm fetches the name of the command that the kernel is currently running and stores it in our comm variable. We’ve defined this as a fixed-length array because the kernel has a 16-character limit for command names. After getting the command name, we print it in our debug trace, so the person running the Python script can see all commands captured by BPF.

2

Load the BPF program into the kernel.

3

Associate the program with the execve syscall. The name of this syscall has changed in different kernel versions, and BCC provides a function to retrieve this name without you having to remember which kernel version you’re running.

4

The code outputs the trace log, so you can see all of the commands that you’re tracing with this program.

Kretprobes

Kretprobes are going to insert your BPF program when a kernel instruction returns a value after being executed. Usually, you’ll want to combine both kprobes and kretprobes into a single BPF program so that you have a full picture of the instruction’s behavior.

We’re going to use a similar example to the one in the previous section to show you how kretprobes work:

from bcc import BPF

bpf_source = """
int ret_sys_execve(struct pt_regs *ctx) {	1
  int return_value;
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  return_value = PT_REGS_RC(ctx);

  bpf_trace_printk("program: %s, return: %d", comm, return_value);
  return 0;
}
"""

bpf = BPF(text = bpf_source)	   2
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kretprobe(event = execve_function, fn_name = "ret_sys_execve")  3
bpf.trace_print()
1

Define the function that implements the BPF program. The kernel will execute it immediately after the execve syscall finishes. PT_REGS_RC is a macro that reads the return value from the register in this specific BPF context. We also use bpf_trace_printk to print the command and its return value in our debug log.

2

Initialize the BPF program and load it in the kernel.

3

Change the attachment function to attach_kretprobe.

Kernel probes are a powerful way to access the kernel. But as we mentioned earlier, they might be unstable because you’re attaching to dynamic points in the kernel’s source that might change or disappear from one version to another. Now you’ll see a different method to attach programs to the kernel that is safer.

Tracepoints

Tracepoints are static markers in the kernel’s code that you can use to attach code in a running kernel. The main difference with kprobes is that they are codified by the kernel developers when they implement changes in the kernel; that’s why we refer to them as static. Because they are static, the ABI for tracepoints is more stable; the kernel always guarantees that a tracepoint in an old version is going to exist in new versions. However, given that developers need to add them to the kernel, they might not cover all the subsystems that form the kernel.

As we mentioned in Chapter 2, you can see all of the available tracepoints in your system by listing all the files in /sys/kernel/debug/tracing/events. For example, you can find all of the tracepoints for BPF itself by listing the events defined in /sys/kernel/debug/tracing/events/bpf:

sudo ls -la /sys/kernel/debug/tracing/events/bpf
total 0
drwxr-xr-x  14 root root 0 Feb  4 16:13 .
drwxr-xr-x 106 root root 0 Feb  4 16:14 ..
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_map_create
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_map_delete_elem
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_map_lookup_elem
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_map_next_key
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_map_update_elem
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_obj_get_map
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_obj_get_prog
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_obj_pin_map
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_obj_pin_prog
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_prog_get_type
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_prog_load
drwxr-xr-x   2 root root 0 Feb  4 16:13 bpf_prog_put_rcu
-rw-r--r--   1 root root 0 Feb  4 16:13 enable
-rw-r--r--   1 root root 0 Feb  4 16:13 filter

Every subdirectory listed in that output corresponds to a tracepoint that we can attach BPF programs to. But there are two additional files there. The first file, enable, allows you to enable and disable all tracepoints for the BPF subsystem. If the content of the file is 0, the tracepoints are disabled; if the content of the file is 1, the tracepoints are enabled. The filter file allows you to write expressions that the Trace subsystem in the kernel will use to filter events. BPF doesn’t use this file; read more in the kernel’s tracing documentation.
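
If you want to flip these switches from a program instead of a shell, writing to the enable file is all it takes. The following short sketch is ours, not part of the kernel’s tooling; it assumes that debugfs is mounted at the usual path and that you run it as root:

enable_path = "/sys/kernel/debug/tracing/events/bpf/enable"

# Read the current state of the BPF subsystem's tracepoints.
with open(enable_path, "r") as f:
    print("currently enabled:", f.read().strip())

# Enable them; write "0" instead to disable them again.
with open(enable_path, "w") as f:
    f.write("1")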

Writing BPF programs to take advantage of tracepoints is similar to tracing with kprobes. Here’s an example that uses a BPF program to trace all of the applications in your system that load other BPF programs:

from bcc import BPF

bpf_source = """
int trace_bpf_prog_load(void *ctx) {	1
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));

  bpf_trace_printk("%s is loading a BPF program", comm);
  return 0;
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_tracepoint(tp = "bpf:bpf_prog_load",
                      fn_name = "trace_bpf_prog_load") 2
bpf.trace_print()
1

Declare the function that defines the BPF program. This code must look familiar to you already; there are only a few syntactic changes from the first example you saw when we talked about kprobes.

2

The main difference in this program: instead of attaching the program to a kprobe, we’re attaching it to a tracepoint. BCC follows a convention for naming tracepoints: first you specify the subsystem to trace, bpf in this case, followed by a colon and the name of the tracepoint within that subsystem, bpf_prog_load. This means that every time the kernel executes the function bpf_prog_load, this program will receive the event, and it will print the name of the application that’s executing that bpf_prog_load instruction.

Kernel probes and tracepoints are going to give you full access to the kernel. We recommend that you use tracepoints whenever possible, but don’t feel obligated to stick to tracepoints only because they are safer. Take advantage of the dynamic nature of kernel probes. In the next section we discuss how to get a similar level of visibility in programs running in user-space.

User-Space Probes

User-space probes allow you to set dynamic flags in programs running in user-space. They are the equivalent of kernel probes for instrumenting programs that run outside the kernel. When you define a uprobe, the kernel creates a trap around the attached instruction. When your application reaches that instruction, the kernel triggers an event that has your probe function as a callback. Uprobes also give you access to any library that your program is linked to, and you can trace those calls if you know the correct name for the instruction.
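
Because uprobes can target shared libraries, you can, for example, trace calls into the C library itself. The next sketch is ours rather than one of the examples we develop later; it attaches a probe to strlen, letting BCC resolve the library path from the short name c. Expect very verbose output, because strlen is called constantly:

from bcc import BPF

bpf_source = """
int trace_strlen(struct pt_regs *ctx) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("%s called strlen in libc", comm);
  return 0;
}
"""

bpf = BPF(text = bpf_source)
# "c" is resolved by BCC to the C library; sym is the library symbol to probe.
bpf.attach_uprobe(name = "c", sym = "strlen", fn_name = "trace_strlen")
bpf.trace_print()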

Much like kernel probes, user-space probes are also divided into two categories, uprobes and uretprobes, depending on where in the execution cycle you can insert your BPF program. Let’s jump right in with some examples.

Uprobes

Generally speaking, uprobes are hooks that the kernel inserts into a program’s instruction set before a specific instruction is executed. You need to be careful when you attach uprobes to different versions of the same program because function signatures might change internally between those versions. The only way to guarantee that a BPF program is going to run in two different versions is to ensure that the signature has not changed. You can use the command nm in Linux to list all the symbols included in an ELF object file, which is a good way to check whether the instruction that you’re tracing still exists in your program. As an example, take this small Go program:

package main
import "fmt"

func main() {
        fmt.Println("Hello, BPF")
}

You can compile this Go program by using go build -o hello-bpf main.go. You can use the command nm to get information about all the instruction points that the binary file includes. nm is a program included in the GNU Development Tools that lists symbols from object files. If you filter the symbols with main in their name, you get a list similar to this one:

nm hello-bpf | grep main
0000000004850b0 T main.init
00000000567f06 B main.initdone.
00000000485040 T main.main
000000004c84a0 R main.statictmp_0
00000000428660 T runtime.main
0000000044da30 T runtime.main.func1
00000000044da80 T runtime.main.func2
000000000054b928 B runtime.main_init_done
00000000004c8180 R runtime.mainPC
0000000000567f1a B runtime.mainStarted

Now that you have a list of symbols, you can trace when they are executed, even between different processes executing the same binary.

To trace when the main function in our previous Go example is executed, we’re going to write a BPF program, and we’re going to attach it to a uprobe that will break before any process invokes that instruction:

from bcc import BPF

bpf_source = """
int trace_go_main(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();		1
  bpf_trace_printk("New hello-bpf process running with PID: %d", pid);
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf",
    sym = "main.main", fn_name = "trace_go_main")	2
bpf.trace_print()
1

Use the function bpf_get_current_pid_tgid to get the process identifier (PID) for the process that’s running our hello-bpf program.

2

Attach this program to a uprobe. This call needs to know the object we want to trace, hello-bpf, which is the path to the object file. It also needs the symbol that we’re tracing inside the object, main.main in this case, and the BPF program that we want to run. With this, every time someone runs hello-bpf in our system, we’ll get a new log in our trace pipe.

Uretprobes

Uretprobes are the parallel probe to kretprobes, but for user-space programs. They attach BPF programs to instructions that return values, and give you access to those returned values by accessing the registers from your BPF code.

Combining uprobes and uretprobes allows you to write more complex BPF programs. They can give you a more holistic view of applications running in your system. When you can inject tracing code before a function runs and immediately after it completes, you can begin gathering more data and measure application behaviors. A common use case is to measure how long a function takes to execute, without having to change a single line of code in your application.

We’re going to reuse the Go program we wrote in “Uprobes” to measure how long it takes to execute the main function. This BPF example is longer than the previous examples that you’ve seen, so we’ve divided it into different blocks of code:

bpf_source = """
int trace_go_main(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello-bpf process running with PID: %d", pid); 1
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf", 	2
    sym = "main.main", fn_name = "trace_go_main")	3
bpf.trace_print()
1

Create a BPF hash map. This table allows us to share data between the uprobe and uretprobe functions. In this case we use the application PID as the table key, and we store the function start time as the value. The two most interesting operations in our uprobe function happen as described next.

2

Capture the current time in the system in nanoseconds, as seen by the kernel.

3

Create an entry in our cache with the program PID and the current time. We can assume that this time is the application’s function start time.

Let’s declare our uretprobe function now. We implement the function to attach when the instruction finishes; this uretprobe function is similar to the ones you saw in “Kretprobes”:

bpf_source += """
static int print_duration(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();       1
  u64 start_time_ns = cache.lookup(&pid);
  if (start_time_ns == 0) {
    return 0;
  }
  u64 duration_ns = bpf_ktime_get_ns() - start_time_ns;
  bpf_trace_printk("Function call duration: %d", duration_ns);  2
  return 0;     3
}
"""
1

Obtain the PID for our application; we need it to find its starting time. We use the map function lookup to fetch that time from the map where we stored it before the function ran.

2

Calculate the function duration by subtracting that time from the current time.

3

Print the latency in our trace log so we can display it in the terminal.

Now, the rest of the program needs to attach these two BPF functions to the right probes:

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf", sym = "main.main",
           fn_name = "trace_start_time")
bpf.attach_uretprobe(name = "hello-bpf", sym = "main.main",
           fn_name = "print_duration")
bpf.trace_print()

We’ve added a line to our original uprobe example where we’re attaching our print function to the uretprobe for our application.

In this section you saw how to trace operations that happen in user-space with BPF. By combining BPF functions that are executed at different points in your application’s lifecycle, you can begin extracting much richer information from it. However, as we mentioned at the beginning of this section, user-space probes are powerful, but they are also unstable. Our BPF examples can stop working only because someone decides to rename an application’s function. Now let’s look at a more stable way to trace user-space programs.

User Statically Defined Tracepoints

User statically defined tracepoints (USDTs) provide static tracepoints for applications in user-space. This is a convenient way to instrument applications because they give you a low-overhead entry point to the tracing capabilities that BPF offers. You can also use them as a convention to trace applications in production, regardless of the programming language with which these applications are written.

USDTs were popularized by DTrace, a tool originally developed at Sun Microsystems for dynamic instrumentation of Unix systems. DTrace was not available in Linux until recently due to licensing issues; however, the Linux kernel developers took a lot of inspiration from the original work in DTrace to implement USDTs.

Much like the static kernel tracepoints you saw earlier, USDTs require developers to instrument their code with instructions that the kernel will use as traps to execute BPF programs. The Hello World version of USDTs is only a few lines of code:

 #include <sys/sdt.h>
 int main() {
   DTRACE_PROBE("hello-usdt", "probe-main");
 }

In this example, we’re using a macro that Linux provides to define our first USDT. You can already see where the kernel developers took their inspiration from. DTRACE_PROBE is going to register the tracepoint that the kernel will use to inject our BPF function callback. The first argument in this macro is the program that’s reporting the trace. The second one is the name of the trace that we’re reporting.

Many applications that you might have installed in your system use this type of probe to give you access to runtime tracing data in a predictable way. The popular database MySQL, for example, exposes all kinds of information using statically defined tracepoints. You can gather information from queries executed in the server as well as from many other user operations. Node.js, the JavaScript runtime built on top of Chrome’s V8 engine, also provides tracepoints that you can use to extract runtime information.

Before showing you how to attach BPF programs to these user-defined tracepoints, we need to talk about discoverability. Because these tracepoints are defined in binary format inside the executable files, we need a way to list the probes defined by a program without having to dig through the source code. One way to extract this information is by reading the ELF binary directly. First, we’re going to compile our previous Hello World USDT example; we can use GCC for that:

gcc -o hello_usdt hello_usdt.c

This command is going to generate a binary file called hello_usdt that we can use to start playing with several tools to discover the tracepoints that it defines. Linux provides a utility called readelf to show you information about ELF files. You can use it with our compiled example:

readelf -n ./hello_usdt

You can see the USDT that we defined in the output of this command:

Displaying notes found in: .note.stapsdt
  Owner                 Data size        Description
  stapsdt              0x00000033        NT_STAPSDT (SystemTap probe descriptors)
    Provider: "hello-usdt"
    Name: "probe-main"

readelf can give you a lot of information about a binary file; in our small example, it shows only a few lines of information, but its output becomes cumbersome to parse for more complicated binaries.

A better option to discover the tracepoints defined in a binary file is to use BCC’s tplist tool, which can display both kernel tracepoints and USDTs. The advantage of this tool is the simplicity of its output; it shows you only tracepoint definitions, without any additional information about the executable. Its usage is similar to readelf:

 tplist -l ./hello_usdt

It lists every tracepoint defined in the binary, one per line. In our example, it displays only a single line with our probe-main definition:

 ./hello_usdt "hello-usdt":"probe-main"
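
You can also get this information from Python; BCC’s USDT bindings expose the same probe metadata that tplist prints. This short sketch is ours and assumes the hello_usdt binary we just compiled is in the current directory:

from bcc import USDT

# Enumerate the USDT probes embedded in the binary; this mirrors what tplist does.
usdt = USDT(path = "./hello_usdt")
for probe in usdt.enumerate_probes():
    # Depending on the BCC version, these attributes may be byte strings.
    print("%s %s:%s" % (probe.bin_path, probe.provider, probe.name))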

After you know the supported tracepoints in your binary, you can attach BPF programs to them in a similar way to what you’ve seen in previous examples:

from bcc import BPF, USDT

bpf_source = """
#include <uapi/linux/ptrace.h>
int trace_binary_exec(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello_usdt process running with PID: %d", pid);
}
"""

usdt = USDT(path = "./hello_usdt")	1
usdt.enable_probe(probe = "probe-main", fn_name = "trace_binary_exec")     2
bpf = BPF(text = bpf_source, usdt = usdt)	3
bpf.trace_print()

There is a major change in this example that requires some explanation.

1

Create a USDT object; we haven’t done this in our previous examples. USDTs are not part of BPF, in the sense that you can use them without having to interact with the BPF VM. Because they are independent of one another, it makes sense that their usage is independent of the BPF code.

2

Attach the BPF function to trace program executions to the probe in our application.

3

Initialize our BPF environment with the tracepoint definition that we just created. This will inform BCC that it needs to generate the code to connect our BPF program with the probe definition in our binary file. When both of them are connected, we can print the traces generated by our BPF program to discover new executions of our binary example.

USDT bindings for other languages

You can also use USDTs to trace applications written in programming languages besides C. You’ll be able to find bindings for Python, Ruby, Go, Node.js, and many other languages on GitHub. The Ruby bindings are one of our favorites because of their simplicity and interoperability with frameworks such as Rails. Dale Hamel, who currently works at Shopify, wrote an excellent report about the usage of USDTs in his blog. He also maintains a library called ruby-static-tracing that makes tracing Ruby and Rails applications even more straightforward.

Hamel’s static tracing library allows you to inject tracing capabilities at the class level without requiring you to add the tracing logic to every method in that class. In complex scenarios, it also gives you convenient methods to register dedicated tracing endpoints yourself.

To use ruby-static-tracing in your applications, first you need to configure when the tracepoints are going to be enabled. You can turn them on by default when the application starts, but if you want to avoid the overhead of collecting data all the time, you can activate them on demand by sending a signal to the process. Hamel recommends using SIGPROF for this purpose:

require 'ruby-static-tracing'

StaticTracing.configure do |config|
  config.mode = StaticTracing::Configuration::Modes::SIGNAL
  config.signal = StaticTracing::Configuration::Modes::SIGNALS::SIGPROF
end

With this configuration in place, you can use the kill command to enable your application’s static tracepoints on demand. In the next example, we assume that there is only one Ruby process running on our machine, and we can get its process identifier using pgrep:

kill -SIGPROF `pgrep -nx ruby`

Besides configuring when the tracepoints are active, you might want to use some of the built-in tracing mechanisms that ruby-static-tracing provides. At the time of writing this, the library incorporates tracepoints to measure latency and to collect stack traces. We really like how a tedious task such as measuring function latency becomes almost trivial by using this built-in module. First, you need to add the latency tracer to your initial configuration:

require 'ruby-static-tracing'
require 'ruby-static-tracing/tracer/concerns/latency_tracer'

StaticTracing.configure do |config|
  config.add_tracer(StaticTracing::Tracer::Latency)
end

After that, every class that includes the latency module generates static tracepoints for each public method defined. When tracing is enabled, you can query those tracepoints to collect timing data. In our next example, ruby-static-tracing generates a static tracepoint named usdt:/proc/X/fd/Y:user_model:find, following the convention of using the class name as the namespace for the tracepoint and using the method name as the tracepoint name:

class UserModel
  def find(id)
  end

  include StaticTracing::Tracer::Concerns::Latency
end

Now we can use BCC to extract the latency information for each call to our find method. To do that, we use BCC’s built-in functions bpf_usdt_readarg and bpf_usdt_readarg_p. These functions read the arguments set each time our application’s code is executed. ruby-static-tracing always sets the method name as the first argument for the tracepoint, whereas it sets the calculated value as the second argument. The next snippet implements the BPF program that gets the tracepoint information and prints it in the tracing log:

bpf_source = """
#include <uapi/linux/ptrace.h>
int trace_latency(struct pt_regs *ctx) {
  char method[64];
  u64 latency;

  bpf_usdt_readarg_p(1, ctx, &method, sizeof(method));
  bpf_usdt_readarg(2, ctx, &latency);

  bpf_trace_printk("method %s took %d ms", method, latency);
}
"""

We also need to load the previous BPF program into the kernel. Because we’re tracing a specific application that’s already running in our machine, we can attach the program to the specific process identifier:

import argparse

from bcc import BPF, USDT

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--pid", type = int, help = "Process ID")	1
args = parser.parse_args()

usdt = USDT(pid = int(args.pid))
usdt.enable_probe(probe = "latency", fn_name = "trace_latency")		2
bpf = BPF(text = bpf_source, usdt = usdt)
bpf.trace_print()
1

Specify the argument for the PID of the process that we want to trace.

2

Enable the probe, load the program into the kernel, and print the tracing log. (This section is very similar to the one you saw earlier.)

In this section, we’ve shown you how to introspect applications that define tracepoints statically. Many well-known libraries and programming languages include these probes to help you debug running applications, gaining more visibility when they run in production environments. This is only the tip of the iceberg; after you have the data, you need to make sense of it. This is what we explore next.

Visualizing Tracing Data

So far, we’ve shown examples that print data in our debug output. This is not very useful in production environments. You want to make sense of that data, but nobody likes to make sense of long and complicated logs. If we want to monitor changes in latency and CPU utilization, it’s easier to do it by looking at graphs over a time period than aggregating numbers from a file stream.

This section explores different ways to present BPF tracing data. On one hand, we’ll show you how BPF programs can structure information in aggregates for you. On the other hand, you’ll learn how to export that information in a portable representation and use off-the-shelf tools to access a richer representation and share your findings with other people.

Flame Graphs

Flame graphs are diagrams that help you visualize how your system is spending time. They can give you a clear representation of which code in an application is executed more often. Brendan Gregg, the creator of flame graphs, maintains a set of scripts to easily generate these visualization formats on GitHub. We use those scripts to generate flame graphs from data collected with BPF later in this section. You can see what these graphs look like in Figure 4-1.

Figure 4-1. A CPU flame graph

There are two important things to remember about what a flame graph shows:

  • The x-axis is ordered alphabetically. The width of each stack represents how often it appears in the collected data, which can be correlated to how often that code path has been visited while the profiler was enabled.

  • The y-axis shows the stack traces ordered as the profiler reads them, preserving the trace hierarchy.

The most well-known flame graphs represent the most frequent code consuming CPU in your system; these are called on-CPU graphs. Another interesting flame graph visualization is off-CPU graphs; they represent the time that a CPU spends on other tasks that are not related to your application. By combining on-CPU and off-CPU graphs, you can get a complete view of what your system is spending CPU cycles on.

Both on-CPU and off-CPU graphs use stack traces to indicate where the system is spending time. Some programming languages, such as Go, always include trace information in their binaries, but others, such as C++ and Java, require some extra work to make stack traces readable. After your application includes stack trace information, BPF programs can use it to aggregate the most frequent code paths as seen by the kernel.

Note

There are advantages and disadvantages to stack trace aggregation in the kernel. On one hand, it’s an efficient way to count stack trace frequency because it happens in the kernel, avoiding sending all of the stack information to user-space and reducing the data interchange between the kernel and user-space. On the other hand, the number of events to process for off-CPU graphs can be significantly high, because you’re keeping track of every event that occurs during your application’s context switch. This can create significant overhead in your system if you try to profile it for too long. Keep this in mind when you’re working with flame graphs.

BCC provides several utilities to help you aggregate and visualize stack traces, but the main one is the macro BPF_STACK_TRACE. This macro generates a BPF map of type BPF_MAP_TYPE_STACK_TRACE to store the stacks that your BPF program accumulates. On top of that, this BPF map is enhanced with methods to extract the stack information from the program’s context and walk the stack traces accumulated when you want to use them after aggregating them.

In the next example, we build a simple BPF profiler that prints the stack traces collected from user-space applications. We generate on-CPU flame graphs with the traces that our profiler collects. To test this profiler, we’re going to write a minimal Go program that generates CPU load. This is the code for that minimal application:

package main

import "time"

func main() {
	j := 3
	for time.Since(time.Now()) < time.Second {
		for i := 1; i < 1000000; i++ {
			j *= i
		}
	}
}

If you save this code in a file called main.go and run it with go run main.go, you’ll see that your system’s CPU utilization increases significantly. You can stop the execution by pressing Ctrl-C on your keyboard, and the CPU utilization will go back to normal.

The first part of our BPF program is going to initialize the profiler structures:

bpf_source = """
#include <uapi/linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>
#include <linux/sched.h>

struct trace_t { 	1
  int stack_id;
};

BPF_HASH(cache, struct trace_t);	2
BPF_STACK_TRACE(traces, 10000);		3
"""
1

Initialize a structure that will store the reference identifier for each one of the stack frames that our profiler receives. We use these identifiers later to find out which code path was being executed at that time.

2

Initialize a BPF hash map that we use to aggregate how often we see the same stack frame. The flame graph scripts use this aggregated value to determine how frequently the same code is executed.

3

Initialize our BPF stack trace map. We’re setting a maximum size for this map, but it can vary depending on how much data you want to process. It would be better to have this value as a variable, but we know that our Go application is not very big, so 10,000 elements is plenty.

Next, we implement the function that aggregates stack traces in our profiler:

bpf_source += """
int collect_stack_traces(struct bpf_perf_event_data *ctx) {
  u32 pid = bpf_get_current_pid_tgid() >> 32;		1
  if (pid != PROGRAM_PID)
    return 0;

  struct trace_t trace = {	2
    .stack_id = traces.get_stackid(&ctx->regs, BPF_F_USER_STACK)
  };

  cache.increment(trace);	3
  return 0;
}
"""
1

Verify that the process ID for the program in the current BPF context is the one for our Go application; otherwise, we ignore the event. We have not defined the value for PROGRAM_PID yet. Let’s replace this string in the Python part of the profiler before initializing the BPF program. This is a current limitation in the way BCC initializes BPF programs; we cannot pass any variables from user-space, and as a common practice, these strings are replaced in the code before initialization.

2

Create a trace to aggregate its usage. We fetch the stack ID from the program’s context with the built-in function get_stackid. This is one of the helpers that BCC adds to our stack trace map. We use the flag BPF_F_USER_STACK to indicate that we want to get the stack ID for the user-space application, and we don’t care about what happens inside the kernel.

3

Increment the counter for our trace to keep track of how often the same code is being exercised.

Next, we’re going to attach our stack trace collector to all Perf events in the kernel:

import errno
import signal
import sys
from time import sleep
from bcc import BPF, PerfType, PerfSWConfig

program_pid = int(sys.argv[1])		1
bpf_source = bpf_source.replace('PROGRAM_PID', str(program_pid)) 2

bpf = BPF(text = bpf_source)
bpf.attach_perf_event(ev_type = PerfType.SOFTWARE,          3
                      ev_config = PerfSWConfig.CPU_CLOCK,
                      fn_name = 'collect_stack_traces')
1

The first argument for our Python program. This is the process identifier for the Go application that we’re profiling.

2

Use Python’s built-in replace function to swap the string PROGRAM_PID in our BPF source with the argument provided to the profiler.

3

Attach the BPF program to all Software Perf events; this will ignore any other events, like Hardware events. We’re also configuring our BPF program to use the CPU clock as the time source so that we can measure execution time.

Finally, we need to implement the code that will dump the stack traces in our standard output when the profiler is interrupted:

def signal_ignore(sig, frame):
  # helper used by many BCC tools to keep printing results after the first Ctrl-C
  print()

try:
  sleep(99999999)
except KeyboardInterrupt:
  signal.signal(signal.SIGINT, signal_ignore)

cache = bpf["cache"]
traces = bpf["traces"]

for trace, acc in sorted(cache.items(), key=lambda cache: cache[1].value):   1
  line = []
  if trace.stack_id < 0 and trace.stack_id == -errno.EFAULT:		     2
    line = [b'Unknown stack']
  else:
    stack_trace = list(traces.walk(trace.stack_id))
    for stack_address in reversed(stack_trace):			             3
      line.append(bpf.sym(stack_address, program_pid))		     	4

  frame = b";".join(line).decode('utf-8', 'replace')		     	5
  print("%s %d" % (frame, acc.value))
1

Iterate over all the traces we collected so that we can print them in order.

2

Validate that we got stack identifiers that we can correlate with specific lines of code later. If we get an invalid value, we’ll use a placeholder in our flame graph.

3

Iterate over all the entries in the stack trace in reverse order. We do this because we want to see the most recently executed code path at the top, like you’d expect in any stack trace.

4

Use the BCC helper sym to translate the memory address for the stack frame into a function name in our source code.

5

Format the stack trace line separated by semicolons. This is the format that the flame graph scripts expect later to be able to generate our visualization.

With our BPF profiler complete, we can run it as sudo to collect stack traces for our busy Go program. We need to pass the Go program’s process ID to our profiler to make sure that we collect only traces for this application; we can find that PID using pgrep. This is how you run the profiler if you save it in a file called profiler.py:

./profiler.py `pgrep -nx go` > /tmp/profile.out

pgrep searches for the PID of a process running on your system whose name matches go. We send our profiler’s output to a temporary file so that we can generate the flame graph visualization.

As we mentioned earlier, we’re going to use Brendan Gregg’s FlameGraph scripts to generate an SVG file for our graph; you can find those scripts in his GitHub repository. After you’ve downloaded that repository, you can use flamegraph.pl to generate the graph. You can open the graph with your favorite browser; we’re using Firefox in this example:

./flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg && \
  firefox /tmp/flamegraph.svg

This kind of profiler is useful for tracing performance issues in your system. BCC already includes a more advanced profiler than the one in our example that you can use in your production environments directly. Besides the profiler, BCC includes tools to help you generate off-CPU flame graphs and many other visualizations to analyze systems.

Flame graphs are useful for performance analysis. We use them frequently in our day-to-day work. In many scenarios, besides visualizing hot code paths, you’ll want to measure how often events in your systems occur. We focus on that next.

Histograms

Histograms are diagrams that show you how frequently several ranges of values occur. The numeric data to represent this is divided into buckets, and each bucket contains the number of occurrences of any data point within the bucket. The frequency that histograms measure is the combination of the height and width of each bucket. If the buckets are divided into equal ranges, this frequency matches the histogram’s height, but if the ranges are not divided equally, you need to multiply each height by each width to find the correct frequency.

Histograms are a fundamental component of systems performance analysis. They are a great tool to represent the distribution of measurable events, like instruction latency, because they show you more accurate information than you can get with other measurements, like averages.
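
To see why, consider a tiny thought experiment in plain Python, unrelated to any BCC API: two very different latency distributions can share the same average, but their histograms look nothing alike:

# Two synthetic latency samples (in milliseconds) with the same mean but very
# different shapes; a log2-style histogram exposes the difference immediately.
fast_with_outliers = [1] * 90 + [100] * 10   # mostly fast calls, a few very slow ones
steady = [10.9] * 100                        # consistently medium calls

for name, samples in (("fast_with_outliers", fast_with_outliers), ("steady", steady)):
    mean = sum(samples) / len(samples)
    buckets = {}
    for value in samples:
        bucket = int(value).bit_length()     # power-of-two bucketing, like BCC's histograms
        buckets[bucket] = buckets.get(bucket, 0) + 1
    print(name, "mean:", round(mean, 1), "buckets:", sorted(buckets.items()))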

BPF programs can create histograms based on many metrics. You can use BPF maps to collect the information, classify it in buckets, and then generate the histogram representation for your data. Implementing this logic is not complicated, but it becomes tedious if you want to print histograms every time you need to analyze a program’s output. BCC includes an implementation out of the box that you can reuse in every program, without having to calculate bucketing and frequency manually every single time. Additionally, the kernel source has a fantastic implementation that we encourage you to check out in the BPF samples.

As a fun experiment, we’re going to show you how to use BCC’s histograms to visualize the latency introduced by loading BPF programs when an application calls the bpf_prog_load instruction. We use kprobes to collect how long it takes for that instruction to complete, and we’ll accumulate the results in a histogram that we’ll visualize later. We’ve divided this example into several parts to make it easier to follow.

This first part includes the initial source for our BPF program:

bpf_source = """
#include <uapi/linux/ptrace.h>

BPF_HASH(cache, u64, u64);
BPF_HISTOGRAM(histogram);

int trace_bpf_prog_load_start(void ctx) {		1
  u64 pid = bpf_get_current_pid_tgid();			2
  u64 start_time_ns = bpf_ktime_get_ns();
  cache.update(&pid, &start_time_ns);			3
  return 0;
}
"""
1

Use a macro to create a BPF hash map to store the initial time when the bpf_prog_load instruction is triggered.

2

Use a new macro to create a BPF histogram map. This is not a native BPF map; BCC includes this macro to make it easier for you to create these visualizations. Under the hood, this BPF histogram uses array maps to store the information. It also has several helpers to do the bucketing and create the final graph.

3

Use the program’s PID to store when the application triggers the instruction we want to trace. (This function will look familiar to you; we took it from the previous Uprobes example.)

Let’s see how we calculate the delta for the latency and store it in our histogram. The initial lines of this new block of code will also look familiar because we’re still following the example we talked about in “Uprobes”:

bpf_source += """
int trace_bpf_prog_load_return(void *ctx) {
  u64 *start_time_ns, delta;
  u64 pid = bpf_get_current_pid_tgid();
  start_time_ns = cache.lookup(&pid);
  if (start_time_ns == 0)
    return 0;

  delta = bpf_ktime_get_ns() - *start_time_ns;		1
  histogram.increment(bpf_log2l(delta));		2
  return 0;
}
"""
1

Calculate the delta between the time the instruction was invoked and the time it took our program to arrive here; we can assume that it’s also the time the instruction completed.

2

Store that delta in our histogram. We do two operations in this line. First, we use the built-in function bpf_log2l to generate the bucket identifier for the value of the delta. This function creates a stable distribution of values over time. Then, we use the increment function to add a new item to this bucket. By default, increment adds 1 to the value if the bucket existed in the histogram, or it starts a new bucket with the value of 1, so you don’t need to worry about whether the value exists in advance.
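
If you want a feel for which bucket a given delta lands in, here is a tiny user-space sketch of ours that reproduces the same power-of-two bucketing; for positive values, Python’s bit_length() returns floor(log2(v)) + 1, which is the bucket index that bpf_log2l computes:

# Reproduce bpf_log2l's bucketing in plain Python to see how latency deltas group together.
def log2_bucket(value):
    return value.bit_length() if value > 0 else 0

for delta_ns in (900, 1024, 1500, 4096, 1000000):
    print("%8d ns -> bucket %d" % (delta_ns, log2_bucket(delta_ns)))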

The last piece of code that we need to write attaches these two functions to the valid kprobes and prints the histogram on the screen so that we can see the latency distribution. This section is where we initialize our BPF program and we wait for events to generate the histogram:

from bcc import BPF
from time import sleep

bpf = BPF(text = bpf_source)			1
bpf.attach_kprobe(event = "bpf_prog_load",
    fn_name = "trace_bpf_prog_load_start")
bpf.attach_kretprobe(event = "bpf_prog_load",
    fn_name = "trace_bpf_prog_load_return")

try:						2
  sleep(99999999)
except KeyboardInterrupt:
  print()

bpf["histogram"].print_log2_hist("nsecs")	3
1

Initialize BPF and attach our functions to kprobes.

2

Make our program wait so that we can gather as many events as we need from our system.

3

Print the histogram map in our terminal with the traced distribution of events. We use BCC’s table accessor, bpf["histogram"], to get the map, and its print_log2_hist helper to render the distribution.

As we mentioned at the beginning of this section, histograms can be useful to observe anomalies in your system. The BCC tools include numerous scripts that make use of histograms to represent data; we highly recommend you take a look at them when you need inspiration to dive into your system.

Perf Events

We believe that Perf events are probably the most important communication method that you need to master to use BPF tracing successfully. We talked about BPF Perf event array maps in the previous chapter. They allow you to put data in a ring buffer that’s synchronized in real time with user-space programs. This is ideal when you’re collecting a large amount of data in your BPF program and want to offload processing and visualization to a user-space program. That will allow you to have more control over the presentation layer because you’re not restricted by the BPF VM regarding programming capabilities. Most of the BPF tracing programs that you can find use Perf events only for this purpose.

Here, we show you how to use them to extract information about binary execution and classify that information to print which binaries are the most executed in your system. We’ve divided this example into two blocks of code so that you can easily follow the example. In the first block, we define our BPF program and attach it to a kprobe, like we did in “Probes”:

bpf_source = """
#include <uapi/linux/ptrace.h>

BPF_PERF_OUTPUT(events);		1

int do_sys_execve(struct pt_regs *ctx, void *filename, void *argv, void *envp) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));

  events.perf_submit(ctx, &comm, sizeof(comm));		2
  return 0;
}
"""

bpf = BPF(text = bpf_source)						3
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "do_sys_execve")

Later in this example, we import the Counter class from Python’s standard library. We’re going to use a Python counter to aggregate the events we’re receiving from our BPF program.

1

Use BPF_PERF_OUTPUT to declare a Perf events map. This is a convenient macro that BCC provides to declare this kind of map. We’re naming this map events.

2

After we have the name of the program that the kernel has executed, we send it to user-space for aggregation. We do that with perf_submit. This function updates the Perf events map with our new piece of information.

3

Initialize the BPF program and attach it to the kprobe to be triggered when a new program is executed in our system.

Now that we have written the code to collect all programs that are executed in our system, we need to aggregate them in user-space. There is a lot of information in the next code snippet, so we’re going to walk you through the most important lines:

from collections import Counter
aggregates = Counter()				1

def aggregate_programs(cpu, data, size):
  comm = bpf["events"].event(data)
  aggregates[comm] += 1				2

bpf["events"].open_perf_buffer(aggregate_programs)	3
while True:
  try:
    bpf.perf_buffer_poll()			4
  except KeyboardInterrupt:
    break

for (comm, times) in aggregates.most_common():		5
  print("Program {} executed {} times".format(comm, times))
1

Declare a counter to store our program information. We use the name of the program as the key, and the values will be counters. We use the aggregate_programs function to collect the data from the Perf events map. In this example, you can see how we use the BCC macro to access the map and extract the next incoming data event from the top of the stack.

2

Increment the number of times we’ve received an event with the same program name.

3

Use the function open_perf_buffer to tell BCC that it needs to execute the function aggregate_programs every time it receives an event from the Perf events map.

4

BCC polls events after opening the ring buffer until we interrupt this Python program. The longer you wait, the more information you’re going to process. You can see how we use perf_buffer_poll for this purpose.

5

Use the most_common function to get the list of elements in the counter and loop to print the top executed programs in your system first.

Perf events can open the door to processing all of the data that BPF exposes in novel and unexpected ways. We’ve shown you an example to inspire your imagination when you need to collect some kind of arbitrary data from the kernel; you can find many other examples in the tools that BCC provides for tracing.

Conclusion

In this chapter we’ve only scratched the surface of tracing with BPF. The Linux kernel gives you access to information that’s more difficult to obtain with other tools. BPF makes this process more predictable because it provides a common interface to access this data. In later chapters you’ll see more examples that use some of the techniques described here, such as attaching functions to tracepoints. They will help you cement what you’ve learned here.

We used the BCC framework in this chapter to write most of the examples. You can implement the same examples in C, like we did in previous chapters, but BCC provides several built-in features that make writing tracing programs much more accessible than C. If you’re up for a fun challenge, try rewriting these examples in C.

In the next chapter, we show you some tools that the systems community has built on top of BPF to do performance analysis and tracing. Writing your own programs is powerful, but these dedicated tools give you access to much of the information we’ve seen here in packaged format. This way, you don’t need to rewrite tools that already exist.
