eBPF: Intro
Way back in the day, some BSD people decided that they needed an efficient way to filter packets, one that is more granular and provides more flexibility than simple header-based rules. The Berkeley Packet Filter turned out to be a good idea, and eventually support for the BPF interface landed in the Linux kernel, as well as in several other operating systems.
Since around the 3.18 release, the Linux kernel has included an extended version of this virtual machine, called eBPF. The extensions allow users to attach very limited programs to an increasing number of kernel hook points, in a way that's supposed to be secure and performant. These programs don't need to be networking-focused: using eBPF, we can trace the entire system's file access patterns, memory allocations, and many other aspects.
So that the in-kernel verifier can prove that every program terminates, the eBPF VM does not allow Turing-complete programs to be run. The Cilium project has exhaustive documentation on the internals of the BPF VM and its specific limitations; the most prominent ones are the following:
- No loops (the instruction counter can only increase)
- Only 4096 instructions per program
- 512 bytes of stack space
- No global state (only through maps)
- A program needs to exit (using `return`)
- All mapped memory access needs to be statically checked before the program is accepted
While we can use C to write code for the BPF VM, only LLVM supports it as a target platform at the moment, documentation is mostly available in source-code form, and it is often tricky to wade through the performance considerations of how to extract data from the kernel.
The tooling originally provided by the kernel has sparse documentation, although `bpf_dbg`, `bpf_asm`, and `bpftool` are available under `tools/bpf`. These allow us to debug, compile from assembly, and manage BPF-related resources, respectively.
Possibly one of the more interesting libraries to explore can be found under `tools/lib/bpf/libbpf.c`, which can be used directly in any software licensed in a way that's compatible with LGPL 2. It includes a thin layer to load BPF program code from ELF binaries, as well as tools to manage the lifecycle of the resources.
To make things simpler, the BCC project also includes a component called `libbpf`, which provides a different set of functionality. Although there is some overlap, the ELF parser is notably missing, the interfaces are different, and the library provides bindings to many high-level operations in eBPF-land.
As a heavyweight framework, BCC provides a full toolchain to work with BPF modules, complete with helper functions that make our BPF code more legible, and bindings for Python, Lua, and C++ that provide high-level interfaces to the compile-load-attach lifecycle, as well as other aspects of interacting with our in-kernel probes.
Since using global state directly within eBPF probes is not allowed, maps have to be used for this purpose. On top of that, maps will help a BPF probe communicate with userspace, too.
For ease of use, maps are usually declared as global variables in C source code, which then get their own sections in the ELF binary. Userspace has to initialise the kernel structures using the `bpf_create_map()` call, and follow the ELF relocation instructions to insert the resulting file descriptors into the BPF bytecode at the appropriate places.
There are a few types of maps that will be of interest to everyone, and others that are predominantly going to be used in specific use cases. I'm going to focus on the basics here, but you'll quickly get the idea. For a full list, consult `bpf.h` and the examples in the kernel source:
- `BPF_MAP_TYPE_HASH`: a basic hash table, with keys and values.
- `BPF_MAP_TYPE_PERCPU_HASH`: a hash table that is core-local for performance; only userspace has a full view of it.
- `BPF_MAP_TYPE_PROG_ARRAY`: a jump table for `bpf_tail_call()`. It has to be populated from userspace with program fds.
- `BPF_MAP_TYPE_LRU_HASH`: a hash map with a least-recently-used eviction policy.
- `BPF_MAP_TYPE_PERF_EVENT_ARRAY`: sends events to userspace through the `perf_event` subsystem's ring buffers, using the `bpf_perf_event_output()` call.
While we can always do some bookkeeping with the eBPF probes, sometimes hash maps are also a good way to share data with userspace. The `bpf_get_next_key()` call is exposed to allow iterating through the keys, and `bpf_lookup_elem()` can be used to copy the value into a target buffer.
There are quite a few types of eBPF modules, and every kernel release will increase the places where they can be used. It's extended BPF, after all.
The catch is that some of these probe types take a different context object as their parameter. They also have different eBPF calls available, and sometimes different ways to access memory outside the BPF VM.
I am going to focus on the types I'm most familiar with, and the ones which will most likely help you get your feet wet with eBPF modules, because they are applicable to a wide range of use cases.
A lot of BPF's use cases are in networking, but some more interesting possibilities open up when we use the `perf_event` interface to stream events into userspace.
Kprobes take a `struct pt_regs *` structure as their context object. Access to the function call parameters is done through the `PT_REGS_PARMx()` macros; however, depending on the hook point being used, only certain information is available. They actually come in two flavours: kprobes bind to the entry points of functions, while kretprobes bind to the exit points. The function arguments are only accessible to kprobes, while the return value is only available to a kretprobe. To get an overarching view of what happened in the function body, we need to use maps to store some state between the kprobe and kretprobe invocations.
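A sketch of that entry/exit pattern, written in BCC-style C (the `BPF_HASH` macro and map methods come from BCC's headers, and the two functions would be attached to a kprobe and kretprobe via BCC's bindings; this fragment needs the BCC toolchain to build and load, so treat it as illustrative):

```c
#include <uapi/linux/ptrace.h>

BPF_HASH(start_ts, u32, u64);  // pid -> entry timestamp

// kprobe: runs on entry to the traced function
int trace_entry(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start_ts.update(&pid, &ts);   // stash state for the kretprobe
    return 0;
}

// kretprobe: runs on exit from the same function
int trace_return(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start_ts.lookup(&pid);
    if (tsp == 0)
        return 0;                 // missed the entry, nothing to do
    u64 delta = bpf_ktime_get_ns() - *tsp;
    bpf_trace_printk("call took %llu ns\n", delta);
    start_ts.delete(&pid);
    return 0;
}
```

Deleting the entry at the end keeps the map from filling up with pids whose kretprobe never fired.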
In order to access any data that comes in through function parameters, the `bpf_probe_read()` call needs to be used to copy it into the BPF VM.
Since the stack space is limited to 512 bytes, and there are no function calls, it is often not possible to pull in some larger kernel structures, so creative use of pointers may be necessary to get the piece of data we need.
It's also worth pointing out that since function calls are not allowed, the stack space can't be cleared during the execution of a program. If anything to this effect is required, tail calls using `bpf_tail_call()` are a possible option, but carrying over state has to be done using maps, as function parameters can't be passed down to the callee.
For the most efficient input packet filtering, the eXpress Data Path (XDP) is going to be the best choice. One obvious gotcha here is that, well, it only works for incoming packets: we can't tap into what the machine sends through the monitored interface.
On the flip side, the packet buffer is mapped into the BPF VM's address space, so with some clever pointer arithmetic, we can parse any kind of incoming packet. Memory access, as always, needs to be statically checked, which is going to bump up the instruction count.
Because XDP probes can work as packet filters, the return value here actually makes a difference. `XDP_PASS` will allow a packet to go through, while `XDP_DROP` will, uhm, drop it. Just like with the kprobes, it is possible to extract information at high performance using the `perf_event` subsystem.
The socket filters here are very different in their use from the previous two. While we bind XDP to an interface and kprobes to a function call, socket filters work on all sorts of sockets.
One way is using a `SOCK_RAW` socket to tap into the traffic going through an interface, and forward that into another socket. Another way would be inserting a BPF filter into a simple, say, TCP socket that is going to receive HTTP traffic, to filter the traffic before it gets to our application.
Socket filters, curiously, cannot use the `perf_event` interface to send data into userspace; however, in the right configuration, they will send data through their own socket. The BPF filter can select which packets to forward to userspace, which then has to start parsing again, as no additional info can be sent along with the raw packet data.
Memory access is best done through the `load_byte()`, `load_half()`, and `load_word()` calls within the BPF filter, which also means that all access is checked, but we need to load things word by word. I have seen examples of more advanced memory management in socket filters, but I could never get anything past the BPF verifier.
- Cilium documentation
- Ferris Ellis's eBPF, part 1 and part 2
- BCC (BPF Compiler Collection)
- GoBPF
- BPF Internals I and II
- Search for keywords and constants on LWN.net; you will probably find something.
- Linux kernel source