eBPF: Intro
Way back in the day, some BSD people decided that they needed an efficient way to filter packets, one that is more granular and provides more flexibility than simple header-based rules. The Berkeley Packet Filter turned out to be a good idea, and eventually support for the BPF interface landed in the Linux kernel, as well as in several other operating systems.
Since around the 3.18 release, the Linux kernel has included an extended version of this virtual machine, called eBPF. The extensions allow users to attach very limited programs to an increasing number of kernel hook points, in a way that's supposed to be secure and performant. These programs don't need to be networking-focused: using eBPF, we can trace the entire system's file access patterns, memory allocations, and many other aspects.
So that the in-kernel verifier can prove that every program terminates, the eBPF VM does not allow Turing-complete programs to be run. The Cilium project has exhaustive documentation on the internals of the BPF VM and its specific limitations; the most prominent ones are the following:
- No loops (the instruction counter can only increase)
- Only 4096 instructions per program
- 512 bytes of stack space
- No global state (only through maps)
- A program needs to exit (using `return`)
- All mapped memory access needs to be statically checked before the program is accepted
While we can use C to write code for the BPF VM, only LLVM supports it as a target platform at the moment, documentation is mostly available in source-code form, and it is often tricky to wade through the performance considerations of how to extract data from the kernel.
The tooling originally provided by the kernel has sparse documentation, although `bpf_dbg`, `bpf_asm`, and `bpftool` are available under `tools/bpf`. These allow us to debug, compile from assembly, and manage BPF-related resources, respectively.
Possibly one of the more interesting libraries to explore can be found under `tools/lib/bpf/libbpf.c`, which can be used directly in any software licensed in a way that's compatible with LGPL 2. It includes a thin layer to load BPF program code from ELF binaries, as well as tools to manage the lifecycle of the resources.
To make things simpler, the BCC project also includes a component called `libbpf`, which provides a different set of functionality. Although there is some overlap, the ELF parser is notably missing, the interfaces are different, and the library provides bindings to many high-level operations in eBPF-land.
As a heavyweight framework, BCC provides a full toolchain to work with BPF modules, complete with helper functions that make our BPF code more legible, and bindings for Python, Lua, and C++ that provide high-level interfaces to the compile-load-attach lifecycle, as well as other aspects of interacting with our in-kernel probes.
Since using global state directly within eBPF probes is not allowed, maps have to be used for this purpose. On top of that, maps will help a BPF probe communicate with userspace, too.
For ease of use, maps are usually declared as global variables in C source code, which then get their own sections in the ELF binary. Userspace has to initialise the kernel structures using the `bpf_create_map()` call, and follow the ELF relocation instructions to insert the resulting file descriptors into the BPF bytecode at the appropriate places.
There are a few types of maps that will be of interest to everyone, and others that are predominantly going to be used in specific use cases. I'm going to focus on the basics here, but you'll quickly get the idea. For a full list, consult `bpf.h` and the examples in the kernel source:
- `BPF_MAP_TYPE_HASH`: a basic hash table, with keys and values.
- `BPF_MAP_TYPE_PERCPU_HASH`: a hash table that is core-local for performance; only userspace has a full view of it.
- `BPF_MAP_TYPE_PROG_ARRAY`: a jump table for `bpf_tail_call()`. It has to be populated from userspace with program fds.
- `BPF_MAP_TYPE_LRU_HASH`: a hash map with a least-recently-used eviction policy.
- `BPF_MAP_TYPE_PERF_EVENT_ARRAY`: sends events to userspace through the `perf_event` subsystem's ring buffers, using the `bpf_perf_event_output()` call.
While we can always do some bookkeeping with the eBPF probes, sometimes hash maps are also a good way to share data with userspace. The `bpf_get_next_key()` call is exposed to allow iterating through the keys, and `bpf_lookup_elem()` can be used to copy the value into a target buffer.
There are quite a few types of eBPF modules, and every kernel release will increase the places where they can be used. It's extended BPF, after all.
The catch is that some of these probe types take a different context object as their parameter. They also have different eBPF calls available, and sometimes different ways to access memory outside the BPF VM.
I am going to focus on the types I'm most familiar with, and the ones which will most likely help you get your feet wet with eBPF modules, because they are applicable to a wide range of use cases.
A lot of BPF's use cases are in networking, but some more interesting possibilities open up when we use the `perf_event` interface to stream events into userspace.
Kprobes take a `struct pt_regs *` structure as their context object. Access to the function call parameters is done through the `PT_REGS_PARMx()` macros; however, depending on the hook point being used, only certain information is available. They actually come in two flavours: kprobes bind to the entry points of functions, while kretprobes bind to the exit points. The function arguments are only accessible to kprobes, while the return value is only available to a kretprobe. To get an overarching view of what happened in the function body, we need to use maps to store some state between the kprobe and kretprobe invocations.
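A sketch of that entry/exit pattern, written in BCC-style C (the `BPF_HASH` macro and map methods come from BCC's headers, and the two functions would be attached to a kprobe and kretprobe via BCC's bindings; this fragment needs the BCC toolchain to build and load, so treat it as illustrative):

```c
#include <uapi/linux/ptrace.h>

BPF_HASH(start_ts, u32, u64);  // pid -> entry timestamp

// kprobe: runs on entry to the traced function
int trace_entry(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start_ts.update(&pid, &ts);   // stash state for the kretprobe
    return 0;
}

// kretprobe: runs on exit from the same function
int trace_return(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start_ts.lookup(&pid);
    if (tsp == 0)
        return 0;                 // missed the entry, nothing to do
    u64 delta = bpf_ktime_get_ns() - *tsp;
    bpf_trace_printk("call took %llu ns\n", delta);
    start_ts.delete(&pid);
    return 0;
}
```

Deleting the entry at the end keeps the map from filling up with pids whose kretprobe never fired.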
In order to access any data that comes in through function parameters, the `bpf_probe_read()` call needs to be used to copy it into the BPF VM.
Since the stack space is limited to 512 bytes, and there are no function calls, it is often not possible to pull in some larger kernel structures, so creative use of pointers may be necessary to get the piece of data we need.
It's also worth pointing out that since function calls are not allowed, the stack space can't be cleared during the execution of a program. If anything to this effect is required, tail calls using `bpf_tail_call()` are a possible option, but carrying over state has to be done using maps, as function parameters can't be passed down to the callee.
For the most efficient input packet filtering, the eXpress Data Path (XDP) is going to be the best choice. One obvious gotcha here is that, well, it only works for incoming packets: we can't tap into what the machine sends through the monitored interface.
On the flip side, the packet buffer is mapped into the BPF VM's address space, so with some clever pointer arithmetic, we can parse any kind of incoming packet. Memory access, as always, needs to be statically checked, which is going to bump up the instruction count.
Because XDP probes can work as packet filters, the return value here actually makes a difference. `XDP_PASS` will allow a packet to go through, while `XDP_DROP` will, uhm, drop it. Just like with the kprobes, it is possible to extract information at high performance using the `perf_event` subsystem.
The socket filters here are very different in their use from the previous two. While we bind XDP to an interface and kprobes to a function call, socket filters work on all sorts of sockets.
One way is using a `SOCK_RAW` socket to tap into the traffic going through an interface, and forward that into another socket. Another way would be inserting a BPF filter into a simple, say, TCP socket that is going to receive HTTP traffic, to filter the traffic before it gets to our application.
Socket filters, curiously, cannot use the `perf_event` interface to send data into userspace; however, in the right configuration, they will send data through their own socket. The BPF filter can select which packets to forward to userspace, which then has to start parsing again, as no additional info can be sent along with the raw packet data.
Memory access is best done through the `load_byte()`, `load_half()`, and `load_word()` calls within the BPF filter, which also means that all access is checked, but we need to load things word by word. I have seen examples of more advanced memory management in socket filters, but I could never get anything past the BPF verifier.
- Cilium documentation
- Ferris Ellis's eBPF, part 1 and part 2
- BCC (BPF Compiler Collection)
- GoBPF
- BPF Internals I and II
- Search for keywords and constants on LWN.net; you will probably find something.
- Linux kernel source