This article (with some edits) also appeared on LWN.
Introduction
BCC (BPF Compiler Collection) is a toolkit and a suite of kernel tracing tools that allow systems engineers to efficiently and safely gain a deep understanding of the inner workings of a Linux system. Because they can’t crash the kernel, they are safer than kernel modules and can be run in production environments. Brendan Gregg has written nice tools and given talks showing the full power of eBPF-based tools. Unfortunately, BCC has no support for a cross-development workflow. I define “cross-development” as a development workflow in which the development machine and the target machine running the developed code are different. Cross-development is very typical among embedded-systems kernel developers, who often develop on a powerful x86 host and then flash and test their code on SoCs (Systems on Chip) based on the ARM architecture. The lack of a cross-development flow gives rise to several complications; let’s go over them and discuss a solution called BPFd that cleverly addresses this issue.
In the Android kernel team, we work mostly on ARM64 systems, since most Android devices use this architecture. Support for BCC tools on ARM64 systems stayed broken for years. One of the reasons for this difficulty is ARM64 inline assembler statements: kernel header includes in BCC tools unavoidably result in the inclusion of asm headers, which on ARM64 can spew inline ARM64 assembly instructions that cause major pains to LLVM’s BPF backend. Recently this issue was fixed by BPF inline assembly support (these LLVM commits), and folks could finally run BCC tools on ARM64, but...
In order for BCC tools to work at all, they need kernel sources. This is because most tools need to register callbacks on the ever-changing kernel API in order to get their data. Such callbacks are registered using the kprobe infrastructure. When a BCC tool is run, BCC switches its current directory into the kernel source directory before compilation starts, and compiles the C program that embodies the BCC tool’s logic. The C program is free to include kernel headers for kprobes to work and to use kernel data structures.
Even if one were not to use kprobes, BCC implicitly adds a common helpers.h include directive, found in src/cc/export/helpers.h in the BCC sources, whenever an eBPF C program is compiled. This helpers.h header uses the LINUX_VERSION_CODE macro to create a “version” section in the compiled output. LINUX_VERSION_CODE is available only in the sources of the specific kernel being targeted, and it is used during eBPF program loading to make sure the BPF program is being loaded into a kernel with the right version. As you can see, kernel sources quickly become mandatory for compiling eBPF programs.
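For a rough illustration, here is a minimal sketch (not the exact contents of helpers.h) of how a compiled eBPF object can carry that version check: the kernel’s version code is placed into a dedicated “version” ELF section that the loader compares against the running kernel.

#include <linux/version.h>

/* Embed the build-time kernel version in a "version" section of the
 * eBPF object so the loader can verify it at load time. */
unsigned int _version __attribute__((section("version"), used)) = LINUX_VERSION_CODE;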
In some sense this build process is similar to how external kernel modules are built. Kernel sources are large in size and often can take up a large amount of space on the system being debugged. They can also get out of sync, which may make the tools misbehave.
The other issue is that Clang and LLVM libraries need to be available on the target being traced. This is because the tools compile the needed BPF bytecode, which is then loaded into the kernel. These libraries take up a lot of space. It seems overkill to require a full-blown compiler infrastructure on a system when the BPF code could be compiled elsewhere, and maybe even compiled just once. Further, these libraries need to be cross-compiled to run on the architecture you’re tracing. That’s possible, but why would anyone want to do that if they didn’t need to? Cross-compiling compiler toolchains can be tedious and stressful.
BPFd: A daemon for running eBPF BCC tools across systems
Sources for BPFd can be downloaded here.
Instead of loading up all the tools, compiler infrastructure and kernel sources onto the remote targets being traced and running BCC that way, I decided to write a proxy program named BPFd that receives commands and performs them on behalf of whoever is requesting them. All the heavy lifting (compilation, parsing of user input, parsing of the hash maps, presentation of results, etc.) is done by BCC tools on the host machine, with BPFd running on the target and acting as the interface to the target kernel. BPFd encapsulates everything BCC needs and performs it: this includes loading a BPF program; creating, deleting and looking up maps; attaching an eBPF program to a kprobe; polling for new data that the eBPF program may have written into a perf buffer; and so on. If it’s woken up because the perf buffer contains new data, it informs the BCC tools on the host about it, or it can return map data whenever requested, which may contain information updated by the target eBPF program.
Simple design
Before this work, the BCC tools architecture was as follows:
BPFd-based invocations partition this, thus making it possible to do cross-development and execution of the tools across machine and architecture boundaries. For instance, the kernel sources that the BCC tools depend on can be on a development machine, with eBPF code being loaded onto a remote machine. This partitioning is illustrated in the following diagram:
The design of BPFd is quite simple: it expects commands on stdin (standard input) and provides the results over stdout (standard output). Every command is always a single line, no matter how big the command is. This allows easy testing using cat, since one could simply cat a file with commands and check whether BPFd’s stdout contains the expected results. Results from a command, however, can span multiple lines.
BPF maps are data structures that a BPF program uses to store data which can be retrieved at a later time. Maps are represented by a file descriptor returned by the bpf system call once the map has been successfully created.
For example, the following is a command to BPFd for creating a BPF hash table map.
BPF_CREATE_MAP 1 count 8 40 10240 0
And the result from BPFd is:
bpf_create_map: ret=3
Since BPFd is proxying the map creation, the file descriptor (3 in this example) is mapped into BPFd's file descriptor table. The file descriptor can be used later to look up entries that the BPF program in the kernel may have created, or to clear all entries in the map, as is done by tools that periodically clear the accounting done by a BPF program.
The BPF_CREATE_MAP command in this example tells BPFd to create a map named count with map type 1 (a hash table map), with a key size of 8 bytes, a value size of 40 bytes, a maximum of 10240 entries and no special flags. BPFd created the map, identified by file descriptor 3.
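To give a sense of what this proxied operation boils down to on the target, here is a minimal C sketch (using the raw bpf(2) system call directly, not BPFd’s actual code) that creates the same kind of hash table map:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
        union bpf_attr attr;
        int fd;

        /* Parameters taken from the BPF_CREATE_MAP example above. */
        memset(&attr, 0, sizeof(attr));
        attr.map_type = BPF_MAP_TYPE_HASH;   /* map type 1 */
        attr.key_size = 8;
        attr.value_size = 40;
        attr.max_entries = 10240;
        attr.map_flags = 0;

        fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        printf("bpf_create_map: ret=%d\n", fd);
        return 0;
}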
With the simple standard input/output design, it’s possible to write wrappers around BPFd to handle more advanced communication methods such as USB or networking. As a part of my analysis work in the Android kernel team, I am communicating these commands over the Android Debug Bridge, which interfaces with the target device over either USB or TCP/IP. I have shared several demos below.
Changes to the BCC project for working with BPFd
BCC needed several changes to be able to talk to BPFd over a remote connection. All these changes are available here and will be pushed upstream soon.
Following are all the BCC modifications that have been made:
Support for remote communication with BPFd such as over the network
A new remotes module has been added to BCC tools with an abstraction that different remote types, such as networking or USB, must implement. This keeps code duplication to a minimum. By implementing the functions needed for a remote, a new communication method can easily be added. Currently an adb remote and a process remote are provided. The adb remote is for communication with the target device over USB or TCP/IP using the Android Debug Bridge. The process remote is probably useful just for local testing: with the process remote, BPFd is forked on the same machine running BCC and communicates with it over stdin and stdout.
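To make the protocol concrete, here is a rough C sketch of what a process-style remote amounts to. BCC’s actual remotes module is written in Python; the bpfd binary is assumed to be in PATH, and the command is the map-creation example from earlier:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int to_bpfd[2], from_bpfd[2];
        pid_t pid;

        if (pipe(to_bpfd) || pipe(from_bpfd))
                return 1;

        pid = fork();
        if (pid == 0) {
                /* Child: wire the pipes to stdin/stdout and run bpfd. */
                dup2(to_bpfd[0], STDIN_FILENO);
                dup2(from_bpfd[1], STDOUT_FILENO);
                close(to_bpfd[1]);
                close(from_bpfd[0]);
                execlp("bpfd", "bpfd", (char *)NULL);
                _exit(127);
        }
        close(to_bpfd[0]);
        close(from_bpfd[1]);

        /* Every command is a single line; send one and read the reply. */
        const char *cmd = "BPF_CREATE_MAP 1 count 8 40 10240 0\n";
        char buf[256];
        ssize_t n;

        if (write(to_bpfd[1], cmd, strlen(cmd)) < 0)
                return 1;
        n = read(from_bpfd[0], buf, sizeof(buf) - 1);
        if (n > 0) {
                buf[n] = '\0';
                printf("bpfd replied: %s", buf);
        }
        return 0;
}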
Changes to BCC to send commands to the remote BPFd
libbpf.c is the main C file in the BCC project that talks to the kernel for all things BPF. This is illustrated in the diagram above. In order to make BCC perform BPF operations on the remote machine instead of the local machine, the parts of BCC that make calls into the local libbpf.c are now instead channeled to the remote BPFd on the target. BPFd on the target then performs the commands on behalf of BCC running locally, by calling into its own copy of libbpf.c.
One of the tricky parts of making this work is that not only calls to libbpf.c, but certain other paths as well, need to be channeled to the remote machine. For example, to attach to a tracepoint, BCC needs a list of all available tracepoints on the system. This list has to be obtained on the remote system, not the local one, which is exactly why the GET_TRACE_EVENTS command exists in BPFd.
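On the target, that list typically comes from tracefs. The following is a minimal sketch of what serving such a request involves (assuming tracefs is mounted at the usual debugfs location; BPFd’s actual implementation may differ):

#include <stdio.h>

int main(void)
{
        /* available_events lists "subsystem:event" names, one per line. */
        FILE *f = fopen("/sys/kernel/debug/tracing/available_events", "r");
        char line[256];

        if (!f) {
                perror("available_events");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}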
Making the kernel build for the correct target processor architecture
When BCC compiles the C program encapsulated in a BCC tool into eBPF instructions, it assumes that the eBPF program will run on the same processor architecture that BCC is running on. That assumption does not hold when building the eBPF program for a different target.
Some time ago, before I started this project, I changed this when building the in-kernel eBPF samples (which are simple standalone samples and unrelated to BCC). Now, I have had to make a similar change to BCC so that it compiles the C program correctly for the target architecture.
Installation
Try it out for yourself! Follow the Detailed or Simple instructions. Also, apply this kernel patch to make it faster to run tools like offcputime. I am submitting this patch to LKML as we speak.
BPF Demos: examples of BCC tools running on Android
Running filetop
filetop is a BCC tool that shows all read/write I/O operations, with an experience similar to the top tool. It refreshes every few seconds, giving you a live view of these operations.
Go to your bcc directory and set the needed environment variables. For Android running on a Hikey960, I run:
joel@ubuntu:~/bcc# source arm64-adb.rc
which basically sets the following environment variables:
export ARCH=arm64
export BCC_KERNEL_SOURCE=/home/joel/sdb/hikey-kernel/
export BCC_REMOTE=adb
You could also use the convenient bcc-set script provided in BPFd sources to set these environment variables for you. Check INSTALL.md file in BPFd sources for more information.
Next I start filetop:
joel@ubuntu:~/bcc# ./tools/filetop.py 5
This tells the tool to monitor file I/O every 5 seconds.
While filetop is running, I start the stock email app in Android and the output looks like:
Tracing... Output every 5 secs. Hit Ctrl-C to end
13:29:25 loadavg: 0.33 0.23 0.15 2/446 2931
TID COMM READS WRITES R_Kb W_Kb T FILE
3787 Binder:2985_8 44 0 140 0 R profile.db
3792 m.android.email 89 0 130 0 R Email.apk
3813 AsyncTask #3 29 0 48 0 R EmailProvider.db
3808 SharedPreferenc 1 0 16 0 R AndroidMail.Main.xml
3792 m.android.email 2 0 16 0 R deviceName
3815 SharedPreferenc 1 0 16 0 R MailAppProvider.xml
3813 AsyncTask #3 8 0 12 0 R EmailProviderBody.db
2434 WifiService 4 0 4 0 R iface_stat_fmt
3792 m.android.email 66 0 2 0 R framework-res.apk
Notice the Email.apk being read by Android to load the email application, and then various other reads happening related to the email app. Finally, WifiService continuously reads iface_stat_fmt to get network statistics for Android accounting.
Running biosnoop
Biosnoop is another great tool that shows you block-level I/O operations (bio) happening on the system, along with the latency and size of each operation. The following is sample output from running tools/biosnoop.py while doing random things in the Android system.
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 jbd2/sdd13-8 2135 sdd W 37414248 28672 1.90
0.001563000 jbd2/sdd13-8 2135 sdd W 37414304 4096 0.43
0.003715000 jbd2/sdd13-8 2135 sdd R 20648736 4096 1.94
5.119298000 kworker/u16:1 3848 sdd W 11968512 8192 1.72
5.119421000 kworker/u16:1 3848 sdd W 20357128 4096 1.80
5.448831000 SettingsProvid 2415 sdd W 20648752 8192 1.70
Running hardirqs
This tool measures the total time taken by different hardirqs in the system. Excessive time spent in hard interrupt handlers can result in poor real-time performance of the system.
joel@ubuntu:~/bcc# ./tools/hardirqs.py
Output:
Tracing hard irq event time... Hit Ctrl-C to end.
HARDIRQ TOTAL_usecs
wl18xx 232
dw-mci 1066
e82c0000.mali 8514
kirin 9977
timer 22384
Running biotop
Run biotop while launching the Android Gallery app and doing random stuff:
joel@ubuntu:~/bcc# ./tools/biotop.py
Output:
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
4524 droid.gallery3d R 8 48 ? 33 1744 0.51
2135 jbd2/sdd13-8 W 8 48 ? 15 356 0.32
4313 kworker/u16:4 W 8 48 ? 26 232 1.61
4529 Jit thread pool R 8 48 ? 4 184 0.27
2135 jbd2/sdd13-8 R 8 48 ? 7 68 2.19
2459 LazyTaskWriterT W 8 48 ? 3 12 1.77
Open issues as of this writing
While most issues have been fixed, a few remain. Please check the issue tracker and contribute patches or help by testing.
Other use cases for BPFd
While the main use case at the moment is easier use of BCC tools in cross-development setups, another potential use case that’s gaining interest is easy loading of a BPF program. The BPF program can be stored on disk in base64 format and sent to bpfd using something as simple as:
joel@ubuntu:~/bpfprogs# cat my_bpf_prog.base64 | bpfd
In the Android kernel team, for certain use cases that need eBPF, we are also experimenting with loading a program from a forked BPFd instance, creating maps, pinning them for use at a later time after BPFd exits, and then killing the BPFd fork since it’s done. Creating a separate process (fork/exec of BPFd) and having it load the eBPF program for you has the distinct advantage that runtime fix-up of map file descriptors isn’t needed in the loaded eBPF machine instructions. In other words, the eBPF program’s instructions can be pre-determined and statically loaded. The reason for this convenience is that BPFd starts with the same number of file descriptors each time before the first map is created.
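For reference, pinning itself is just another bpf(2) operation. Here is a minimal sketch (the pin path is hypothetical and requires a mounted BPF filesystem; this is not BPFd’s actual code):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf_obj_pin(int fd, const char *path)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.bpf_fd = fd;
        attr.pathname = (__u64)(unsigned long)path;
        return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

int main(void)
{
        int map_fd = 3; /* e.g. the fd returned for the "count" map earlier */

        /* Pin the map so it survives after the BPFd fork exits. */
        if (bpf_obj_pin(map_fd, "/sys/fs/bpf/count"))
                perror("BPF_OBJ_PIN");
        return 0;
}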
Conclusion
Building code for instrumentation on a different machine than the one actually running the debugging code is beneficial, and BPFd makes this possible. Alternatively, one could write tracing code in their own kernel module on a development machine, copy it over to a remote target, and do similar tracing/debugging. However, this is quite unsafe, since kernel modules can crash the kernel. eBPF programs, on the other hand, are verified before they’re run and are guaranteed to be safe when loaded into the kernel, unlike kernel modules. Furthermore, the BCC project offers great support for parsing the output of maps, processing it and presenting results, all using the friendly Python programming language. BCC tools are quite promising and could be the future for easier and safer deep tracing endeavours. BPFd can hopefully make it even easier to run these tools for folks such as embedded-systems and Android developers, who typically compile their kernels on a local machine and run them on a non-local target machine.
If you have any questions, feel free to reach out to me or drop me a note in the comments section.