I've been taking some time to study the BPF Performance Tools book by Brendan Gregg.
There have been a lot of partial resources and tips about using BPF, and it has already helped me find an epic memory leak. I'd always wanted a chance to learn it more deeply, and the release of Gregg's book was a good reason to. My colleague Dale Hamel from Shopify's ProdEng team helped review the book too, which only pumped up my interest in the technology.
When it comes to observability, the biggest impact (IMO) of being able to trace something quickly and find the root cause comes during incidents, when it's not clear what's happening and pulling the right observability tool out of your hat is the top skill. That's the muscle I want to build from the book, and writing notes in a "what I've learned" format will hopefully help.
This post is a collection of notes taken while getting my hands dirty with each of the tools, following along with the book.
0. Getting bpfcc-tools ready
Turns out that apt-get install bpfcc-tools on Ubuntu 18.04 Bionic is broken as I'm writing this in December 2019. Thankfully it's fixed upstream, and I documented the steps to install a nightly package.
1. execsnoop
execsnoop prints all processes that were launched (i.e. it traces exec() syscalls). In real life, it can come in handy for inspecting external calls made by an application server (for instance, shelling out to imagemagick).
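As a sketch of that use case: bcc's execsnoop has a -n flag that filters by command name, so if I suspected ImageMagick shell-outs (the convert binary is just my guess here), I could scope the trace to them:

```
# only show exec()s whose command name matches "convert"
# (-n is bcc execsnoop's name filter; swap in whatever binary you suspect)
sudo /usr/share/bcc/tools/execsnoop -n convert
```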
Steps I've taken:
- Open a terminal tab, SSH to the devbox, and run sudo /usr/share/bcc/tools/execsnoop
- SSH to the same VM from another terminal tab and observe execsnoop (in tab 1) printing all the commands that were executed during the login:
```
PCOMM PID PPID RET ARGS
sshd 5429 1430 0 /usr/sbin/sshd -D -R
sh 5431 5429 0
env 5432 5431 0 /usr/bin/env -i PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin run-parts --lsbsysinit /etc/update-motd.d
run-parts 5432 5431 0 /bin/run-parts --lsbsysinit /etc/update-motd.d
00-header 5433 5432 0 /etc/update-motd.d/00-header
uname 5434 5433 0 /bin/uname -o
uname 5435 5433 0 /bin/uname -r
uname 5436 5433 0 /bin/uname -m
10-help-text 5437 5432 0 /etc/update-motd.d/10-help-text
50-landscape-sy 5438 5432 0 /etc/update-motd.d/50-landscape-sysinfo
grep 5439 5438 0 /bin/grep -c ^processor /proc/cpuinfo
cut 5443 5441 0 /usr/bin/cut -f1 -d /proc/loadavg
bc 5442 5440 0 /usr/bin/bc
date 5444 5438 0 /bin/date
landscape-sysin 5445 5438 0 /usr/bin/landscape-sysinfo
ldconfig 5446 5445 0 /sbin/ldconfig -p
ldconfig.real 5446 5445 0 /sbin/ldconfig.real -p
ldconfig 5447 5445 0 /sbin/ldconfig -p
ldconfig.real 5447 5445 0 /sbin/ldconfig.real -p
who 5449 5445 0 /usr/bin/who -q
50-motd-news 5450 5432 0 /etc/update-motd.d/50-motd-news
cat 5451 5450 0 /bin/cat /var/cache/motd-news
tr 5453 5450 0 /usr/bin/tr -d \000-\011\013\014\016-\037
head 5452 5450 0 /usr/bin/head -n 10
cut 5454 5450 0 /usr/bin/cut -c -80
80-esm 5455 5432 0 /etc/update-motd.d/80-esm
lsb_release 5456 5455 0 /usr/bin/lsb_release -cs
lsb_release 5457 5455 0 /usr/bin/lsb_release -ds
80-livepatch 5458 5432 0 /etc/update-motd.d/80-livepatch
90-updates-avai 5459 5432 0 /etc/update-motd.d/90-updates-available
cat 5460 5459 0 /bin/cat /var/lib/update-notifier/updates-available
91-release-upgr 5461 5432 0 /etc/update-motd.d/91-release-upgrade
cut 5464 5462 0 /usr/bin/cut -d -f4
lsb_release 5463 5462 0 /usr/bin/lsb_release -sd
release-upgrade 5461 5432 0 /usr/lib/ubuntu-release-upgrader/release-upgrade-motd
date 5465 5461 0 /bin/date +%s
stat 5466 5461 0 /usr/bin/stat -c %Y /var/lib/ubuntu-release-upgrader/release-upgrade-available
expr 5467 5461 0 /usr/bin/expr 1577101965 + 86400
95-hwe-eol 5468 5432 0 /etc/update-motd.d/95-hwe-eol
update-motd-hwe 5468 5432 0 /usr/lib/update-notifier/update-motd-hwe-eol
apt-config 5469 5468 0 /usr/bin/apt-config shell StateDir Dir::State
dpkg 5470 5469 0 /usr/bin/dpkg --print-foreign-architectures
apt-config 5471 5468 0 /usr/bin/apt-config shell ListDir Dir::State::Lists
dpkg 5472 5471 0 /usr/bin/dpkg --print-foreign-architectures
apt-config 5473 5468 0 /usr/bin/apt-config shell DpkgStatus Dir::State::status
dpkg 5474 5473 0 /usr/bin/dpkg --print-foreign-architectures
apt-config 5475 5468 0 /usr/bin/apt-config shell EtcDir Dir::Etc
dpkg 5476 5475 0 /usr/bin/dpkg --print-foreign-architectures
apt-config 5477 5468 0 /usr/bin/apt-config shell SourceList Dir::Etc::sourcelist
dpkg 5478 5477 0 /usr/bin/dpkg --print-foreign-architectures
find 5479 5468 0 /usr/bin/find /var/lib/apt/lists/ /etc/apt/sources.list //var/lib/dpkg/status -type f -newer /var/lib/update-notifier/hwe-eol -print -quit
dirname 5481 5480 0 /usr/bin/dirname /var/lib/update-notifier/hwe-eol
mktemp 5480 5468 0 /bin/mktemp -p /var/lib/update-notifier
cat 5482 5468 0 /bin/cat /var/lib/update-notifier/hwe-eol
rm 5483 5468 0 /bin/rm -f /var/lib/update-notifier/tmp.RjVbtK5reS
97-overlayroot 5484 5432 0 /etc/update-motd.d/97-overlayroot
egrep 5486 5485 0 /bin/egrep overlayroot|/media/root-ro|/media/root-rw /proc/mounts
sort 5487 5485 0 /usr/bin/sort -r
grep 5486 5485 0 /bin/grep -E overlayroot|/media/root-ro|/media/root-rw /proc/mounts
98-fsck-at-rebo 5488 5432 0 /etc/update-motd.d/98-fsck-at-reboot
update-motd-fsc 5488 5432 0 /usr/lib/update-notifier/update-motd-fsck-at-reboot
stat 5489 5488 0 /usr/bin/stat -c %Y /var/lib/update-notifier/fsck-at-reboot
awk 5491 5490 0 /usr/bin/awk {print $1} /proc/uptime
date 5490 5488 0 /bin/date -d now - 1614.44 seconds +%s
date 5492 5488 0 /bin/date +%s
cat 5493 5488 0 /bin/cat /var/lib/update-notifier/fsck-at-reboot
98-reboot-requi 5494 5432 0 /etc/update-motd.d/98-reboot-required
update-motd-reb 5494 5432 0 /usr/lib/update-notifier/update-motd-reboot-required
bash 5496 5495 0 /bin/bash
groups 5498 5497 0 /usr/bin/groups
locale-check 5500 5499 0 /usr/bin/locale-check C.UTF-8
locale 5501 5496 0 /usr/bin/locale
lesspipe 5504 5503 0 /usr/bin/lesspipe
basename 5505 5504 0 /usr/bin/basename /usr/bin/lesspipe
dirname 5507 5506 0 /usr/bin/dirname /usr/bin/lesspipe
dircolors 5509 5508 0 /usr/bin/dircolors -b
```
Turns out there's a lot of stuff involved when you SSH in and log into a terminal session.
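For comparison with the raw bpftrace syntax that shows up later in these notes, bpftrace documents a close equivalent as one of its one-liners; it traces the same execve() tracepoint and prints the full argument vector:

```
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'
```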
2. biolatency
biolatency prints stats about block I/O latency as a histogram. In the past, I've seen my colleagues use it to debug slow disks in the cloud, which can be especially critical for latency-sensitive workloads like databases.
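A couple of invocation variants that seemed worth remembering (a sketch based on bcc's biolatency options: -D splits the histogram per disk, -m switches buckets to milliseconds, and trailing interval/count arguments print periodic summaries):

```
# one histogram per disk, as used below
sudo /usr/share/bcc/tools/biolatency -D

# millisecond buckets, printed every 10 seconds, 5 times
sudo /usr/share/bcc/tools/biolatency -m 10 5
```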
Steps I've taken:
- Install fio (an I/O tester) on a VM with a standard disk (pd-standard on Google Cloud) and run a sample test:

```
$ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
```

- Run biolatency in another tab:
```
$ sudo /usr/share/bcc/tools/biolatency -D
Tracing block device I/O... Hit Ctrl-C to end.
^C

disk = 'sda'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 1        |                                        |
        32 -> 63         : 14       |                                        |
        64 -> 127        : 7        |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 8        |                                        |
      1024 -> 2047       : 9        |                                        |
      2048 -> 4095       : 30       |                                        |
      4096 -> 8191       : 49       |                                        |
      8192 -> 16383      : 172      |                                        |
     16384 -> 32767      : 298      |*                                       |
     32768 -> 65535      : 395      |*                                       |
     65536 -> 131071     : 758      |***                                     |
    131072 -> 262143     : 1290     |*****                                   |
    262144 -> 524287     : 9138     |****************************************|
```
The latency seems pretty high!
- Provision another VM with a local SSD and re-run the command:
```
disk = 'nvme0n1'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 47812    |****                                    |
        64 -> 127        : 171016   |***************                         |
       128 -> 255        : 448958   |****************************************|
       256 -> 511        : 136858   |************                            |
       512 -> 1023       : 50191    |****                                    |
      1024 -> 2047       : 69666    |******                                  |
      2048 -> 4095       : 57147    |*****                                   |
      4096 -> 8191       : 7615     |                                        |
      8192 -> 16383      : 246      |                                        |
     16384 -> 32767      : 47       |                                        |
     32768 -> 65535      : 41       |                                        |
     65536 -> 131071     : 120      |                                        |
    131072 -> 262143     : 298      |                                        |
    262144 -> 524287     : 511      |                                        |
    524288 -> 1048575    : 1263     |                                        |
   1048576 -> 2097151    : 1658     |                                        |
   2097152 -> 4194303    : 152      |                                        |
```
As expected, switching to SSD gives much lower latency.
3. opensnoop
opensnoop traces all files that were opened on the system (i.e. it traces open() syscalls).
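One flag worth noting before the steps: opensnoop -x (a bcc option) restricts the output to failed opens only, which ties into the missing-config-file hunt further down. A sketch:

```
# show only failed open()s system-wide; add -n NAME to scope it to one app
sudo /usr/share/bcc/tools/opensnoop -x
```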
Steps I've taken:
- In one tab, run sudo /usr/share/bcc/tools/opensnoop (optionally pass -n to scope it to a specific app)
- In another tab, run some user app (in my case it was ProxySQL, since I already had it compiled on that devbox)
- Observe the files that ProxySQL attempted to open:
```
$ sudo /usr/share/bcc/tools/opensnoop -n proxysql
PID COMM FD ERR PATH
2470 proxysql 3 0 /etc/ld.so.cache
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libgnutls.so.30
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libpthread.so.0
2470 proxysql 3 0 /lib/x86_64-linux-gnu/librt.so.1
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libdl.so.2
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libm.so.6
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libgcc_s.so.1
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libc.so.6
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libz.so.1
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libp11-kit.so.0
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libidn2.so.0
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libunistring.so.2
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libtasn1.so.6
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libnettle.so.6
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libhogweed.so.4
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libgmp.so.10
2470 proxysql 3 0 /usr/lib/x86_64-linux-gnu/libffi.so.6
2470 proxysql 3 0 /proc/sys/vm/overcommit_memory
2470 proxysql 3 0 /sys/kernel/mm/transparent_hugepage/enabled
2470 proxysql 3 0 /etc/nsswitch.conf
2470 proxysql 3 0 /etc/ld.so.cache
2470 proxysql -1 2 /lib/x86_64-linux-gnu/tls/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/tls/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64-linux-gnu/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/tls/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/tls/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64-linux-gnu/libnss_db.so.2
2470 proxysql -1 2 /lib/tls/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/tls/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/x86_64/libnss_db.so.2
2470 proxysql -1 2 /lib/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/tls/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/tls/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/tls/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/x86_64/libnss_db.so.2
2470 proxysql -1 2 /usr/lib/libnss_db.so.2
2470 proxysql 3 0 /etc/ld.so.cache
2470 proxysql 3 0 /lib/x86_64-linux-gnu/libnss_files.so.2
2470 proxysql 3 0 /etc/services
2470 proxysql -1 2 /usr/local/ssl/openssl.cnf
2470 proxysql 3 0 /etc/localtime
2470 proxysql 3 0 /dev/urandom
2470 proxysql 3 0 /home/kir/proxysql-key.pem
2470 proxysql 3 0 /home/kir/proxysql-cert.pem
2470 proxysql 3 0 /usr/bin/proxysql
2470 proxysql 4 0 /home/kir/proxysql.pid
```
Under the hood, opensnoop is essentially a fancy wrapper around sudo bpftrace -e 'tracepoint:syscalls:sys_enter_open*'.
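Spelled out as a full (if much less polished) one-liner, that would look something like this sketch; it traces openat(), which is the syscall most opens go through on a modern libc, and it lacks the error reporting and filtering that the bcc tool adds on top:

```
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%-16s %s\n", comm, str(args->filename)); }'
```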
However, tracing open() syscalls only shows files that a process actually tried to open. If an app first checks whether a file exists, and skips the open() when it doesn't, that missing file never shows up in the trace. Catching those checks is useful when you want to trace an app that's failing because of a config file that doesn't exist.
A common scenario could be:
```
if File.exist?("path/to/config")
  File.open("path/to/config")
else
  # some other behaviour
end
```
In my case, it was ProxySQL failing to start because it couldn't find the config file:
```
$ proxysql
2019-12-25 18:26:58 main.cpp:829:ProxySQL_Main_process_global_variables(): [WARNING] Unable to open config file /etc/proxysql.cnf
[Warning]: Cannot open any default config file . Using default datadir in current working directory /home/kir
...
```
Let's find the syscall that attempts (and fails) to read /etc/proxysql.cnf. From what I found, it's the access syscall that is used to check for file existence. It was easy enough to come up with a bpftrace one-liner:
```
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_access { printf("%s %s\n", comm, str(args->filename)); }'
Attaching 1 probe...
proxysql /etc/ld.so.preload
proxysql /etc/proxysql.cnf
proxysql /home/kir/proxysql-key.pem
proxysql /home/kir/proxysql-cert.pem
proxysql /home/kir/proxysql-ca.pem
```
Also, at this point I started to peek at and learn the bpftrace syntax itself, as opposed to the user-friendly tools built around it like opensnoop.
Maybe I could even find the stacktrace?
```
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_access { printf("%s %s %s\n", comm, str(args->filename), ustack); }'
Attaching 1 probe...
proxysql /etc/ld.so.preload
        0x7f4cdaf6e9d7
        0x7f4cdaf6c9ef
        0x40

proxysql /etc/proxysql.cnf
        access+7
        0x6e632e6c71737978
...
```
Not much info from the stack trace anyway; debugging symbols must be missing. Let's try recompiling ProxySQL with make debug:
```
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_access { printf("%s %s %s\n", comm, str(args->filename), ustack); }'
proxysql /etc/ld.so.preload
        0x7f95ff32f9d7
        0x7f95ff32d9ef
        0x40

proxysql /etc/proxysql.cnf
        0x7f95fed0a0e7
        _ZN19ProxySQL_ConfigFile8OpenFileEPKc+94
        _Z38ProxySQL_Main_process_global_variablesiPPKc+125
        main+115
        0x7f95fec23b6b
        0x41fd89415541f689
```
Yay! Now we know that it's the ProxySQL_ConfigFile::OpenFile function (the mangled _ZN19ProxySQL_ConfigFile8OpenFileEPKc in the stack above) that checks for the existence of /etc/proxysql.cnf using the access() syscall.
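As a side note, the symbol names in that stack are C++-mangled; c++filt from binutils decodes them, which is how you can double-check the demangled name above. It should print something like:

```
$ echo _ZN19ProxySQL_ConfigFile8OpenFileEPKc | c++filt
ProxySQL_ConfigFile::OpenFile(char const*)
```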
Inspecting failed DNS queries
I wanted to play more with raw bpftrace -e, and I came up with an actual problem from production that would be interesting to debug.
At work, I observed a tiny percentage of RPCs failing due to DNS lookups timing out. I don't yet know why that might be happening, but with BPF I could at least observe those events better.
DNS resolution is done by getaddrinfo(3). I found gethostlatency.bt among the bpftrace sample tools, which measures host resolution latency: something close to my case.
Here is the gist of it:
```
uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname2
{
  @start[tid] = nsecs;
  @name[tid] = arg0;
}

uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname2
/@start[tid]/
{
  $latms = (nsecs - @start[tid]) / 1000000;
  time("%H:%M:%S ");
  printf("%-6d %-16s %6d %s\n", pid, comm, $latms, str(@name[tid]));
  delete(@start[tid]);
  delete(@name[tid]);
}
```
The way I understand it, it does the following:
- Capture the entry to the user-land functions getaddrinfo/gethostbyname/gethostbyname2 (uprobes)
- Record the timestamp and the host being resolved
- Capture the return from getaddrinfo/gethostbyname/gethostbyname2 (uretprobes)
- Print a summary of the trace, including the delta of the timestamps
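To sanity-check the probes, I can run the script in one tab and trigger a lookup from another (a sketch; gethostlatency.bt is just whatever file name you saved the gist under):

```
# tab 1: attach the uprobes/uretprobes from the gist above
sudo bpftrace gethostlatency.bt

# tab 2: trigger a getaddrinfo() call
ping -c1 example.com
```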
What I'm looking for is to print only the failed DNS lookups. Luckily, the return probe (uretprobe) has the retval (return value) builtin available. According to the getaddrinfo(3) docs, a non-zero return code means the call failed. Let's filter by that:
```
BEGIN
{
  printf("Tracing getaddr/gethost calls... Hit Ctrl-C to end.\n");
  printf("%-9s %-6s %-16s %6s %s\n", "TIME", "PID", "COMM", "LATms", "HOST");
}

uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname2
{
  @name[tid] = arg0;
}

uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname2
{
  if (retval != 0) {
    // for some reason retval < 0 didn't work in the if statement
    time("%H:%M:%S ");
    printf("failed | %-6d %-16s (error: %d) %s\n", pid, comm, retval, str(@name[tid]));
  }
  delete(@name[tid]);
}
```

(The retval < 0 comparison most likely didn't work because retval is an unsigned builtin in bpftrace; casting it first, e.g. (int64)retval < 0, should do the trick.)
Now, how do I locally reproduce a failing DNS lookup? The trick I learned earlier is to point the resolver at a non-existent DNS server:
```
# find your original DNS server
$ cat /etc/resolv.conf | grep -i '^nameserver' | head -n1 | cut -d ' ' -f2

# to take DNS down, point it to 127.0.0.1:
$ echo "$(sed 's/<your-original-DNS-server>/127.0.0.1/g' /etc/resolv.conf)" > /etc/resolv.conf

# to bring DNS back up:
$ echo "$(sed 's/127.0.0.1/<your-original-DNS-server>/g' /etc/resolv.conf)" > /etc/resolv.conf
```
Now, with the probe running, we can run curl google.com (which will fail on the name lookup) and see the following output from bpftrace:

```
20:28:19 failed | 24087 curl (error -3) google.com
```
Hurray!
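For the record, the numeric error maps to the EAI_* constants in netdb.h; if I read the glibc headers right, -3 is EAI_AGAIN ("temporary failure in name resolution"), which is exactly what you'd expect from a dead nameserver. One way to check, assuming the libc headers are installed:

```
grep 'define EAI_' /usr/include/netdb.h
```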
There's probably another story about how to run this safely in a production environment over longer periods, which I'm yet to learn.
These are my notes on what I've learned so far from the BPF book, and I'm only about a quarter of the way through. Let's see what the rest of the book brings.
The goal of publishing this is mostly so that I can come back and revisit the steps I've documented, but I hope it could be useful for others too.
I'm eager to hear if there's anything here that I misunderstood or something that could be done better; please feel free to contact me on Twitter and point it out.
Also, thanks to my friend Javier Honduco for walking me through and explaining some of the things I've learned.