Yesterday I talked about using
perf to profile assembly
instructions. Today I learned how to make flame graphs with
today and it is THE BEST. I found this because
Graydon Hoare pointed me to Brendan
page on how to use perf.
Wait up! What’s
perf? I’ve talked about
strace a lot before (in
Debug your programs like they’re closed source).
strace lets you see which system calls a program is calling. But
what if you wanted to know
- how many CPU instructions it ran?
- How many L1 cache misses there were?
- profiling information for each assembly instruction?
strace only does system calls, and none of those things are system
calls. So it can’t tell you any of those things!
perf is a Linux tool that can tell you all of these things, and
more! Let’s run a quick example on the
bytesum program from yesterday.
[email protected] ~/w/howcomputer> perf stat ./bytesum_mmap *.mp4 Performance counter stats for './bytesum_mmap The Newsroom S01E04.mp4': 158.141639 task-clock # 0.994 CPUs utilized 22 context-switches # 0.139 K/sec 9 CPU-migrations # 0.057 K/sec 133 page-faults # 0.841 K/sec 438,662,273 cycles # 2.774 GHz [82.43%] 269,916,782 stalled-cycles-frontend # 61.53% frontend cycles idle [82.38%] 131,557,379 stalled-cycles-backend # 29.99% backend cycles idle [66.66%] 681,518,403 instructions # 1.55 insns per cycle # 0.40 stalled cycles per insn [84.88%] 130,568,804 branches # 825.645 M/sec [84.85%] 20,756 branch-misses # 0.02% of all branches [83.68%] 0.159154389 seconds time elapsed
This is super neat information, and there’s a lot more (see
list). But we can do even more fun things!
Flame graphs with perf
I wanted to profile my
bytesum program. But how do you even profile
C programs? Here’s a way to do it with
sudo perf record -g ./bytesum_mmap *.mp4 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Here’s the SVG this gave me:
This is AMAZING. But what does it mean? Basically
interrupts the program and finds out where in the stack it is. The
width of each part of the stack in the graph above is the proportion
of samples that happened there. (so about 30% of the execution time
was spend in
main). I don’t know what the colour means here.
We can see that there are 3 big parts – there’s the
mmap call (on
the left), the main program execution (in the middle), and the
sys_exit part on the right. Apparently stopping my program takes a
long time! Neat!
But there’s more!
Is it really L1 cache misses? We can find out!
So yesterday I made a program with really bad memory access patterns (bytesum_stride.c), and I conjectured that it was way slower because it was causing way too many L1 cache misses.
perf, we can check if that’s actually true! Here are the
results (reformatted a bit to be more compact):
[email protected] ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_mmap *.mp4 17,175,214 L1-dcache-misses # 11.48% of all L1-dcache hits 149,568,438 L1-dcache-loads [email protected] ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_stride *.mp4 1000 1,031,902,483 L1-dcache-misses # 193.16% of all L1-dcache hits 534,219,219 L1-dcache-loads
So, uh, that’s really bad. We now have 60 times more L1 cache misses, and also 3 times more hits.
Other amazing things
- Go to Brendan Gregg’s perf page and read the whole thing. Also possibly everything he’s ever written. His recent post on strace is great too.
- The perf tutorial is pretty long, but I found it somewhat helpful.
- I spent a little bit of time running cachegrind with
valgrind --tool=cachegrind ./bytesum_mmap my_filewhich can give you possibly even more information about CPU caches than
perfcan. Still haven’t totally wrapped my head around this.
There are still so many things I don’t understand at all!