GNOME Bugzilla – Bug 782407
missing stack traces when recording
Last modified: 2021-07-01 18:12:45 UTC
In some situations, a large share of the symbols is missing from the callgraph. I think this is limited to "single program recording", but we need to verify that. Here are some examples:

http://www-rocq.inria.fr/~lasgoutt/sysprof-lyx.png
http://www-rocq.inria.fr/~lasgoutt/perf-record.png

What we need to figure out next:

- Were the sysprof and perf recordings both recording just the target program, and not the whole system? It *appears* to me that I often get reduced information when recording just a single program as opposed to the whole system (and I'm not sure why).
- Did the capture contain all the symbols, with us merely failing to present them properly? (This helps us nail down whether the bug is in capture or in compute/display.)
Created attachment 351613: xzipped perf data

This perf data corresponds to the screenshot referred to in the bug description. LyX was compiled with the options
    -g -O2 -std=c++14 -fno-omit-frame-pointer
Then it was launched, and "perf record" was attached to its PID using
    sudo perf record -g -p PID
When recording with sysprof, did you use sudo by chance? For example:
    sudo sysprof-cli -p PID
I ask because when not using sudo, we have to ask a system service to elevate our privileges (and pass the perf FD back). As you can imagine, that expands the surface area for me to explore.
I used the GUI for capturing the samples. I did have to provide my password at some point. I can retry with sudo sysprof-cli, but this will be next week.
Sure. That means we are asking sysprofd to do our __NR_perf_event_open syscall and hand us back an open fd. That should probably be enough info for me to replicate as soon as I manage to find some free time to hack on Sysprof.
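To illustrate what "handing back an open fd" means, here is a minimal sketch of passing a file descriptor between two endpoints. It is not Sysprof's code: the thread does not say which transport sysprofd uses, so the sketch assumes a plain Unix-domain socket, and a pipe stands in for the perf event fd (which needs privileges to open). The names send_fd, recv_fd and demo are hypothetical. It needs Python 3.9 or later on a Unix system.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a Unix-domain socket (SCM_RIGHTS)."""
    # At least one byte of ordinary data must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent by send_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def demo() -> bytes:
    # Stand-ins: "helper" plays the privileged service, "client" the profiler.
    helper, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Assumption: a pipe replaces the real perf event fd for this demo.
    read_end, write_end = os.pipe()
    try:
        os.write(write_end, b"sample")
        send_fd(helper, read_end)      # the service hands the fd over
        received = recv_fd(client)     # the client gets its own duplicate
        try:
            return os.read(received, 6)
        finally:
            os.close(received)
    finally:
        os.close(read_end)
        os.close(write_end)
        helper.close()
        client.close()
```

The received descriptor is a separate number that refers to the same open file, so the client reads exactly what the service opened. This extra hop is one more place where the unprivileged path could differ from a direct "sudo" run.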
Is there some additional testing I could attempt to help your diagnosis?
FWIW, I tried again with sysprof-cli -p (which asked me for a password), and the capture.syscap file that was produced led to the same issue when loading it in sysprof.
I think the main thing preventing me from digging in right now is just lack of time. I am curious to know if you get different information simply by doing whole-system recording (omit -p). You'll of course get other application information, but I want to know if the amount of information you get on your process also improves.
I just tried it, and the result is the same. Only fjes_hw_epbuf_tx_pkt_send appears under src/lyx.
The funny thing (or a sad one, I do not know anymore) is that _all_ the processes are missing a proper stack trace. This is not related at all to the program that I am trying to profile. Could it be a bad (or unexpected) setting in Ubuntu?
Most distributions have misguided compilation settings out of the box (in my opinion). They generally compile with frame pointers disabled because on 32-bit x86, with so few registers, everything needed to be stashed on the stack, so freeing up the frame-pointer register actually was a non-trivial performance improvement. But on x86_64, that's just not the case. Maybe, if you're lucky, you snag a 0.5% speed-up. But alas, they tend to do it anyway, making it difficult to get fast, reliable stack traces.

What I mostly use Sysprof for is profiling the GNOME stack while we are developing it. That means I've built most of the core components in JHBuild (just some fancy Python build wrangler scripts) and changed all those settings to something more reasonable. (I generally set -fno-omit-frame-pointer and -O0, but the latter is not as important.)

But what I found interesting is that you had different data when profiling with perf. That means we have an issue in one of a couple of areas:

- We are calling the perf_event_open syscall differently than the perf command line tool does. (This is likely to some degree, but how much?)
- We are falling behind the mmap()'d ring buffer we communicate with perf over, and are therefore missing samples. (Seems unlikely to me.)
- We are failing to resolve the instruction pointers when generating the callgraph, and therefore they get lost or combined into "In file *". This one has more moving parts, because there is a bunch of trickery going on to locate the proper ELF. We have some symbol directories to look through, as well as cracking open the ELF and finding the build-id field w/ some CRC checks. Recently (in 3.24.0, I believe) I added support for locating symbols from inside of containers (when we see /newroot/ in the path), and this was a rather complicated set of heuristics. It's possible there is a regression there. If you're running on something older than 3.24, we can rule the /newroot/ stuff out.
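For readers who want to check the build-id part by hand, here is a minimal sketch of reading a GNU build-id out of raw ELF note data and turning it into the usual separate-debug-file path. This is not Sysprof's code. The function names are hypothetical, the sketch assumes the bytes of a note section such as .note.gnu.build-id have already been extracted, it assumes 4-byte note alignment, and it assumes the common /usr/lib/debug/.build-id/ layout.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type GNU toolchains use for the build-id


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (assumed note alignment)."""
    return (n + 3) & ~3


def find_build_id(notes: bytes, little_endian: bool = True) -> Optional[str]:
    """Return the build-id as a hex string, or None if no such note exists."""
    fmt = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(notes):
        # Each note starts with three 32-bit words: namesz, descsz, type.
        namesz, descsz, note_type = struct.unpack_from(fmt, notes, offset)
        offset += 12
        name = notes[offset:offset + namesz]
        offset += _align4(namesz)
        desc = notes[offset:offset + descsz]
        offset += _align4(descsz)
        if (note_type == NT_GNU_BUILD_ID and name == b"GNU\x00"
                and len(desc) == descsz):
            return desc.hex()
    return None


def debug_file_path(build_id: str, root: str = "/usr/lib/debug") -> str:
    """Map a hex build-id to the conventional separate debug file path."""
    return f"{root}/.build-id/{build_id[:2]}/{build_id[2:]}.debug"
```

Running "readelf -n" on the binary prints the same build-id, which gives a quick cross-check. If the file at the resulting path is missing, symbol resolution for that binary has to fall back on other lookups, which is one way frames can end up lumped into "In file *".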
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/sysprof/-/issues/ Thank you for your understanding and your help.