I'm currently trying to get started with profiling. My development box runs Fedora 20 (Linux) and, unfortunately, CodeXL crashes while collecting events. So I've been forced to use Linux perf with the raw event selectors taken from the "BKDG for AMD Family 10h Processors".
Anyway, I don't think the method of collecting the stats is important here; I just mention it for completeness.
According to my perf runs, the code I'm looking at has what I consider a relatively high number of "Dispatch Stalls" (event selector 0xd1): 10-15% of all cycles. Further investigation showed that these are "Dispatch Stall for Reservation Station Full" events (event selector 0xd6).
"Instruction fetch stalls" (event selector 0x87) are relatively large, too: more than 50% of all cycles. The number of "Decoder Empty" (event selector 0xd0) events is negligible though.
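For reference, here is a minimal sketch of how the four selectors mentioned above could be turned into a `perf stat` invocation and how the resulting counts could be expressed as a share of all cycles. It assumes 8-bit event selectors with a unit mask of 0 and uses perf's generic `cycles` event as the denominator; the dictionary keys and function names are made up for illustration.

```python
# Event selectors quoted in the post (BKDG for AMD Family 10h Processors).
# The dictionary keys are made-up labels for this sketch.
EVENTS = {
    "dispatch_stalls": 0xD1,
    "rs_full_stalls": 0xD6,
    "ifetch_stalls": 0x87,
    "decoder_empty": 0xD0,
}


def raw_event(selector, unit_mask=0):
    """Format a selector in perf's raw 'rNNN' hex syntax.

    Assumption: 8-bit selector and 8-bit unit mask, unit mask in bits 8-15.
    """
    if not (0 <= selector <= 0xFF and 0 <= unit_mask <= 0xFF):
        raise ValueError("sketch only handles 8-bit selectors and unit masks")
    return "r{:03x}".format((unit_mask << 8) | selector)


def perf_stat_command(program_argv):
    """Build the argv list for 'perf stat' counting cycles plus the raw events."""
    events = ["cycles"] + [raw_event(sel) for sel in EVENTS.values()]
    return ["perf", "stat", "-e", ",".join(events), "--"] + list(program_argv)


def percent_of_cycles(counts):
    """Express every counter in 'counts' as a percentage of counts['cycles']."""
    cycles = counts["cycles"]
    return {
        name: 100.0 * value / cycles
        for name, value in counts.items()
        if name != "cycles"
    }
```

The counts passed to `percent_of_cycles` would be copied by hand from the `perf stat` output; the sketch deliberately does not parse that output, since its format differs between perf versions.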
L1 misses (*TLB, dcache, icache) and branch mispredictions are close to zero. The code doesn't use the FPU, either directly or through SSE or similar instructions (verified in the assembly listing).
Since the "Decoder Empty" events are relatively rare, I assume that the "Instruction fetch stalls" are not the cause of the problem but a consequence of the macro-ops not being dispatched fast enough to the integer execution unit.
First question: is this interpretation correct?
Now I wonder what could cause these dispatch stalls.
If I read Appendix A.11.2 ("Integer Execution Unit") of the "Software Optimization Guide for AMD Family 10h and 12h Processors" correctly, such a stall means that the Instruction Control Unit can't dispatch a macro-op to any of the three integer schedulers because all of their 8 entries are already occupied. Alternatively, the macro-op to be dispatched is one of the special operations (multiply, divide, LZCNT or POPCNT) and the single scheduler capable of handling it is full.
Second question: Is this understanding correct?
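To make the understanding in question concrete, it can be written down as a small predicate. This is only a toy model of the reading described above, not a confirmed description of the hardware; in particular, which of the three schedulers handles the special operations is a placeholder chosen for illustration.

```python
SCHEDULER_ENTRIES = 8   # entries per integer scheduler, as read from the guide
NUM_SCHEDULERS = 3      # three integer schedulers
SPECIAL_SCHEDULER = 0   # hypothetical index of the scheduler for special ops


def can_dispatch(occupancy, special=False):
    """Return True if a macro-op could be dispatched under this reading.

    occupancy: number of occupied entries in each of the three schedulers.
    special:   True for multiply, divide, LZCNT or POPCNT.
    """
    if len(occupancy) != NUM_SCHEDULERS:
        raise ValueError("expected one occupancy value per scheduler")
    if special:
        # Only one scheduler can take the op, so only its free space counts.
        return occupancy[SPECIAL_SCHEDULER] < SCHEDULER_ENTRIES
    # An ordinary op stalls only if every scheduler is full.
    return any(used < SCHEDULER_ENTRIES for used in occupancy)
```

Under this model, a "Reservation Station Full" stall corresponds to `can_dispatch` returning False for the macro-op at the head of the dispatch queue.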
Now, what could actually cause the "Dispatch Stall for Reservation Station Full" events? Some integer operation with a large latency (perhaps bsf or bsr, which take 4 cycles)?
Every 24th instruction is a bsf.
It might also be worth mentioning that my code contains a large number of branches: 25-30% of all retired instructions are branch instructions (all direct).
Is the length of dependency chains significant for the dispatch stalls here?
I think not; otherwise I would expect to see some "Dispatch Stall for Reorder Buffer Full" events, wouldn't I?
Since I'm an absolute newbie at interpreting performance counters, I would really appreciate your help!
Thanks a lot for your time and efforts,
Can you share some details regarding the CodeXL crash you observed? Which version of CodeXL are you running? Which Linux distribution and version? Where is the crash reported? Can you share a dump?
CodeXL 1.3.3487.0 running on Fedora 20 Beta (x86_64).
I reproduced the crash by profiling /bin/true.
As unprivileged user:
When doing IBS sampling, a popup entitled "CPU Profiling Error" appears saying
"The driver failed to start profiling. (error code 0x8000ffff)
Profiling as non-root user with current configuration requires
to be set to -1 by privilege users."
I actually have `perf_event_paranoid' set to -1. No crash here, just this error message.
No new messages show up in /var/log/audit/audit.log, so I guess it's not some SELinux related issue.
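The value of the setting can be double-checked from a script as well. A minimal sketch, assuming the kernel exposes it at the usual `/proc/sys/kernel/perf_event_paranoid` location; the function name is made up for illustration.

```python
def read_perf_event_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Return the current perf_event_paranoid level as an integer.

    Assumption: the file contains a single (possibly negative) integer.
    """
    with open(path) as f:
        return int(f.read().strip())
```

On the box described above, this should return -1 if the setting is really in effect for the running kernel.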
As root (because I trust you guys so much):
Crashes with uncaught std::bad_alloc
I have created a core dump at point of throwing (you have to switch to thread "6" to see it): https://www.hidrive.strato.com/lnk/VgFiOU2f (validity 10 days, 5 downloads).
$ md5sum /opt/AMD/CodeXL_1.3.3487/bin/CodeXL-bin /lib64/libstdc++.so.6
$ rpm -qf /lib64/libstdc++.so.6
Anyway, my intention in starting this thread was not so much to get CodeXL running as to get answers to my questions about the interpretation of these particular performance metrics.