I'm currently trying to get some hands-on experience with profiling. My development box runs F20 (Linux) and, unfortunately, CodeXL crashes while collecting the events. So I've been forced to use Linux perf to monitor the raw event selectors taken from the "BKDG for AMD Family 10h Processors".
Anyway, I don't think the method of collecting the stats is important here; I just mention it for completeness.
According to my perf runs, the code I'm looking at has (in my opinion) a relatively high number of "Dispatch Stalls" (event selector 0xd1): 10-15% of all cycles. Further investigation showed that these are "Dispatch Stall for Reservation Station Full" events (event selector 0xd6).
"Instruction fetch stalls" (event selector 0x87) are relatively large, too: more than 50% of all cycles. The number of "Decoder Empty" (event selector 0xd0) events is negligible though.
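For reference, this is roughly how I turn the raw counts into the "percent of all cycles" figures quoted above. It is only a minimal sketch: the function name `stall_shares`, the dictionary keys and all the numbers are made up for illustration and are not my measured data. I assume the counts come from a perf run using raw events (something like `perf stat -e r0d1,r0d6,r087,r0d0`) together with a cycle count.

```python
# Hypothetical helper: express raw event counts as a percentage of total cycles.
def stall_shares(cycles, counts):
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return {name: 100.0 * n / cycles for name, n in counts.items()}

# Made-up counts for illustration only (NOT measured data).
example = stall_shares(
    1_000_000,
    {
        "dispatch_stalls_0xd1": 120_000,
        "rs_full_stalls_0xd6": 110_000,
        "ifetch_stalls_0x87": 550_000,
        "decoder_empty_0xd0": 5_000,
    },
)

for name, pct in example.items():
    print(f"{name}: {pct:.1f}% of cycles")
```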
L1 misses (*TLB, dcache, icache) and branch mispredictions are close to zero. The code doesn't use the FPU, either directly or through SSE or similar instructions (verified from the assembly listing).
Since the "Decoder Empty" events are relatively rare, I assume that the "Instruction fetch stalls" are not the cause of the problem but a consequence of the fact that macro ops can't be dispatched fast enough to the integer execution unit.
First question: is this interpretation correct?
Now I wonder what could cause these dispatch stalls.
If I read Appendix A.11.2 ("Integer Execution Unit") of the "Software Optimization Guide for AMD Family 10h and 12h Processors" correctly, it means that the Instruction Control Unit can't dispatch a macro op to any of the three integer schedulers because each of them already has all 8 of its entries filled. Alternatively, the to-be-dispatched macro op is one of the special operations (multiply, divide, LZCNT or POPCNT) and the single scheduler capable of handling that operation is full.
Second question: Is this understanding correct?
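To make the understanding I'm asking about concrete, here it is as a tiny predicate. This is only my reading of the guide, not a description of how the hardware actually decides: the function name `can_dispatch`, the `special_lane` parameter and the choice of which scheduler handles a special operation are all assumptions for illustration.

```python
# Toy model of my reading: three integer schedulers with 8 entries each.
SCHEDULER_ENTRIES = 8

def can_dispatch(occupancy, special=False, special_lane=0):
    """occupancy: number of filled entries in each of the three schedulers.

    special=True models an operation (multiply, divide, LZCNT, POPCNT) that
    only one scheduler can handle; special_lane is a hypothetical index.
    """
    if special:
        return occupancy[special_lane] < SCHEDULER_ENTRIES
    return any(n < SCHEDULER_ENTRIES for n in occupancy)

# All three schedulers full: I would expect a "Reservation Station Full" stall.
print(can_dispatch([8, 8, 8]))
# One scheduler has room, but not the one the special operation needs.
print(can_dispatch([8, 3, 8], special=True, special_lane=0))
```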
Now, what could actually cause the "Dispatch Stall for Reservation Station Full" events? Some integer operation with a large latency (perhaps bsf or bsr, which take 4 cycles)?
Every 24th instruction is a bsf.
It may also be worth mentioning that my code has a large number of branches: 25-30% of all retired instructions are branch instructions (all direct).
Is the length of dependency chains significant for the dispatch stalls here?
I think not, since otherwise I would expect to see some "Dispatch Stall for Reorder Buffer Full" events.
Since I'm an absolute newbie in interpreting performance counters, I would really appreciate your help!
Thanks a lot for your time and efforts,