I have a kernel in which each work-item processes tens of elements (first performing some computation, then a global memory read and write). The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):
GlobalWorkSize 126720
WorkGroupSize 256
VGPRs 13
FCStacks 2
ALUInsts 6787.63
FetchInsts 52.47
WriteInsts 26.47
ALUBusy 98.31
ALUFetchRatio 129.37
ALUPacking 72.16
FetchSize 411503.38
CacheHit 0.09
FetchUnitBusy 89.44
FetchUnitStalled 93.86
WriteUnitStalled 0.00
FastPath 9.19
CompletePath 28.91
PathUtilization 30.52
LDS is not used at all, and kernel occupancy is 50% (VGPR-limited to 16 waves).
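For reference, here is a quick sanity check of a few quantities derived from the counters above. It is only a sketch under two assumptions that are not part of the profiler output: a wavefront size of 64 work-items, and a VLIW width of either 4 or 5 slots (I list both because the width depends on the GPU generation). All variable names are my own.

```python
# Counter values copied from the profiler output above.
global_work_size = 126720
work_group_size = 256
alu_insts = 6787.63
fetch_insts = 52.47
alu_packing_percent = 72.16

# Assumption: 64 work-items per wavefront.
WAVEFRONT_SIZE = 64

work_groups = global_work_size // work_group_size
wavefronts = global_work_size // WAVEFRONT_SIZE
waves_per_group = work_group_size // WAVEFRONT_SIZE

# ALUFetchRatio should be roughly ALUInsts / FetchInsts.
alu_fetch_ratio = alu_insts / fetch_insts

print("work-groups:", work_groups)
print("wavefronts:", wavefronts)
print("wavefronts per work-group:", waves_per_group)
print("ALUInsts / FetchInsts:", round(alu_fetch_ratio, 2))

# Assumption: the hardware is VLIW4 or VLIW5; the average number of
# filled slots per bundle is then width * ALUPacking.
for vliw_width in (4, 5):
    filled = vliw_width * alu_packing_percent / 100.0
    print("VLIW%d: about %.2f of %d slots filled" % (vliw_width, filled, vliw_width))
```

The computed ratio comes out very close to the reported ALUFetchRatio of 129.37, so the counters at least look consistent with each other.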
There are several points here that I can't understand:
- What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is two.
- How can ALUBusy be 98% with such a low ALUPacking (72%)? As I understand ALUPacking, not all VLIW slots are filled, so ALUBusy shouldn't be so close to 100%.
- FetchUnitStalled > FetchUnitBusy, although it is documented that FetchUnitBusy includes stalled time. How is that possible?
And how can I improve ALUPacking up to 100%?