
Understanding performance counters

Question asked by chersanya on Aug 5, 2012
Latest reply on Aug 7, 2012 by binying

I have a kernel in which each work-item processes tens of elements (first some computation, then a global memory read and write). The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

 

GlobalWorkSize     126720
WorkGroupSize     256
VGPRs     13
FCStacks     2
ALUInsts     6787.63
FetchInsts     52.47
WriteInsts     26.47
ALUBusy     98.31
ALUFetchRatio     129.37
ALUPacking     72.16
FetchSize     411503.38
CacheHit     0.09
FetchUnitBusy     89.44
FetchUnitStalled     93.86
WriteUnitStalled     0.00
FastPath     9.19
CompletePath     28.91
PathUtilization     30.52

 

LDS is not used at all, and kernel occupancy is 50% (VGPR-limited to 16 waves).
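
A couple of the figures above can be cross-checked against each other. The following is a minimal Python sketch, not profiler output. It assumes that ALUFetchRatio is simply ALUInsts / FetchInsts, and, as a rough assumption, that ALUBusy multiplied by ALUPacking approximates how full the VLIW slots are over the kernel's run time. The `derived_metrics` helper and its key names are hypothetical, introduced only for illustration.

```python
# Counter values copied from the profiler listing above.
counters = {
    "GlobalWorkSize": 126720,
    "WorkGroupSize": 256,
    "ALUInsts": 6787.63,
    "FetchInsts": 52.47,
    "ALUBusy": 98.31,        # percent
    "ALUFetchRatio": 129.37,
    "ALUPacking": 72.16,     # percent
}


def derived_metrics(c):
    """Recompute a few values from the raw counters (illustrative helper)."""
    # Number of work-groups launched for this kernel.
    work_groups = c["GlobalWorkSize"] // c["WorkGroupSize"]

    # Assumption: the reported ALUFetchRatio is ALUInsts / FetchInsts.
    alu_fetch_ratio = c["ALUInsts"] / c["FetchInsts"]

    # Assumption: ALUBusy says how often a VLIW bundle is being executed,
    # ALUPacking says how full each bundle is, so their product is a rough
    # estimate of slot-level utilization (in percent).
    slot_utilization_pct = c["ALUBusy"] * c["ALUPacking"] / 100.0

    return {
        "work_groups": work_groups,
        "alu_fetch_ratio": alu_fetch_ratio,
        "slot_utilization_pct": slot_utilization_pct,
    }


if __name__ == "__main__":
    for name, value in derived_metrics(counters).items():
        print(f"{name}: {value:.2f}")
```

If those assumptions hold, the recomputed ratio should land very close to the reported ALUFetchRatio, and the product of ALUBusy and ALUPacking gives one number to watch while trying to improve packing.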

I can't understand several points here:

  • What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is two.
  • How can ALUBusy be 98% when ALUPacking is low (72%)? Judging by ALUPacking, not all VLIW slots are filled, so I would not expect ALUBusy to be so close to 100%.
  • FetchUnitStalled is greater than FetchUnitBusy, even though FetchUnitBusy is described as including stalled time. How is that possible?

 

And how can I improve ALUPacking to 100%?
