
chersanya
Journeyman III

Understanding performance counters

I have a kernel in which each work-item processes tens of elements: it first performs some computation, then does a global memory read and write. The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

GlobalWorkSize      126720
WorkGroupSize       256
VGPRs               13
FCStacks            2
ALUInsts            6787.63
FetchInsts          52.47
WriteInsts          26.47
ALUBusy             98.31
ALUFetchRatio       129.37
ALUPacking          72.16
FetchSize           411503.38
CacheHit            0.09
FetchUnitBusy       89.44
FetchUnitStalled    93.86
WriteUnitStalled    0.00
FastPath            9.19
CompletePath        28.91
PathUtilization     30.52

LDS is not used at all, and kernel occupancy is 50% (16 waves, limited by VGPRs).
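As a sanity check on the numbers above, here is a minimal sketch that derives a few quantities from the counters. The wavefront size of 64 work-items is an assumption (it is not stated in the profiler output), and the variable names are just for illustration.

```python
# Counter values copied from the profiler output above.
global_work_size = 126720
work_group_size = 256
alu_insts = 6787.63
fetch_insts = 52.47
write_insts = 26.47

# Assumption: 64 work-items per wavefront (not stated in the profiler output).
WAVEFRONT_SIZE = 64

# Number of work-groups and wavefronts launched per kernel run.
work_groups = global_work_size // work_group_size
wavefronts = global_work_size // WAVEFRONT_SIZE

# Ratio of ALU to fetch instructions, to compare with the reported ALUFetchRatio.
alu_fetch_ratio = alu_insts / fetch_insts

# Fetch instructions per write instruction.
fetch_write_ratio = fetch_insts / write_insts

print(f"work-groups:          {work_groups}")
print(f"wavefronts:           {wavefronts}")
print(f"ALUInsts/FetchInsts:  {alu_fetch_ratio:.2f}")
print(f"FetchInsts/WriteInsts: {fetch_write_ratio:.2f}")
```

Under that assumption this gives 495 work-groups and 1980 wavefronts, and the computed ALU-to-fetch ratio lands very close to the reported ALUFetchRatio of 129.37, so the counters look mutually consistent.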

I can't understand several points here:

  • What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is 2.
  • How can ALUBusy be 98% when ALUPacking is low (72%)? Judging by ALUPacking, not all VLIW slots are filled, so I would expect ALUBusy not to be so close to 100%.
  • FetchUnitStalled > FetchUnitBusy, although the documentation says that FetchUnitBusy includes stalled time. How is that possible?

And how can I improve ALUPacking up to 100%?
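To make the ALUPacking question more concrete, the percentage can be converted into an average number of filled slots per VLIW bundle. This is only a rough sketch: it assumes ALUPacking is simply the fraction of slots filled, and since the bundle width of the device is not stated here, it prints both the 4-wide and 5-wide cases. The helper `filled_slots` is hypothetical.

```python
# Reported ALUPacking value from the profiler output.
alu_packing_pct = 72.16


def filled_slots(packing_pct: float, vliw_width: int) -> float:
    """Average number of occupied slots per VLIW bundle.

    Assumption: packing_pct is the fraction of slots filled, in percent.
    """
    return packing_pct / 100.0 * vliw_width


# The VLIW width is an assumption; try both common widths.
for width in (4, 5):
    slots = filled_slots(alu_packing_pct, width)
    print(f"VLIW{width}: about {slots:.2f} of {width} slots filled per bundle")
```

Read this way, the question is how to give the compiler enough independent operations to fill the remaining slots in each bundle.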
