I was kind of hoping someone from AMD could comment, at least on #2.
The ALUInsts counter reported by the Stream Profiler is the average number of ALU instructions (ALU bundles, in your terminology above) executed per work-item.
There are two problems with the calculation in the first post: it assumes that the ALUInsts counter is per wavefront, and it doesn't take into account the number of compute units and stream cores in the GPU.
To correlate the kernel timing with the ALUInsts and ALUBusy counters, I'd use the following equation (it only makes sense for ALU-bound kernels, though):
Kernel Time (in seconds) = ( total work items * ALUInsts ) / ( ALUBusy * number of compute units * number of stream cores per compute unit * engine clock )
The ALUFetchRatio is just ALUInsts divided by FetchInsts.
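For anyone who wants to plug numbers into the equation above, here is a minimal Python sketch of it. The function and parameter names are mine, not the profiler's, and it assumes ALUBusy is entered as a percentage (so it is divided by 100) and the engine clock is given in Hz. The sample values at the bottom are purely illustrative, not measurements.

```python
def estimated_kernel_time_s(total_work_items, alu_insts, alu_busy_pct,
                            compute_units, stream_cores_per_cu, engine_clock_hz):
    """Estimate kernel time in seconds for an ALU-bound kernel.

    Assumption: alu_busy_pct is a percentage (0-100), converted to a fraction here.
    """
    alu_busy = alu_busy_pct / 100.0
    # ALU instructions the whole GPU can issue per second at this busy level.
    issue_rate = alu_busy * compute_units * stream_cores_per_cu * engine_clock_hz
    return (total_work_items * alu_insts) / issue_rate


def alu_fetch_ratio(alu_insts, fetch_insts):
    """ALUFetchRatio as described above: ALUInsts divided by FetchInsts."""
    return alu_insts / fetch_insts


if __name__ == "__main__":
    # Illustrative inputs only (hypothetical counter values).
    t = estimated_kernel_time_s(
        total_work_items=4096 * 4096,
        alu_insts=100.0,
        alu_busy_pct=50.0,
        compute_units=20,
        stream_cores_per_cu=16,
        engine_clock_hz=850e6,
    )
    print(f"estimated kernel time: {t * 1e3:.3f} ms")
    print(f"ALUFetchRatio: {alu_fetch_ratio(100.0, 25.0):.2f}")
```

If the estimate is far below the measured kernel time, that would suggest the kernel is not actually ALU bound, which is the caveat attached to the equation.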
Another question please:
For SimpleConvolution at 4k*4k, Fetch Busy is 26.59%, with ~0% stalled.
Is the effective GDDR5 data rate 4x the memory clock, or just 2x?
For example, the run time above is 16.7, so the fetch unit should be busy for ~4.4 of that (26.59% of 16.7).
But if you calculate it:
Total bits fetched: 2813297704 bits / (256 * 1200 MHz * 4) = ~2.3.
I'm confused about why the effective memory clock here appears to be 2400 MHz rather than 4800 MHz. Shouldn't it be 4800?
Or could anyone who knows about the GDDR5 on the 5870 comment?
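To make the arithmetic in the question easy to re-run, here is a small Python sketch that computes the transfer time for both the 2x and 4x multipliers and compares it with the time implied by Fetch Busy. It assumes the 16.7 run time is in milliseconds (the post does not state the unit), a 256-bit bus, and a 1200 MHz memory clock, as in the calculation above. The function name is hypothetical.

```python
def transfer_time_ms(bits_fetched, bus_width_bits, memory_clock_hz, transfers_per_clock):
    """Time in ms to move bits_fetched at the peak rate implied by the multiplier."""
    bandwidth_bits_per_s = bus_width_bits * memory_clock_hz * transfers_per_clock
    return bits_fetched / bandwidth_bits_per_s * 1e3


BITS_FETCHED = 2_813_297_704   # total bits fetched, from the post
RUN_TIME_MS = 16.7             # assumption: the reported run time is in ms
FETCH_BUSY = 0.2659            # Fetch Busy of 26.59% as a fraction

# Time the fetch unit was busy according to the counter.
busy_ms = RUN_TIME_MS * FETCH_BUSY
print(f"fetch busy time: {busy_ms:.2f} ms")

# Transfer time at peak bandwidth for each candidate multiplier.
for multiplier in (2, 4):
    t = transfer_time_ms(BITS_FETCHED, 256, 1200e6, multiplier)
    print(f"{multiplier}x memory clock: {t:.2f} ms")
```

Under these assumptions the 2x figure lands much closer to the busy time than the 4x figure. Note that the comparison treats Fetch Busy as if it measured time spent transferring at peak memory bandwidth, which may not be what the counter actually reports.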