ryta1203
Journeyman III

Evaluating ALU Instructions in profiler

Ok, I think I asked this before and I either 1) misunderstood the answer or 2) got the incorrect answer, probably #1.

In the profiler, the number of ALU instructions reported is the TOTAL number of instructions executed, NOT the number of ALU bundles executed, correct?

For example, taking the Black Scholes sample at a 2k*2k problem size, the run time is ~7.9 ms.

The ALU Busy is very high: 99.78%, so each bundle does ~4.77 instructions. The profiler reports that it is executing 440 ALU instructions per wavefront, for a total of 64k wavefronts (2k*2k/64).

If we assume the profiler means bundles, then the calculated run time (assuming only ALU instructions, since the ALU Busy is so high) comes to ~34 ms (5870 at 850 MHz core clock), not even close to the ~7.9 ms reported by the profiler. If we then divide 34 ms / 4.77 ~= 7.2 ms, we get something much closer to the real run time of this mostly ALU-bound kernel.
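
The arithmetic in this estimate can be reproduced with a short Python sketch. It only mirrors the assumptions stated in the post (440 instructions per wavefront, one instruction issued per clock, no parallelism across compute units); the variable and function names are made up for illustration.

```python
# Back-of-the-envelope estimate from the post; names are illustrative only.
ALU_INSTS = 440                   # profiler count, treated here as per wavefront
WAVEFRONTS = 2048 * 2048 // 64    # 2k*2k work-items, 64 work-items per wavefront
CORE_CLOCK_HZ = 850e6             # 5870 core clock quoted in the post
PACKING = 4.77                    # instructions per bundle quoted in the post


def naive_runtime_ms(insts, wavefronts, clock_hz):
    """Time if every counted instruction took one clock, strictly in sequence."""
    return insts * wavefronts / clock_hz * 1e3


# Case 1: the counter is read as bundles.
bundle_estimate_ms = naive_runtime_ms(ALU_INSTS, WAVEFRONTS, CORE_CLOCK_HZ)
# Case 2: the counter is read as individual instructions, packed ~4.77 per bundle.
packed_estimate_ms = bundle_estimate_ms / PACKING

print(f"bundles:      {bundle_estimate_ms:.2f} ms")
print(f"instructions: {packed_estimate_ms:.2f} ms")
```

This works out to roughly 34 ms for the first case and just over 7 ms for the second.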

So, my point is:

1) The ALU:Fetch ratio reported by the profiler not only doesn't take into account the 4:1 scaling that the SKA applies (which is fine by me; I don't like that the SKA does it anyway), it also isn't given on a cycle scale, so on its own it is not a real indication of the ratio of cycle usage. You first have to multiply the number of VLIW cores by the ALU packing and then divide the ALU instruction count by that number.

2) I hope the ALU Busy is based on ALU cycles and is not calculated from the ALU instruction count alone (since that isn't cycle accurate)?

3) Just as a side note on the profiler: it would be nice if, in the worst-case scenario (no mem/ALU overlap), the busy numbers added up to 100%, but I suppose since there is no Write Busy it's impossible to tell.

Thanks.

0 Likes
5 Replies
ryta1203
Journeyman III

I was kind of hoping someone from AMD could comment, at least on #2.

0 Likes

The ALUInsts counter reported by the Stream Profiler is the average number of ALU instructions (ALU bundles, in your terminology above) executed per work-item.

There are two problems with the calculation in the first post: it assumes that the ALUInsts counter is per wavefront, and it doesn't take into account the number of compute units and stream cores in the GPU.

To correlate the kernel timing with the ALUInsts and ALUBusy counters, I'd use the following equation (it only makes sense for an ALU-bound kernel, though):

Kernel Time (in seconds) = ( total work items * ALUInsts ) / ( ALUBusy * number of compute units * number of stream cores per compute unit * engine clock )

The ALUFetchRatio is just ALUInsts divided by FetchInsts.
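
As a rough check, the kernel-time equation above can be applied to the Black Scholes numbers from the first post. This is only a sketch under assumptions that are not stated in the thread: ALUBusy is taken as a fraction (99.78% becomes 0.9978), and the HD 5870 is taken to have 20 compute units with 16 stream cores each. The function name is hypothetical.

```python
def alu_bound_kernel_time_s(work_items, alu_insts, alu_busy,
                            compute_units, stream_cores_per_cu,
                            engine_clock_hz):
    """Kernel time estimate for an ALU-bound kernel, per the equation above.

    alu_busy is a fraction in (0, 1], not a percentage (an assumption).
    """
    total_bundles = work_items * alu_insts
    bundles_per_second = (alu_busy * compute_units
                          * stream_cores_per_cu * engine_clock_hz)
    return total_bundles / bundles_per_second


estimate_ms = alu_bound_kernel_time_s(
    work_items=2048 * 2048,    # 2k*2k problem size from the first post
    alu_insts=440,             # ALUInsts, read as per work-item
    alu_busy=0.9978,           # 99.78% ALU Busy
    compute_units=20,          # assumed HD 5870 configuration
    stream_cores_per_cu=16,    # assumed HD 5870 configuration
    engine_clock_hz=850e6,     # 850 MHz engine clock
) * 1e3

print(f"estimated kernel time: {estimate_ms:.2f} ms")
```

Under these assumptions the estimate comes to about 6.8 ms, in the same range as the ~7.9 reported by the profiler.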

0 Likes

Thanks.

0 Likes

bpurnomo,

Another question please:

Given SimpleConvolution 4k*4k, Fetch Busy is 26.59%. ~0% stalled.

Is GDDR5 4xMemoryClock or just 2x?

For example, the run time above is 16.7 ms. So the fetch unit should be busy for ~4.4 ms.

But if you calculate it:

Bits fetched in total: 2813297704 bits / (256 * 1200 MHz * 4) = ~2.2 ms.

I'm confused: why does the effective memory clock here appear to be 2400 MHz and not 4800 MHz, as it should be?
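
The comparison being made here can be laid out in a few lines of Python. This is a sketch of the arithmetic only: it assumes the times are in milliseconds and that Fetch Busy multiplied by the run time approximates the time spent moving data over the memory bus, which may not hold in practice. The names are illustrative.

```python
BITS_FETCHED = 2_813_297_704   # total bits fetched, from the post
BUS_WIDTH_BITS = 256           # memory bus width used in the post
MEMORY_CLOCK_HZ = 1200e6       # 1200 MHz memory clock used in the post
RUNTIME_MS = 16.7              # kernel run time (units assumed to be ms)
FETCH_BUSY = 0.2659            # 26.59% Fetch Busy


def transfer_time_ms(bits, bus_width_bits, clock_hz, transfers_per_clock):
    """Time to move `bits` at full bus rate for a given data-rate multiplier."""
    return bits / (bus_width_bits * clock_hz * transfers_per_clock) * 1e3


fetch_busy_ms = RUNTIME_MS * FETCH_BUSY
time_4x_ms = transfer_time_ms(BITS_FETCHED, BUS_WIDTH_BITS, MEMORY_CLOCK_HZ, 4)
time_2x_ms = transfer_time_ms(BITS_FETCHED, BUS_WIDTH_BITS, MEMORY_CLOCK_HZ, 2)

print(f"Fetch Busy * run time: {fetch_busy_ms:.2f} ms")
print(f"transfer time at 4x:   {time_4x_ms:.2f} ms")
print(f"transfer time at 2x:   {time_2x_ms:.2f} ms")
```

The 2x figure lands close to the Fetch Busy estimate, which is the mismatch the question is about; the arithmetic by itself does not say which multiplier is correct.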

0 Likes

Or anyone who knows anything about the GDDR5 on the 5870?

0 Likes