Ok, I think I asked this before and I either 1) misunderstood the answer or 2) got the incorrect answer, probably #1.

In the profiler, the number of ALU instructions reported is the TOTAL number of instructions executed, NOT the number of ALU bundles executed, correct?

For example, taking the Black Scholes sample at a 2k*2k problem size, the run time is ~7.9 ms.

The ALU Busy is very high: 99.78, so each bundle does ~4.77 instructions. The profiler reports it is executing 440 ALU instructions per wavefront for a total of 64k wavefronts (2k*2k/64).

If we assume the profiler means bundles, then the calculated run time (counting only ALU instructions, since the ALU Busy is so high) comes out to ~34 ms on a 5870 at 850 MHz core clock, not even close to the ~7.9 ms reported by the profiler. If we then divide by the packing, 34 ms / 4.77 ~= 7.1 ms, which is much closer to the real run time of this mostly ALU-bound kernel.
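To make the arithmetic explicit, here is a rough Python sketch of that estimate. The throughput figure (one wavefront-wide instruction retired per core clock across the whole chip) is an assumption, picked only because it reproduces the ~34 ms number. The function name and its parameters are purely illustrative and are not anything the profiler exposes.

```python
def alu_bound_ms(threads, wavefront_size, instr_per_wavefront,
                 clock_hz, wavefront_instr_per_clock=1.0):
    """Estimate run time in ms for a kernel assumed to be purely ALU bound.

    wavefront_instr_per_clock is an ASSUMED chip-wide throughput, not a
    value reported by the profiler; change it to test other models.
    """
    wavefronts = threads / wavefront_size
    cycles = wavefronts * instr_per_wavefront / wavefront_instr_per_clock
    return cycles / clock_hz * 1e3


threads = 2048 * 2048   # 2k*2k problem size
packing = 4.77          # instructions per bundle, from the post

# Case 1: treat the reported 440 as bundles per wavefront.
as_bundles_ms = alu_bound_ms(threads, 64, 440, 850e6)

# Case 2: treat the reported 440 as individual instructions,
# i.e. 440 / packing bundles per wavefront.
as_instr_ms = as_bundles_ms / packing

print(f"440 read as bundles:      {as_bundles_ms:.1f} ms")
print(f"440 read as instructions: {as_instr_ms:.1f} ms")
```

Under that assumed throughput, only the second reading lands near the measured run time, which is the whole basis of my question.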

So, my point is:

1) The ALU:Fetch ratio reported by the profiler not only doesn't take into account the 4:1 scaling the SKA does (which is fine by me, I don't like that the SKA does it anyway), BUT it's also not given on a cycle scale, so on its own it's not a real indication of the ratio of cycle usage. You first have to multiply the number of VLIW cores by the ALU packing and then divide the ALU instr count by that number.

2) I hope the ALU Busy is calculated from ALU cycles and not from the ALU instr count alone (since that isn't cycle accurate)?

3) Just as a side profiler note, it would be nice if, in the worst-case scenario (no mem/ALU overlap), the busy numbers added up to 100%, but I suppose since there is no Write Busy it's impossible to tell.

Thanks.

I was kind of hoping someone from AMD could comment, at least on #2.