I'm not sure I'm doing this correctly, so I just want to check:

There are three equations on page 1-24 of the SDK Guide describing how to calculate theoretical performance for stream core instructions, fetch instructions and memory instructions.

I have a few questions:

1. If a kernel has one input stream and one output stream, both float4, does the input count toward both the fetch AND the memory calculations, or just the fetch calculation?

2. The Stream Core Instructions figure for RV770 is 160 (16*10), correct?

3. The Fetch Instructions figure for RV770 is 40 (4*10), correct?

So, say I have a simple kernel with 4 inputs and 1 output (pixel shader, no global buffer, etc.) and an ALU:Fetch ratio of 1.0 (16 ALU ops), running on the RV770 over a 2D domain of 256x256.

Stream Core should be:

((256*256)*(16)) / ((160*750MHz))

Fetch should be:

((256*256)*(4)) / ((40)*(750MHz))

Memory should be:

((256*256)*(128)) / ((256)*(900MHz*2DDR))

This assumes that only the output counts toward memory, not the input, and that the output is float4 (32*4 = 128 bits).
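To make the arithmetic concrete, here is a quick Python sketch that just evaluates the three expressions above. The constant names are my own, the 160 and 40 figures are the ones I am asking about in questions 2 and 3, and I am reading the 256 in the memory denominator as the bus width in bits. None of this is confirmed against the SDK Guide.

```python
# All values are taken from this post and are assumptions, not verified specs.
DOMAIN = 256 * 256            # threads in the 2D domain
ENGINE_CLOCK_HZ = 750e6       # 750 MHz engine clock
MEMORY_CLOCK_HZ = 900e6 * 2   # 900 MHz memory clock, doubled for DDR
ALU_OPS = 16                  # ALU ops per thread
FETCH_OPS = 4                 # one fetch per input stream
OUTPUT_BITS = 128             # one float4 output (32 * 4 bits)
ALU_PER_CLOCK = 160           # assumed: 16 * 10
FETCH_PER_CLOCK = 40          # assumed: 4 * 10
BUS_WIDTH_BITS = 256          # assumed meaning of the 256 in the denominator


def estimate_times():
    """Return the three estimated times in seconds, as written in the post."""
    stream_core = DOMAIN * ALU_OPS / (ALU_PER_CLOCK * ENGINE_CLOCK_HZ)
    fetch = DOMAIN * FETCH_OPS / (FETCH_PER_CLOCK * ENGINE_CLOCK_HZ)
    memory = DOMAIN * OUTPUT_BITS / (BUS_WIDTH_BITS * MEMORY_CLOCK_HZ)
    return {"stream_core": stream_core, "fetch": fetch, "memory": memory}


times = estimate_times()
for name, seconds in times.items():
    print(f"{name}: {seconds * 1e6:.2f} us")
print("largest:", max(times, key=times.get))
```

With these inputs the stream core and fetch estimates both come to roughly 8.7 microseconds and the memory estimate to roughly 18.2 microseconds, which is why memory looks like the limiting term.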

In this case memory is the bottleneck (if I am calculating correctly), but it shouldn't be: according to the SKA, the ALU ops should be the bottleneck.

I assume I am calculating something incorrectly here, so what am I doing wrong?

On top of that, I'm not getting anywhere close to the expected times, even for very simple kernels (I'm only timing from RunProgram to EventDone).