Hi there,

To understand the theoretical performance of a kernel, is it true that each SPU basically performs one ALU instruction group per clock cycle? (Or four every four clock cycles, or something similar, taking into account elements being processed in groups?) This would yield the 500 GFLOPS float4 multiply-add performance. But does it also hold in particular for double-precision multiply-add (which would then give about 100 GFLOPS), and for instruction groups in which the transcendental unit is doing something "complicated" like a sin, square root, or reciprocal?

Thanks,

Steven.
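As a rough sanity check on those numbers, here is some back-of-the-envelope arithmetic. The SPU count, VLIW width, and clock below are assumptions for an HD 3870-class part, not figures taken from this thread:

```python
# All hardware figures here are assumptions for an HD 3870-class GPU.
spus = 64           # VLIW SPUs, each containing 5 scalar ALUs
alus_per_spu = 5
clock_hz = 775e6    # assumed core clock
flops_per_mad = 2   # one multiply-add counts as two flops

# Single precision: every ALU can issue one MAD per clock.
sp_peak = spus * alus_per_spu * flops_per_mad * clock_hz
print(f"SP peak: {sp_peak / 1e9:.0f} GFLOPS")   # 496 -- the "500 GFLOPS" figure

# Double precision: one DP MAD occupies the four x/y/z/w ALUs of an SPU,
# so each SPU retires only one DP MAD per clock.
dp_peak = spus * 1 * flops_per_mad * clock_hz
print(f"DP peak: {dp_peak / 1e9:.0f} GFLOPS")   # 99 -- "about 100 GFLOPS"
```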

The problem with double precision is that it uses all four ALUs in the SIMD to perform the operation, which automatically limits it to 1/4 the performance. The transcendentals run at 1/5 to 1/15 of the MAD rate, depending on how they are expanded. One way to see this is to disassemble a kernel that uses a transcendental and look at which instructions are used to compute it. Another way is to run one long ALU kernel built from MAD instructions and another that uses a transcendental instruction, and compare the times; the multiplier between them gives the ratio. The numbers above are estimates, but it's not hard to get the exact ratios through some simple kernels.
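The timing approach can be sketched like this. This is only a CPU stand-in to illustrate the measurement idea (time a long MAD loop against the same loop with a transcendental substituted, then take the ratio); on the GPU you would run the same comparison as two kernels, and the Python-level ratio here will not match the hardware's:

```python
import math
import time

N = 1_000_000

def mad_loop():
    # Long chain of multiply-adds, serially dependent so nothing is skipped.
    x = 0.5
    for _ in range(N):
        x = x * 1.000001 + 1e-9
    return x

def sin_loop():
    # Same loop, but with a transcendental in place of the multiply.
    x = 0.5
    for _ in range(N):
        x = math.sin(x) + 1e-9
    return x

t0 = time.perf_counter(); mad_loop(); t_mad = time.perf_counter() - t0
t0 = time.perf_counter(); sin_loop(); t_sin = time.perf_counter() - t0
print(f"sin/mad time ratio: {t_sin / t_mad:.1f}x")
```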