instruction group throughput

Discussion created by sgratton on Apr 10, 2008
Latest reply on Apr 10, 2008 by sgratton
Hi there,

To understand the theoretical performance of a kernel: is it true that each SPU basically executes one ALU instruction group per clock cycle? (Or four every four clock cycles, or something similar, to account for elements being processed in groups?) That assumption does yield the 500 GFLOPS float4 multiply-add figure. But does it also hold for double-precision multiply-add (which would then give about 100 GFLOPS), and for instruction groups in which the transcendental unit is doing something "complicated" like a sine, square root, or reciprocal?
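
For reference, the arithmetic behind those two figures can be sketched roughly as below. The hardware numbers are assumptions for illustration, not taken from any datasheet quoted in this thread: 64 execution units, each issuing one instruction group per clock with five single-precision lanes, an 800 MHz clock, and one double-precision multiply-add per unit per clock. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(units, ops_per_unit_per_cycle, flops_per_op, clock_mhz):
    """Theoretical peak in GFLOPS, assuming one instruction group
    issues per unit on every clock cycle (the premise being asked about)."""
    return units * ops_per_unit_per_cycle * flops_per_op * clock_mhz / 1000


# Assumed (hypothetical) hardware parameters, not stated in the post.
UNITS = 64          # execution units, each issuing one instruction group
CLOCK_MHZ = 800     # core clock
FLOPS_PER_MAD = 2   # a multiply-add counts as two floating-point operations

# Single precision: assume five lanes per unit, each doing one multiply-add.
single = peak_gflops(UNITS, 5, FLOPS_PER_MAD, CLOCK_MHZ)

# Double precision: assume the lanes combine into one multiply-add per unit.
double = peak_gflops(UNITS, 1, FLOPS_PER_MAD, CLOCK_MHZ)

print(f"single-precision peak: {single:.1f} GFLOPS")
print(f"double-precision peak: {double:.1f} GFLOPS")
```

Under these assumptions the single-precision result lands near 500 GFLOPS and the double-precision result near 100 GFLOPS, a ratio of five to one. Whether the one-group-per-clock premise really holds for double precision and for transcendental operations is exactly the open question.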