To understand the theoretical peak performance of a kernel: is it true that each SPU basically issues one ALU instruction group per clock cycle? (Or four every four clock cycles, or something similar, once you take into account that elements are processed in groups?) That rate does yield the 500 GFLOPS float4 multiply-add figure. But does it also hold for double-precision multiply-add (which would then give about 100 GFLOPS), and for instruction groups in which the transcendental unit is doing something "complicated" like a sin, square root, or reciprocal?
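For reference, the arithmetic behind those two figures can be sketched as below. The lane count (320) and clock (775 MHz) are illustrative assumptions, not numbers taken from this post. The 5:1 single-to-double ratio is only inferred from the 500 vs. 100 GFLOPS figures above. The sketch assumes one multiply-add issued per lane per cycle, which is exactly the assumption being asked about.

```python
def peak_gflops(units, clock_ghz, mads_per_unit_per_cycle=1, flops_per_mad=2):
    """Peak GFLOPS if every unit issues the given number of MADs each cycle."""
    return units * mads_per_unit_per_cycle * flops_per_mad * clock_ghz


# Assumed, illustrative hardware figures (not stated in the post).
SCALAR_LANES = 320       # hypothetical number of single-precision ALU lanes
CLOCK_GHZ = 0.775        # hypothetical engine clock
LANES_PER_DP_MAD = 5     # assumption inferred from the 500:100 GFLOPS ratio

# Single precision: every lane issues one MAD (2 flops) per cycle.
single = peak_gflops(SCALAR_LANES, CLOCK_GHZ)

# Double precision: one MAD per group of LANES_PER_DP_MAD lanes per cycle.
double = peak_gflops(SCALAR_LANES // LANES_PER_DP_MAD, CLOCK_GHZ)

print(f"single-precision MAD peak: {single:.1f} GFLOPS")
print(f"double-precision MAD peak: {double:.1f} GFLOPS")
```

If transcendental instructions or double-precision groups take more than one cycle to issue, the real peak would be lower; that could be modelled by passing a fractional `mads_per_unit_per_cycle`.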