Thanks Micah,
I understood that the FLOPS rate for DP mul is 1/4 of that for SP mul, for the reason you mention. After your reply I now also see that e.g. b=sin(a) in Brook disassembles into multiple r600 instructions, only one of which is the SIN itself. However, the thought that started this question was pretty simple: is a DP mad as quick as a DP mul?
This led to the general question about whether all instruction groups complete in 1 cycle. What I was trying to get at was that if I look at the r600 disassembly in the shader analyser, is it guaranteed that each instruction group takes one cycle, no matter what the detailed contents of the instruction group are?
E.g., cutting and pasting from random kernels, does
27 t: SQRT_IEEE R0.x, R0.w
or
38 x: MULADD_64 R0.x, R0.y, R2.y, R1.y
   y: MULADD_64 R0.y, R0.y, R2.y, R1.y
   z: MULADD_64 R3.z, R0.y, R2.y, R1.y
   w: MULADD_64 R3.w, R0.x, R2.x, R1.x
take just the same amount of time as
37 x: MUL_64 R123.x, R1.y, R2.y
   y: MUL_64 R123.y, R1.y, R2.y
   z: MUL_64 ____, R1.y, R2.y
   w: MUL_64 ____, R1.x, R2.x
or
6 x: MULADD R123.x, -PV(5).x, KC0[6].y, R127.z
  y: MULADD R123.y, -PV(5).y, KC0[6].x, R127.z
  z: MULADD R123.z, -PV(5).z, KC0[6].z, R127.z
  w: MULADD R123.w, -PV(5).w, KC0[6].w, R127.z
i.e. one clock cycle, to execute?
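To make the assumption concrete, here is a rough Python sketch of the estimate I would like to be able to make from the disassembly. It is purely illustrative: parse_groups and estimated_cycles are my own hypothetical names, not part of any AMD tool, and the "one cycle per instruction group" premise baked into it is exactly the thing I am asking about, not an established fact.

```python
import re

# A line such as "38 x: MULADD_64 R0.x, ..." opens a new instruction group;
# a line such as "y: MULADD_64 ..." continues the current group.
GROUP_START = re.compile(r"^\s*(\d+)\s+([xyzwt]):\s+(\S+)")
SLOT_ONLY = re.compile(r"^\s*([xyzwt]):\s+(\S+)")


def parse_groups(text):
    """Return a list of (group_number, [(slot, opcode), ...]) tuples."""
    groups = []
    for line in text.splitlines():
        m = GROUP_START.match(line)
        if m:
            groups.append((int(m.group(1)), [(m.group(2), m.group(3))]))
            continue
        m = SLOT_ONLY.match(line)
        if m and groups:
            groups[-1][1].append((m.group(1), m.group(2)))
    return groups


def estimated_cycles(groups, cycles_per_group=1):
    # ASSUMPTION (the open question): every instruction group costs the
    # same number of cycles, whatever opcodes it contains.
    return len(groups) * cycles_per_group


# Two of the groups quoted above, pasted as-is.
SAMPLE = """\
27 t: SQRT_IEEE R0.x, R0.w
37 x: MUL_64 R123.x, R1.y, R2.y
y: MUL_64 R123.y, R1.y, R2.y
z: MUL_64 ____, R1.y, R2.y
w: MUL_64 ____, R1.x, R2.x
"""

groups = parse_groups(SAMPLE)
cycles = estimated_cycles(groups)
```

If the premise holds, counting groups like this would be enough to compare a DP mad kernel against a DP mul kernel; if it does not, cycles_per_group would have to become a per-opcode lookup instead of a constant.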
(Unfortunately I am waiting for 1.0 and Linux support before I can actually run on a 3870, and at the moment I am developing software only on my XP laptop, so I can't just test it and see...)
Best,
Steven.