The problem with double precision is that it uses all four ALUs in the SIMD to perform the operation, which automatically limits it to 1/4 the performance. The transcendentals are 1/5 to 1/15 the speed, depending on how they are expanded. One way to see this is to disassemble a kernel that uses a transcendental and see which instructions it uses to compute it. Another way is to run a long ALU kernel with MAD instructions and one with a transcendental instruction and look at the multiplier difference between the times. The numbers above are estimates, but it's not hard to get the exact ratios through some simple kernels.
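The second method above reduces to simple arithmetic once the two kernel times are in hand. A minimal sketch of that calculation (the function name, timings, and the overhead correction are hypothetical placeholders, not from any real measurement):

```python
def trans_cost_multiplier(t_trans, t_mad, t_empty=0.0):
    """Estimate how many MAD-equivalent slots one transcendental costs.

    t_trans, t_mad: wall times of two kernels that are identical
    except one loops on a transcendental and one on MADs.
    t_empty: optional time of an empty kernel, used to cancel launch
    and loop overhead (a hypothetical correction, defaulting to 0).
    """
    return (t_trans - t_empty) / (t_mad - t_empty)

# Made-up numbers: 10.5 ms vs 2.5 ms, with 0.5 ms fixed overhead,
# implies the transcendental costs 5 MAD-equivalent slots.
print(trans_cost_multiplier(10.5, 2.5, 0.5))  # -> 5.0
```

A multiplier of 5-15 measured this way would line up with the 1/5 to 1/15 estimate above.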
I understood that the FLOPS for a DP mul is 1/4 that of an SP mul for the reason you mention. After your reply I now see that e.g. b = sin(a) in Brook disassembles into multiple R600 instructions, only one of which is the SIN instruction. However, the starting point for this question was pretty simple: is a DP mad as quick as a DP mul?
This led to the general question of whether all instruction groups complete in one cycle. What I was trying to get at was: if I look at the R600 disassembly in the Shader Analyzer, is it guaranteed that each instruction group takes one cycle, no matter what the detailed contents of the instruction group are?
e.g. cutting and pasting from random kernels, does
27 t: SQRT_IEEE R0.x, R0.w
38 x: MULADD_64 R0.x, R0.y, R2.y, R1.y
y: MULADD_64 R0.y, R0.y, R2.y, R1.y
z: MULADD_64 R3.z, R0.y, R2.y, R1.y
w: MULADD_64 R3.w, R0.x, R2.x, R1.x
take just the same amount of time as
37 x: MUL_64 R123.x, R1.y, R2.y
y: MUL_64 R123.y, R1.y, R2.y
z: MUL_64 ____, R1.y, R2.y
w: MUL_64 ____, R1.x, R2.x
6 x: MULADD R123.x, -PV(5).x, KC0.y, R127.z
y: MULADD R123.y, -PV(5).y, KC0.x, R127.z
z: MULADD R123.z, -PV(5).z, KC0.z, R127.z
w: MULADD R123.w, -PV(5).w, KC0.w, R127.z
i.e. one clock cycle, to execute?
(Unfortunately I am waiting for 1.0 and Linux support to actually run on a 3870, and at the moment am developing software only on my XP laptop, so I can't just test it and see...)
Yes, each of those ALU instruction groups takes 1 cycle to execute. The major difference in performance comes from the fact that a double op takes up four stream cores and the sqrt can only be run on the T stream core. DMAD and DMUL run at the same speed, but DADD runs at twice the speed.
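That one-cycle-per-instruction-group rule can be expressed as a toy cycle counter over a Shader Analyzer-style listing. The parsing below is a rough sketch keyed to listings like the ones quoted above (the helper name is made up), relying on the fact that each new group starts with a group number while continuation slots do not:

```python
def count_alu_cycles(disasm: str) -> int:
    """Count VLIW instruction groups in an R600 ALU listing.
    Each numbered line opens a new group (one cycle); continuation
    slot lines (y:/z:/w:/t: without a group number) are free."""
    cycles = 0
    for line in disasm.splitlines():
        line = line.strip()
        if line and line.split()[0].isdigit():
            cycles += 1
    return cycles

# The two groups quoted earlier in the thread: 2 cycles total.
listing = """\
37  x: MUL_64    R123.x, R1.y, R2.y
    y: MUL_64    R123.y, R1.y, R2.y
    z: MUL_64    ____,   R1.y, R2.y
    w: MUL_64    ____,   R1.x, R2.x
38  x: MULADD_64 R0.x, R0.y, R2.y, R1.y
    y: MULADD_64 R0.y, R0.y, R2.y, R1.y
    z: MULADD_64 R3.z, R0.y, R2.y, R1.y
    w: MULADD_64 R3.w, R0.x, R2.x, R1.x
"""
print(count_alu_cycles(listing))  # -> 2
```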
Thanks for the info Micah.
So in terms of peak GFLOPS, a 3870 is about 100 for dmad, 50 for dmul and 100 for dadd then. Not bad! As well as the MULADD_64s being single-cycle, I guess I'm rather impressed that the trans unit can do "complicated" things like square roots as quickly as "simple" things like multiplications!
Just a little clarification, dmad and dmul are 50ish, dadd is twice that.
I was counting one dmad as two floating point operations, which I think makes us agree!
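The peak figures in this exchange can be reproduced with back-of-the-envelope arithmetic. A sketch, assuming an HD 3870 (RV670) at 775 MHz with 320 stream cores arranged as 64 five-wide VLIW units (the clock and unit counts are my assumptions, not stated in the thread):

```python
# Assumed HD 3870 parameters (not from the thread): 775 MHz core
# clock, 320 stream cores arranged as 64 five-wide VLIW units.
CLOCK_GHZ = 0.775
VLIW_UNITS = 320 // 5  # 64

# A double op occupies four of the five lanes, so each VLIW unit
# retires one DP mad or mul per clock; dadd runs at twice that rate.
dmad_gflops = VLIW_UNITS * CLOCK_GHZ * 2      # one mad = 2 flops
dmul_gflops = VLIW_UNITS * CLOCK_GHZ * 1
dadd_gflops = VLIW_UNITS * CLOCK_GHZ * 2 * 1  # 2 adds/clock, 1 flop each

print(round(dmad_gflops, 1), round(dmul_gflops, 1), round(dadd_gflops, 1))
```

Counting a dmad as two flops gives the ~100/50/100 GFLOPS split quoted above; counting instructions instead gives the 50ish dmad/dmul and double-rate dadd figures from Micah's clarification, so the two posts do agree.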