
sgratton
Adept I

instruction group throughput

Hi there,

To understand the theoretical performance of a kernel, is it true that each spu basically performs one alu instruction group per clock cycle? (Or four every four clock cycles or something taking into account elements being processed in groups?) This does yield the 500GFLOPs float4 multiply-add performance. But does it also hold in particular for double precision multiply-add (which would then give about 100GFLOPS), and for instruction groups in which the transcendental unit is doing something "complicated" like a sin, square root, or reciprocal?
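For reference, the 500 GFLOPs figure follows from a quick back-of-the-envelope calculation. The hardware numbers below are the commonly quoted HD 3870 specs, not stated in this thread, so treat them as assumptions:

```python
# Rough peak single-precision FLOPS arithmetic for the figure quoted above.
# Assumed HD 3870 numbers: 320 stream cores (64 SPUs x 5 ALUs) at 775 MHz.
STREAM_CORES = 320
CLOCK_GHZ = 0.775
FLOPS_PER_MAD = 2          # a multiply-add counts as two floating-point ops

sp_peak_gflops = STREAM_CORES * CLOCK_GHZ * FLOPS_PER_MAD
print(sp_peak_gflops)      # ~496, i.e. the "500 GFLOPs" figure
```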

Thanks,
Steven.
6 Replies

Sgratton,
The problem with double precision is that it uses all four ALUs in the SIMD to perform the operation, which automatically limits it to 1/4 the performance. The transcendentals are 1/5 to 1/15 the speed, depending on how they are expanded. One way to see this is to disassemble a kernel that uses a transcendental and see which instructions are used to compute it. Another way is to run a long ALU kernel with mad instructions and one with a transcendental instruction and look at the multiplier between the times. The numbers above are estimates, but it's not hard to get the exact ratios through some simple kernels.
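The timing-ratio measurement described above can be sketched as a small host-side calculation. The function and the numbers plugged in below are illustrative placeholders, not measurements from this thread:

```python
# Sketch of the timing-ratio method: run one kernel that executes only MADs
# and one that executes only the transcendental, then compare per-iteration
# times. The example inputs are made-up placeholder timings.
def slowdown_factor(t_mad_s, t_trans_s, iters_mad, iters_trans):
    """Per-operation cost of the transcendental relative to a MAD."""
    return (t_trans_s / iters_trans) / (t_mad_s / iters_mad)

# e.g. if 1e6 MADs take 1.0 ms and 1e6 sins take 8.0 ms,
# a sin costs about 8 MAD-equivalents
print(slowdown_factor(1.0e-3, 8.0e-3, 1_000_000, 1_000_000))
```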

Thanks Micah,

I understood that the FLOPS for a DP mul is 1/4 that of an SP mul for the reason you mention. After your reply I now see that, e.g., b=sin(a) in Brook disassembles into multiple R600 instructions, only one of which is the SIN instruction. However, the starting thought behind this question was simpler: is a DP mad as quick as a DP mul?

This led to the general question of whether all instruction groups complete in one cycle. What I was trying to get at is this: if I look at the R600 disassembly in the Shader Analyzer, is it guaranteed that each instruction group takes one cycle, no matter what the detailed contents of the group are?

e.g. cutting and pasting from random kernels, does

27 t: SQRT_IEEE R0.x, R0.w

or

38 x: MULADD_64 R0.x, R0.y, R2.y, R1.y
y: MULADD_64 R0.y, R0.y, R2.y, R1.y
z: MULADD_64 R3.z, R0.y, R2.y, R1.y
w: MULADD_64 R3.w, R0.x, R2.x, R1.x

take just the same amount of time as

37 x: MUL_64 R123.x, R1.y, R2.y
y: MUL_64 R123.y, R1.y, R2.y
z: MUL_64 ____, R1.y, R2.y
w: MUL_64 ____, R1.x, R2.x

or

6 x: MULADD R123.x, -PV(5).x, KC0[6].y, R127.z
y: MULADD R123.y, -PV(5).y, KC0[6].x, R127.z
z: MULADD R123.z, -PV(5).z, KC0[6].z, R127.z
w: MULADD R123.w, -PV(5).w, KC0[6].w, R127.z

, i.e. one clock cycle, to execute?

(Unfortunately I am waiting for 1.0 and Linux support before I can actually run on a 3870; at the moment I am developing only on my XP laptop, so I can't just test it and see...)

Best,
Steven.

Yes, each of those instruction groups takes one cycle to execute. The major difference in performance comes from the fact that a double-precision op takes up four stream cores and the sqrt can only be run on the T stream core. DMAD and DMUL run at the same speed, but DADD runs at twice the speed.

Thanks for the info Micah.

So in terms of peak GFLOPS, a 3870 is about 100 for dmad, 50 for dmul and 100 for dadd then. Not bad! As well as the MULADD_64s being single-cycle, I'm rather impressed that the trans unit can do "complicated" things like square roots as quickly as "simple" things like multiplies!

Best,
Steven.

Just a little clarification: dmad and dmul are 50ish, dadd is twice that.

Hi Micah,

I was counting one dmad as two floating-point operations, so I think we agree!
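The two ways of counting can be reconciled with a little arithmetic. The hardware numbers below are the commonly quoted HD 3870 specs (64 SPUs at 775 MHz, one double-precision MAD per SPU per clock), assumed here rather than taken from the thread:

```python
# Reconciling the counts: ~50 G double ops/s for DMAD ("50ish"), but counting
# each DMAD as two floating-point operations gives ~100 GFLOPS.
# Assumed HD 3870 numbers: 64 SPUs at 775 MHz, 1 DP MAD per SPU per clock.
SPUS = 64
CLOCK_GHZ = 0.775

dmad_ops_per_s = SPUS * CLOCK_GHZ      # ~49.6 G DMADs/s  (the "50ish")
dmad_gflops = dmad_ops_per_s * 2       # ~99.2 GFLOPS     (the "about 100")
print(dmad_ops_per_s, dmad_gflops)
```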

Best,
Steven.