In the document 'AMD Accelerated Parallel Processing OpenCL Programming Guide', Table 5.3 gives the instructions per cycle (IPC) ratings for various instructions, from which we can calculate the peak FLOPS for both single and double precision. Using the table, I calculate the double precision peak FLOPS as

dp_add_flops = total_alu_count * clock_rate * dp_add_ipc

= 2048 * 1.05 GHz * 0.5

= 1.0752 TFLOPS

which is roughly in line with the advertised performance. However, for single precision I have

sp_add_flops = total_alu_count * clock_rate * sp_add_ipc

= 2048 * 1.05 GHz * 4

= 8.6016 TFLOPS

which is exactly double the advertised performance. What am I missing here? If the single precision add IPC is reduced to 2, then the numbers are spot on; however, that does not agree with the specs in the document identified above. Also, is there a place where I can find very detailed hardware specifications for my card specifically?

Thanks

~ry

I think the table is probably correct, but a little interpretation is needed.

Cards like Tahiti 7970, 7950 are "Full Speed Double Precision devices", so they are in the right column.

Full Speed only means the best the architecture can do, no specific speed.

The word "cycle" here means 4 clocks. Stream processors issue a wave in 4 parts with a minimum 4 clock latency, which is considered 1 cycle. However, instructions can be issued on each clock of a cycle, thus 4 insns/cycle.

Most FP instructions are 4/cycle (1/clock), which is impressive for FMA and the like. The transcendentals (rcp, sin, log, sqrt, rsqrt) are 1/cycle. Most all DP is 1/cycle except ADD, where they manage to squeak out 2. (Note they choose ADD to calculate a performance for DP.)

Also, "peak" performance is almost always based on multiply + add insns (MAD), which count as 2 operations per instruction, which gives a FACTOR of 2. Using clocks, not cycles, peak performance would be:

(1 insn/clock) * 2048 * 1.0e9 * FACTOR = 4.096 TFLOPS for most FP and int.

(1/4 insn/clock) * 2048 * 1.0e9 * FACTOR = 1.024 TFLOPS for DP.

Using cycles would be 4 or 1 insn/cycle and a cycle speed of 0.25e9, which comes out the same.
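As a quick check of the arithmetic, here is a minimal Python sketch that computes the peak figures both ways. The constant and function names are my own, and the 1.0e9 clock is the round figure used in this post, not the card's actual 1.05 GHz.

```python
# Figures taken from this thread; names are illustrative only.
ALU_COUNT = 2048        # stream processors on the card
CLOCK_HZ = 1.0e9        # round clock rate used in the post (assumption)
CLOCKS_PER_CYCLE = 4    # one "cycle" = 4 clocks
MAD_FACTOR = 2          # a multiply-add counts as 2 floating-point operations


def peak_flops_from_clocks(insns_per_clock):
    """Peak FLOPS when the issue rate is given per clock."""
    return insns_per_clock * ALU_COUNT * CLOCK_HZ * MAD_FACTOR


def peak_flops_from_cycles(insns_per_cycle):
    """Peak FLOPS when the issue rate is given per 4-clock cycle."""
    cycle_hz = CLOCK_HZ / CLOCKS_PER_CYCLE  # 0.25e9 cycles per second
    return insns_per_cycle * ALU_COUNT * cycle_hz * MAD_FACTOR


sp_clock = peak_flops_from_clocks(1.0)    # most FP and int
dp_clock = peak_flops_from_clocks(0.25)   # DP
sp_cycle = peak_flops_from_cycles(4)
dp_cycle = peak_flops_from_cycles(1)

print(f"SP: {sp_clock / 1e12:.3f} TFLOPS (clocks), {sp_cycle / 1e12:.3f} TFLOPS (cycles)")
print(f"DP: {dp_clock / 1e12:.3f} TFLOPS (clocks), {dp_cycle / 1e12:.3f} TFLOPS (cycles)")
```

Both conventions give the same result as long as the rate and the frequency use the same unit. Pairing a per-cycle rate with a per-clock frequency is what inflates the figure.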

Basic int operations are 4/cycle, with the big exception that the 32 bit mul and mad reduce to 1/cycle. However, there are 24 bit accuracy versions of mad and mul that run at 4 insns/cycle. Presumably the reason is that the 24 bit insns use the fast FP multipliers, which only have to multiply 24 bit mantissas. At least that was always my guess.
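To illustrate what "24 bit accuracy" means in practice, here is a rough Python model of an unsigned 24 bit multiply next to a full 32 bit one. The helper names and the masking behaviour are my own assumptions for illustration, not a description of the hardware. As far as I know, OpenCL leaves the result of mul24 implementation-defined when the operands do not fit in 24 bits.

```python
MASK24 = (1 << 24) - 1   # low 24 bits
MASK32 = (1 << 32) - 1   # low 32 bits


def mul32(x, y):
    """Full unsigned 32 bit multiply, keeping the low 32 bits of the product."""
    return (x * y) & MASK32


def mul24_model(x, y):
    """Hypothetical model: only the low 24 bits of each operand are multiplied."""
    return ((x & MASK24) * (y & MASK24)) & MASK32


# Operands that fit in 24 bits: both multiplies agree.
a, b = 1_000_000, 3_000
in_range_match = mul24_model(a, b) == mul32(a, b)

# An operand that needs 25 bits: the modelled 24 bit multiply drops the top bit.
big = (1 << 24) + 1
out_of_range_match = mul24_model(big, 3) == mul32(big, 3)

print(in_range_match, out_of_range_match)
```

So the 24 bit versions are only a safe substitute when you know your values stay within 24 bits, which is often the case for things like array indices.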

Edit: fixed the ambiguous phrase "many instructions can be issued on each clock thus 4 insns/cycle".