Instruction throughput clarification

Question asked by zoska on Nov 16, 2014
I am reading AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf, and on page 6-24 (page 126 of the PDF) there is Table 6.3, Instruction Throughput (Operations/Cycle for Each Stream Processor). For example, for One Quarter-Double-Precision-Speed Devices it lists the SPFP MAD Rate (Operations/Cycle) for each Stream Processor as 4. That confuses me. Does it mean that each ALU can effectively complete 8 FLOPs per cycle? A single MAD instruction counts as 2 FLOPs, so if the table means 4 MADs per cycle per ALU, would a whole Radeon HD 7970 GHz Edition then have an estimated peak computational throughput of 2048 (ALUs) * 1 (GHz) * 8 (FLOP / cycle / ALU) ≈ 16.4 TFLOPS?
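
To make the arithmetic concrete, here is a minimal Python sketch of the estimate above. The function name `peak_tflops` and its parameters are just names I made up for illustration. The second call, with 1 MAD per cycle per ALU, is only an assumed alternative reading for comparison and is not a value taken from the guide.

```python
def peak_tflops(alus, clock_ghz, mads_per_cycle_per_alu, flops_per_mad=2):
    """Estimate peak throughput in TFLOPS.

    ALUs * clock (GHz) * MADs per cycle per ALU * FLOPs per MAD gives GFLOPS;
    dividing by 1000 converts to TFLOPS.
    """
    gflops = alus * clock_ghz * mads_per_cycle_per_alu * flops_per_mad
    return gflops / 1000.0


ALUS = 2048      # stream processors on the Radeon HD 7970 GHz Edition
CLOCK_GHZ = 1.0  # clock used in the estimate above

# Reading the table value literally: 4 MADs per cycle per ALU.
table_reading = peak_tflops(ALUS, CLOCK_GHZ, mads_per_cycle_per_alu=4)

# Assumed alternative reading for comparison: 1 MAD per cycle per ALU.
one_mad_reading = peak_tflops(ALUS, CLOCK_GHZ, mads_per_cycle_per_alu=1)

print(f"4 MAD/cycle/ALU: {table_reading:.3f} TFLOPS")
print(f"1 MAD/cycle/ALU: {one_mad_reading:.3f} TFLOPS")
```

The two readings differ by a factor of 4, which is why the interpretation of the table value matters so much for the estimate.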


The table with Instruction Throughput for Evergreen and Northern Islands Devices makes much more sense.


So, to recap: how should the numbers in the Instruction Throughput tables be interpreted?