5 Replies Latest reply on Nov 18, 2014 3:38 AM by nan

# Instruction throughput clarification

Hi,

I am reading AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf, and on page 6-24 (page 126 in absolute terms) there is Table 6.3, "Instruction Throughput (Operations/Cycle for Each Stream Processor)". For One Quarter-Double-Precision-Speed Devices, for example, it lists the SPFP MAD rate (operations/cycle) for each stream processor as 4. That confuses me. Does it mean that each ALU can effectively complete 8 FLOPs per cycle? A single MAD instruction counts as 2 FLOPs, but if the table shows 4 MADs per ALU, then, e.g., a whole Radeon HD 7970 GHz Edition would have an estimated peak computational throughput of 2048 (ALUs) * 1 (GHz) * 8 (FLOP / cycle / ALU) ≈ 16.4 TFLOPS?
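For reference, the arithmetic behind that estimate can be written as a small Python sketch. The function name `peak_flops` and its parameters are made up for illustration; the inputs (2048 ALUs, 1 GHz, 4 MAD/cycle, 2 FLOPs per MAD) are the numbers above, taking the table value literally as a per-ALU rate, which is exactly the reading in question.

```python
def peak_flops(alus, clock_hz, mads_per_cycle_per_alu, flops_per_mad=2):
    """Peak FLOP/s if every ALU retires the given number of MADs each cycle."""
    return alus * clock_hz * mads_per_cycle_per_alu * flops_per_mad


# Literal reading of Table 6.3: 4 MADs per cycle per ALU (the assumption being questioned).
literal = peak_flops(alus=2048, clock_hz=1_000_000_000, mads_per_cycle_per_alu=4)

print(literal / 1e12, "TFLOPS")
```

Under that literal reading the result is about 16.4 TFLOPS, which is the number that looks implausible.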

The table with Instruction Throughput for Evergreen and Northern Islands Devices makes much more sense.

So, to recap: how should the numbers in the Instruction Throughput tables be interpreted?

• ###### Re: Instruction throughput clarification

Hi,

Simply divide the numbers in the table by 4, i.e. the maximum throughput of one PE is 1 MAD/cycle. Please consider reporting the erroneous documentation in the feedback discussion "How is AMD doing for developers?", because inconsistencies are very common.

-- NaN

• ###### Re: Instruction throughput clarification

Thanks for the response.

Actually, I've spotted some answers to that question in other discussions, but even then it's not clear to me.

Wikipedia's "List of AMD graphics processing units" page says:

Double precision performance of Hawaii is 1/8 of single precision performance,[23] Tahiti is 1/4 of single precision performance, others 28 nm chip is 1/16 of single precision performance.

And in the PDF's example throughput calculation, they took the left value for the Radeon HD 7970 and did not divide it by 4. So I think I can't just divide everything by 4.

Agner Fog, on his site (agner.org/optimize), provides tables for CPU instructions that list both latencies and reciprocal throughputs. That is very clear to me: I don't need to divide anything, and I can easily estimate performance for both dependent and independent instruction streams.

• ###### Re: Instruction throughput clarification

Hi,

The document makes the following computation for the double-precision throughput of an HD 7970: ".5*2048*925 MHz". So they used the factor 1/2, which, according to the corrected table, is 2/4 ADD/cycle = 1/2 flop/cycle or 1/4 MAD/cycle = 1/2 flop/cycle. In my experience, the ratios between the values in the table seem to be correct, i.e. 32-bit integer multiplies are much slower than 24-bit multiplies, and all other simple 32-bit operations are processed at full speed. The throughput of some 64-bit integer operations would be interesting, too. I cannot make a precise statement for Hawaii-based cards, but they seem to be similar to Tahiti apart from the halved double-precision throughput, because I haven't observed throughput differences between Hawaii and Tahiti when using 32-bit operations.
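As a quick check of that factor, here is a minimal Python sketch. The constant and variable names are only illustrative; the table entries (2 for DP ADD, 1 for DP MAD, each divided by 4) and the 2048 ALUs at 925 MHz are the values quoted above.

```python
from fractions import Fraction

ALUS = 2048
CLOCK_HZ = 925_000_000  # 925 MHz, as in the guide's example

# Table entries divided by 4, i.e. the "corrected" per-ALU rates.
dp_add_per_cycle = Fraction(2, 4)  # one ADD counts as 1 flop
dp_mad_per_cycle = Fraction(1, 4)  # one MAD counts as 2 flops

flop_per_cycle_add = dp_add_per_cycle * 1
flop_per_cycle_mad = dp_mad_per_cycle * 2

# Both routes give the factor 1/2 used in the guide's ".5*2048*925 MHz".
dp_peak = flop_per_cycle_mad * ALUS * CLOCK_HZ

print(flop_per_cycle_add, flop_per_cycle_mad)
print(float(dp_peak) / 1e9, "GFLOPS")
```

Both the ADD and the MAD route give 1/2 flop per cycle per ALU, so the guide's example works out to roughly 947 GFLOPS in double precision.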

ALU latencies don't matter on a GPU because you need to hide other latencies, such as memory accesses, with many other threads anyway. My guess is that the effective ALU latency is (at least for the most common instructions) 4 * reciprocal throughput, i.e. the minimum latency is 4 cycles, but the latency increases if the result is needed by another unit. Also, most GPUs are strictly in-order, so you don't need to worry about dependent versus independent instruction streams. Use instructions with high throughput, write SIMD-friendly code, and access memory efficiently.

-- NaN

• ###### Re: Instruction throughput clarification

The table treats a "stream processor" as 4 compute elements grouped together.

Hence, to find the GPU's theoretical peak for a specific instruction type, you need to do the following calculation: value from the table * 4 stream processors per SIMD * 4 SIMDs per CU * the GPU's CU count * the frequency.
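Written out as a minimal Python sketch, that calculation looks like this. The function and parameter names are invented for illustration. The example inputs are taken from elsewhere in this thread: a table value of 4 for SP MAD, 32 CUs for Tahiti, and the 925 MHz clock from the guide's example.

```python
def peak_ops_per_second(table_value, cu_count, clock_hz,
                        stream_processors_per_simd=4, simds_per_cu=4):
    """Peak operations/s: table value * 4 SPs per SIMD * 4 SIMDs per CU * CUs * clock."""
    return (table_value * stream_processors_per_simd * simds_per_cu
            * cu_count * clock_hz)


# SP MAD on a 32-CU part at 925 MHz (illustrative inputs, see lead-in).
sp_mad = peak_ops_per_second(table_value=4, cu_count=32, clock_hz=925_000_000)

print(sp_mad / 1e12, "TMAD/s")
print(2 * sp_mad / 1e12, "TFLOPS")  # one MAD counts as 2 flops
```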

• ###### Re: Instruction throughput clarification

That makes no sense for GCN either. One SIMD has 16 PEs and one CU has 4 SIMDs. Most likely the table contains the number of quarter-wavefront VALU instructions executed per CU per cycle, so that the maximum SP MAD throughput of Tahiti with 32 CUs is 4*16*32*frequency MAD = 2*4*16*32*frequency flop, and my previous interpretation was correct (64/4 = 16). There is also no definition of "stream processor" or "stream processing core" in the Programming Guide.
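A small Python sketch of this interpretation, with illustrative names throughout. It assumes a 64-wide wavefront (implied by "64/4 = 16") and, as an example clock, the 925 MHz used in the guide's HD 7970 calculation.

```python
PES_PER_SIMD = 16
SIMDS_PER_CU = 4
WAVEFRONT_SIZE = 64  # assumption: 64 work-items per wavefront


def sp_mad_peak(table_value, cu_count, clock_hz):
    """Treat the table value as quarter-wavefront VALU instructions per CU per cycle."""
    lanes_per_instruction = WAVEFRONT_SIZE // 4  # a quarter wavefront covers 16 lanes
    mads_per_second = table_value * lanes_per_instruction * cu_count * clock_hz
    return mads_per_second, 2 * mads_per_second  # one MAD counts as 2 flops


mads, flops = sp_mad_peak(table_value=4, cu_count=32, clock_hz=925_000_000)

# Per-PE rate: 4 * 16 MADs per CU per cycle, spread over 16 * 4 PEs.
mads_per_pe_per_cycle = (4 * (WAVEFRONT_SIZE // 4)) // (PES_PER_SIMD * SIMDS_PER_CU)

print(flops / 1e12, "TFLOPS")
print(mads_per_pe_per_cycle, "MAD/cycle per PE")
```

With these inputs each PE ends up at 1 MAD/cycle, which is the same as dividing the table value by 4.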

-- NaN