Archives Discussions

buqchucker · ‎01-04-2013

Hello everybody,

I just did some tests on a FirePro W7000 using

clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64).

Although tuning was not possible, as written in

http://devgurus.amd.com/message/1286114

the performance of sgemm was lower than expected.

According to the sgemm testing available in clMagma

the speed amounted to 900 GFLOPs, although the W7000

is advertised with 2.4 TFLOPs.

Interestingly, the dgemm performance for double precision

was in spec, that is, around 150 GFLOPs.

This was tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C

and AMDAPP-SDK v2.8.

Can anyone give a hint as to why the single precision

performance is so far behind the peak performance?

Regards,

buqchucker

realhet · ‎01-04-2013

Hi,

I guess that a memory bandwidth bottleneck kicks in at single precision.

SP needs half the amount of data per value, but the card can calculate 16x faster in SP. Thus SP needs 8x the memory bandwidth of DP in order to achieve maximum ALU utilization.
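The arithmetic behind the 8x figure can be sketched as follows. This is only a back-of-the-envelope calculation using the peak numbers quoted in this thread (2.4 TFLOPs SP, 150 GFLOPs DP); the variable names are made up for illustration, and it assumes the same number of values is fetched per flop in both precisions.

```python
# Peak figures as quoted in this thread for the FirePro W7000.
SP_PEAK_GFLOPS = 2400.0
DP_PEAK_GFLOPS = 150.0

# Size of one operand in bytes.
SP_BYTES = 4  # float32
DP_BYTES = 8  # float64

# How much faster the ALUs are in SP than in DP.
compute_ratio = SP_PEAK_GFLOPS / DP_PEAK_GFLOPS

# How much less data each SP operand needs compared to DP.
data_ratio = SP_BYTES / DP_BYTES

# Bandwidth SP needs, relative to DP, to keep the ALUs busy
# (assumes the same number of operands fetched per flop).
bandwidth_ratio = compute_ratio * data_ratio

print(f"compute ratio SP:DP   = {compute_ratio:.0f}x")
print(f"data size ratio SP:DP = {data_ratio:.1f}x")
print(f"bandwidth needed      = {bandwidth_ratio:.0f}x")
```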

buqchucker · ‎01-04-2013

Ah, ok, I get the idea.

Indeed, that might be a possibility.

How did you arrive at the 16x factor?

I gather you need 4 times the number of multiplications using

32bit operations instead of 64bit operations. Are there

additional factors to consider?

buqchucker

realhet · ‎01-05-2013

>How did you arrive at the 16x factor?

Just check the card's specifications: 2.4 TFlops SP performance compared to 0.15 TFlops DP performance, which is a factor of 16.

In SP the card uses only a small 24bit multiplier (not 32), together with a 32bit adder, so it can do a float32 MAD in every cycle. It can also add 32bit integers, and there's a special instruction: multiply two 24bit ints and add a 32bit int to the result.

DP needs a lot more circuitry and cannot reuse the SP circuits. Because the card is designed mainly for SP, there have to be separate circuits for DP math.

On the fastest cards the DP:SP performance ratio is 1:4, on the medium cards there are fewer DP units, and on the smallest cards there are no DP units at all.

(Note: a 32bit integer multiply is executed using the DP units, so it is slower than the 24bit int multiply.)
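To make the 24bit multiply-add mentioned above concrete, here is a small Python emulation. It is only a sketch: the function names are hypothetical, it is modelled on what OpenCL exposes as the `mad24` built-in, and it assumes signed inputs that already fit in 24 bits (outside that range the real result is implementation-defined, so the sketch just raises an error). The 24bit width matches the float32 significand (23 stored bits plus the implicit leading 1), which is why a 24bit multiplier is enough for an SP MAD.

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def mad24(x, y, z):
    """Emulate a signed 24-bit multiply followed by a 32-bit add.

    Assumption: x and y must lie in [-2**23, 2**23 - 1]; this sketch
    rejects anything else instead of guessing what hardware would do.
    """
    lo, hi = -(1 << 23), (1 << 23) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("x and y must fit in signed 24 bits")
    # Only the low 32 bits of the product and of the sum are kept.
    return to_int32(x * y + z)


print(mad24(3, 4, 5))
print(mad24(-2, 3, 1))
```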

buqchucker · ‎01-06-2013

Yes, that clarifies things.

Thanks a lot!

developer · ‎01-07-2013

Reaching 50% of peak in SGEMM is itself an achievement.

People struggle for months to reach something like 60 or 65%.

btw, I am talking about the NN variant.

The TN variant produces the fewest flops because of its lower flop/memory fetch ratio.

Try "cgemm": it has a higher flop/memory fetch ratio.

There is a lot of math involved in complex numbers, so it will produce impressive flop numbers.
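A rough way to see why cgemm has the better ratio, sketched in Python. The assumptions here are mine, not from the posts above: square n x n matrices, the usual flop-counting convention (2 flops per real multiply-add, 8 per complex multiply-add), and an idealized memory traffic of reading A and B once and writing C once. Real kernels move more data than that, so only the ratio between the two is meaningful.

```python
def gemm_intensity(n, bytes_per_element, flops_per_mad):
    """Flops per byte for an n x n x n GEMM under idealized traffic.

    Assumption: A and B are each read once and C is written once,
    i.e. 3 * n * n elements cross the memory bus.
    """
    flops = flops_per_mad * n ** 3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


n = 1024
# sgemm: float32 elements, 1 multiply + 1 add per MAD.
sgemm = gemm_intensity(n, bytes_per_element=4, flops_per_mad=2)
# cgemm: complex64 elements (2 floats), 4 multiplies + 4 adds per MAD.
cgemm = gemm_intensity(n, bytes_per_element=8, flops_per_mad=8)

print(f"sgemm: {sgemm:.1f} flops/byte")
print(f"cgemm: {cgemm:.1f} flops/byte")
print(f"cgemm / sgemm = {cgemm / sgemm:.1f}")
```

Under these assumptions each complex element costs twice the bytes but four times the flops, so cgemm does twice as much math per byte fetched.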

buqchucker · ‎01-08-2013

Indeed, I checked cgemm and the performance

peaked at around 1500 GFLOPs.
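For reference, the figures reported in this thread can be turned into a fraction of peak with a few lines of Python. The measured and peak values are the ones quoted in the posts above (the DP peak is the 0.15 TFlops spec figure); the dictionary layout is just an illustration.

```python
# (measured GFLOPs, advertised peak GFLOPs) as quoted in this thread.
results = {
    "sgemm": (900.0, 2400.0),
    "dgemm": (150.0, 150.0),
    "cgemm": (1500.0, 2400.0),
}

# Percentage of the advertised peak reached by each routine.
efficiency = {
    name: 100.0 * measured / peak
    for name, (measured, peak) in results.items()
}

for name, percent in efficiency.items():
    print(f"{name}: {percent:.1f}% of peak")
```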


clAmdBlas/sgemm far from peak performance on FirePro W7000?