Hello everybody,
I just did some tests on a FirePro W7000 using
clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64).
Although tuning was not possible, as written in
http://devgurus.amd.com/message/1286114
the performance of sgemm was lower than expected.
According to the sgemm testing available in clMagma
the speed amounted to 900 GFLOPs, although the W7000
is advertised with 2.4 TFLOPs.
Interestingly, the dgemm performance for double precision
was in spec, that is, around 150 GFLOPs.
This was tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C
and AMDAPP-SDK v2.8.
Can anyone give a hint as to why the single precision
performance is so far behind the peak performance?
Regards,
buqchucker
Hi,
I guess that a memory bandwidth bottleneck kicks in at single precision.
In SP the card needs half the amount of data per operand, but it can calculate 16x faster. Thus SP needs 8x faster memory access than DP in order to achieve maximum ALU utilization.
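To make the arithmetic above explicit, here is a small sketch in Python. It only uses the peak figures quoted in this thread (2.4 TFLOPs SP, 0.15 TFLOPs DP) and the standard 4-byte and 8-byte float sizes. The function name is made up for illustration, and the model assumes the same number of operands per flop in both precisions, which a real GEMM kernel will not match exactly.

```python
# Peak figures quoted in this thread for the W7000, in GFLOP/s.
SP_PEAK_GFLOPS = 2400.0
DP_PEAK_GFLOPS = 150.0

# Size of one operand in bytes (IEEE float32 and float64).
BYTES_SP = 4
BYTES_DP = 8


def required_bandwidth_ratio(sp_peak, dp_peak, sp_bytes=BYTES_SP, dp_bytes=BYTES_DP):
    """How much more memory bandwidth SP needs than DP to keep the ALUs busy.

    Hypothetical helper: assumes both precisions fetch the same number of
    operands per flop, so only the compute rate and operand size differ.
    """
    compute_ratio = sp_peak / dp_peak   # SP runs this many times faster
    data_ratio = sp_bytes / dp_bytes    # but each operand is this much smaller
    return compute_ratio * data_ratio


compute_ratio = SP_PEAK_GFLOPS / DP_PEAK_GFLOPS
bandwidth_ratio = required_bandwidth_ratio(SP_PEAK_GFLOPS, DP_PEAK_GFLOPS)

print(f"SP/DP compute ratio:   {compute_ratio:.0f}x")
print(f"SP/DP bandwidth ratio: {bandwidth_ratio:.0f}x")
```

So the 16x compute advantage combined with half-sized operands gives the 8x bandwidth requirement mentioned above.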
Ah, ok, I get the idea.
Indeed, that might be a possibility.
How did you arrive at the 16x factor?
I gather you need 4 times the number of multiplications using
32bit operations instead of 64bit operations. Are there
additional factors to consider?
buqchucker
>How did you arrive at the 16x factor?
Just check the card's specifications: 2.4 TFLOPs SP performance compared to 0.15 TFLOPs DP performance.
In SP the card uses only a small 24-bit multiplier (not 32-bit) plus a 32-bit adder, so it can do a float32 MAD in every cycle. It can also add 32-bit integers, and there is a special operation that multiplies two 24-bit ints and adds a 32-bit int to the result.
For DP it needs a lot more circuitry, and it cannot reuse the SP circuits. Because the card is designed mainly for SP, there have to be separate circuits for DP math.
On the fastest cards the DP:SP performance ratio is 1:4, on the medium cards there are fewer DP units, and on the smallest cards there are no DP units at all.
(*A 32-bit integer multiply is executed on the DP units, so it is slower than the 24-bit int multiply.)
Yes, that clarifies things.
Thanks a lot!
Reaching 50% of peak in SGEMM is itself an achievement.
People struggle for months to reach something like 60 or 65%.
By the way, I am talking about the NN variant.
The TN variant achieves the lowest flop rate because of its lower flop-to-memory-fetch ratio.
Try "cgemm": it gives a higher flop-to-memory-fetch ratio.
Complex arithmetic involves a lot of math per element and will produce impressive flop numbers.
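A rough way to see why cgemm has the better ratio is to compare flops per byte for square matrices. The sketch below is an idealized model, not a description of how clAmdBlas works: it assumes A and B are read once and C is read and written once, while real kernels tile and re-fetch data. It uses the conventional count of 8 real flops per complex multiply-add (4 multiplies, 4 adds) against 2 for a real multiply-add. The function name is hypothetical.

```python
def gemm_intensity(n, flops_per_mac, bytes_per_element):
    """Flops per byte for an n x n x n GEMM in an idealized traffic model.

    Assumption: A and B are read once, C is read once and written once,
    so 4 * n^2 elements cross the memory bus in total.
    """
    flops = flops_per_mac * n ** 3
    bytes_moved = 4 * n ** 2 * bytes_per_element
    return flops / bytes_moved


n = 4096  # arbitrary example size

# sgemm: real multiply-add = 2 flops, float32 = 4 bytes
sgemm = gemm_intensity(n, flops_per_mac=2, bytes_per_element=4)

# cgemm: complex multiply-add = 8 real flops, complex64 = 8 bytes
cgemm = gemm_intensity(n, flops_per_mac=8, bytes_per_element=8)

print(f"sgemm: {sgemm:.0f} flops/byte")
print(f"cgemm: {cgemm:.0f} flops/byte")
print(f"cgemm/sgemm: {cgemm / sgemm:.1f}x")
```

Under these assumptions cgemm does twice as many flops per byte fetched as sgemm at any matrix size, which fits the suggestion that it is less limited by memory bandwidth.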
Indeed, I checked cgemm and the performance
peaked at around 1500 GFLOPs.
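For reference, the numbers reported in this thread can be put side by side as fractions of the advertised peak. This is only a sketch: the measured values are the approximate figures quoted above, and it assumes the cgemm GFLOPs are counted as real single-precision flops and are therefore compared against the SP peak.

```python
# Advertised peaks quoted in this thread, in GFLOP/s.
PEAK_SP = 2400.0
PEAK_DP = 150.0

# (measured GFLOP/s, peak it is compared against); values as reported above.
measured = {
    "sgemm": (900.0, PEAK_SP),
    "dgemm": (150.0, PEAK_DP),
    "cgemm": (1500.0, PEAK_SP),  # assumption: counted in real SP flops
}

# Fraction of peak reached by each routine.
efficiency = {name: got / peak for name, (got, peak) in measured.items()}

for name, frac in efficiency.items():
    print(f"{name}: {frac:.1%} of peak")
```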