
buqchucker
Adept I

clAmdBlas/sgemm far from peak performance on FirePro W7000?

Hello everybody,

I just did some tests on a FirePro W7000 using clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64).

Although tuning was not possible, as described in http://devgurus.amd.com/message/1286114, the performance of sgemm was lower than expected. According to the sgemm test available in clMagma, the speed amounted to 900 GFLOPS, although the W7000 is advertised with 2.4 TFLOPS.

Interestingly, the dgemm performance for double precision was in spec, that is, around 150 GFLOPS.

This was tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C and AMDAPP-SDK v2.8.

Can anyone give a hint as to why the single-precision performance is so far behind the peak performance?

Regards,

buqchucker

realhet
Miniboss

Hi,

I guess that a memory bandwidth bottleneck kicks in at single precision.

In SP it needs half the amount of data per operand, but it can compute 16x faster. Thus SP needs 8x the memory bandwidth of DP in order to reach maximum ALU utilization.
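The 8x estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the peak figures quoted in this thread (2.4 TFLOPS SP, 0.15 TFLOPS DP). The function name is made up for illustration, and it assumes that the same number of operands is fetched per flop in both precisions.

```python
def bandwidth_demand_ratio(sp_peak_gflops, dp_peak_gflops, sp_bytes=4, dp_bytes=8):
    """Memory bandwidth SP needs, relative to DP, to keep the ALUs busy.

    Assumption: the same number of operands is fetched per flop in SP and DP.
    """
    compute_ratio = sp_peak_gflops / dp_peak_gflops  # how much faster SP math is
    size_ratio = sp_bytes / dp_bytes                 # SP operands are half as large
    return compute_ratio * size_ratio


# Peak figures as quoted in the thread for the W7000.
ratio = bandwidth_demand_ratio(2400.0, 150.0)
print(ratio)
```

In other words, if dgemm just barely saturates memory at its peak, sgemm with the same access pattern would need roughly eight times that bandwidth to do the same.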

Ah, ok, I get the idea. Indeed, that might be a possibility.

How did you arrive at the 16x factor? I gather you need 4 times the number of multiplications using 32bit operations instead of 64bit operations. Are there additional factors to consider?

buqchucker


>How did you arrive at the 16x factor?

Just check the card's specifications: 2.4 TFLOPS SP performance compared to 0.15 TFLOPS DP performance.

In SP the card uses only a small 24bit multiplier (not 32bit) together with a 32bit adder, so it can do a float32 MAD in every cycle. It can also add 32bit integers, and there is a special instruction that multiplies two 24bit ints and adds a 32bit int to the result.

DP needs a lot more circuitry and cannot reuse the SP circuits. Because the card is designed mainly for SP, there have to be separate circuits for DP math.

On the fastest cards the DP:SP performance ratio is 1:4, on the mid-range cards there are fewer DP units, and on the smallest cards there are no DP units at all.

(*A 32bit integer multiply is executed on the DP units, so it is slower than the 24bit int multiply.)
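For reference, the advertised peak numbers can be rebuilt from the hardware parameters. This is only a sketch: the lane count (1280 stream processors), the 950 MHz engine clock, and the 1:16 DP:SP rate are assumptions taken from AMD's public W7000 spec sheet, not from this thread, and the helper name is hypothetical.

```python
def peak_gflops(lanes, clock_ghz, flops_per_lane_per_cycle):
    """Theoretical peak = lanes x clock x flops issued per lane per cycle."""
    return lanes * clock_ghz * flops_per_lane_per_cycle


# Assumed W7000 parameters (not stated in the thread): 1280 lanes at 0.95 GHz.
# One MAD per lane per cycle counts as 2 flops (multiply + add).
sp_peak = peak_gflops(1280, 0.95, 2)

# Assumed DP rate for this card class: 1/16 of the SP rate.
dp_peak = sp_peak / 16

print(f"SP peak ~ {sp_peak:.0f} GFLOPS, DP peak ~ {dp_peak:.0f} GFLOPS")
```

Under these assumptions the result lands on the 2.4 TFLOPS and 0.15 TFLOPS figures quoted above, which is where the 16x factor comes from.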


Yes, that clarifies things.

Thanks a lot!

developer
Adept II

Reaching 50% of peak in SGEMM is itself an achievement. People struggle for months to reach something like 60 or 65%.

By the way, I am talking about the NN variant. The TN variant produces the fewest flops because of its lower flop-to-memory-fetch ratio.

Try "cgemm": it has a higher flop-to-memory-fetch ratio. Complex arithmetic involves a lot of math per element and will produce impressive flop numbers.
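A rough way to see why cgemm has the better ratio is to count flops and bytes for a single multiply-accumulate. The sketch below assumes the usual counting convention (a complex multiply-add is 4 real multiplies plus 4 real adds, so 8 flops) and, for simplicity, that each multiply-accumulate fetches two operands with no reuse. Real kernels reuse data through blocking, but the relative gain between the two routines is the same. The function name is made up for illustration.

```python
def flops_per_byte(flops_per_mac, bytes_per_element, elements_per_mac=2):
    """Flops per byte fetched for one multiply-accumulate (no data reuse assumed)."""
    return flops_per_mac / (bytes_per_element * elements_per_mac)


# sgemm: real MAC = 1 multiply + 1 add, 4-byte floats.
sgemm_intensity = flops_per_byte(flops_per_mac=2, bytes_per_element=4)

# cgemm: complex MAC = 4 multiplies + 4 adds, 8-byte single-precision complex.
cgemm_intensity = flops_per_byte(flops_per_mac=8, bytes_per_element=8)

gain = cgemm_intensity / sgemm_intensity
print(sgemm_intensity, cgemm_intensity, gain)
```

So for the same memory traffic, cgemm has about twice as much arithmetic to do, which makes it less sensitive to a memory bandwidth limit.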


Indeed, I checked cgemm and the performance peaked at around 1500 GFLOPS.
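To put the measurements from this thread side by side, the small sketch below converts them to a fraction of the advertised peak. The measured values are the ones reported above. It assumes the cgemm figure was counted against the same 2.4 TFLOPS SP peak, and the function name is hypothetical.

```python
def fraction_of_peak(measured_gflops, peak_gflops):
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


SP_PEAK = 2400.0  # advertised SP peak, GFLOPS
DP_PEAK = 150.0   # DP peak as quoted in the thread, GFLOPS

results = {
    "sgemm": fraction_of_peak(900.0, SP_PEAK),
    "cgemm": fraction_of_peak(1500.0, SP_PEAK),  # assumes SP peak applies to cgemm
    "dgemm": fraction_of_peak(150.0, DP_PEAK),
}

for name, frac in results.items():
    print(f"{name}: {frac:.1%} of peak")
```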
