
Nikolai_
Journeyman III

OpenCL BLAS (sgemm) performance on Radeon 4000 and 5000 series?

Anyone have any results?

Hello!

I just want to see how close to theoretical FLOPS the Radeon/FireStream cards get... I figure SGEMM results would be a good measure.

0 Likes
66 Replies

I've uploaded the full disassembly code on my site.

Read instructions are merged, as far as I understand. And it uses 25 registers, so I can have 10 threads - is this correct?

Thanks for the detailed explanations. I did not realize a 2D space-filling curve access pattern was used in the prunedtree code. Deciphering ISA is really difficult.

0 Likes

Originally posted by: nnsan And it uses 25 registers, so I can have 10 threads - is this correct?

In theory. But in practice, a thread count that is not a power of 2 can have a negative impact on performance.

Thanks for the detailed explanations. I did not realize a 2D space-filling curve access pattern was used in the prunedtree code. Deciphering ISA is really difficult.

 

True. That's why I've rewritten his 8x8 block code using CAL++.

So the difference between your code and the CAL++ version is only the usage of aoffimmi. This means you should have similar performance.

0 Likes

Originally posted by: nnsan And it uses 25 registers, so I can have 10 threads - is this correct?


I believe that's correct:

10 hardware threads * 64 work items * 25 GPRs = 16000 GPRs.

The clause temporaries used are T0 and T1. These consume:

2 clause temporaries * 64 work items * 2 hardware threads = 256 GPRs.

The total GPR allocation is therefore 16256 GPRs and the capacity of the register file is 16384.
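
To make that arithmetic easy to replay, here is a small C sketch (my own illustration, using only the numbers above: a 16384-entry register file, 64 work items per hardware thread, 25 GPRs per work item, and 2 clause temporaries reserved for 2 hardware threads):

/* Illustrative only: GPR-based occupancy arithmetic for one SIMD. */
#include <stdio.h>

int main(void)
{
    const int register_file = 16384;             /* GPRs per SIMD */
    const int wavefront     = 64;                /* work items per hardware thread */
    const int gprs          = 25;                /* GPRs used by the kernel */
    const int clause_temps  = 2 * wavefront * 2; /* T0, T1 for 2 hardware threads */

    /* How many hardware threads fit once the clause temporaries are reserved? */
    int threads = (register_file - clause_temps) / (wavefront * gprs);
    printf("hardware threads: %d\n", threads);   /* prints 10 */
    return 0;
}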

0 Likes

Originally posted by: nnsan My code reads from two streams, but prunedtree's code reads from four streams. Instead, I use the offset feature of sample_resource. Do you think these differences affect the performance (i.e. the cache hit rate)?

 

I think it shouldn't have an impact.

0 Likes

prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique: 444 GB/s out of a theoretical 480 GB/s.

Additionally, his code uses a tiled, rather than linear, access pattern, maximising efficiency. His kernel reads pre-computed addresses, i.e. work item IDs are assigned along a 2D space-filling curve.

The use of aoffimmi shouldn't have any effect. It saves some registers though, which is very, very useful.

Finally, since A is transposed, the memory access pattern is not strictly "matrix multiply".
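
For readers unfamiliar with the idea, here is a small C sketch of one common 2D space-filling curve, Morton (Z) order. This is only an illustration; nothing in this thread documents the exact curve prunedtree used:

/* Morton (Z-order) decoding: map a linear work item ID onto tiled 2D
   coordinates. One example of a 2D space-filling curve; not necessarily
   the exact pattern prunedtree used. */
#include <stdio.h>

/* Compact the even bits of v into the low bits (Morton decode). */
static unsigned compact_bits(unsigned v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

int main(void)
{
    /* Walk IDs 0..15 along the Z curve of a 4x4 tile. */
    for (unsigned id = 0; id < 16; ++id)
        printf("id %2u -> (%u, %u)\n", id,
               compact_bits(id), compact_bits(id >> 1));
    return 0;
}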

 

0 Likes

Originally posted by: Jawed prunedtree also reported extremely high L1 bandwidth usage in his original description of his technique: 444 GB/s out of a theoretical 480 GB/s.


prunedtree wrote that he achieved 444 GB/s with synthetic tests; he had a slightly smaller value for matrix mul. This gives ~90% efficiency, which I have no problem with.

But the code posted by nnsan would need to get 99% of theoretical bandwidth to achieve 2.1 TFLOP/s. This simply isn't possible on an ATI GPU.
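
As a back-of-the-envelope illustration of where such a bandwidth requirement comes from (my own assumptions, not hazeman's exact figures: an 8x8 register block fed by float4 reads fetches 16 floats = 64 bytes per 64 MADs = 128 flops per k-step):

/* Rough illustration only: required L1 fetch bandwidth for a target
   flop rate, assuming 2 flops per fetched byte (8x8 block, float4 reads). */
#include <stdio.h>

int main(void)
{
    const double flops_per_byte = 128.0 / 64.0; /* 8x8 block, float4 fetches */
    const double target_flops   = 2.1e12;       /* 2.1 TFLOP/s */

    double needed_gbs = target_flops / flops_per_byte / 1e9;
    printf("needed L1 bandwidth: %.0f GB/s\n", needed_gbs); /* ~1050 */
    return 0;
}

Against the ~1088 GB/s RV870 L1 peak quoted in the next post, that is already around 97% efficiency under these assumptions.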

0 Likes

http://forum.beyond3d.com/showthread.php?p=1369860#post1369860

Optimizing a little for RV870, I managed to reach up to 1083 GB/s (L1 fetch bandwidth peaks at 1088 GB/s afaik) with 12x8 blocks (that puts the TMU bottleneck at 2.6 TFlop/s), yet this only achieves 2.17 TFlop/s in practice: the ALU becomes the bottleneck.


Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.

0 Likes

Also, nnsan's kernel is a pixel shader (I've only just noticed), so it is, by default, using tiling for work item location - which means the kernel should have a very favourable L1 access pattern. In other words, nnsan does not need to worry about trying to duplicate the pre-computed addressing that prunedtree has used.


This is exactly the same as the kernel in CAL++, so nnsan should have similar performance.

Btw, are you sure that prunedtree used EXACTLY the same pattern that is used in the pixel shader (especially as there is no official info about the pattern used in the GPU*)? A small change there can have a significant impact on cache hit performance.

* I think I've seen a comment somewhere that ATI doesn't want to disclose it because it can change with the architecture (or something).

0 Likes

Originally posted by: hazeman Btw, are you sure that prunedtree used EXACTLY the same pattern that is used in the pixel shader (especially as there is no official info about the pattern used in the GPU*)? A small change there can have a significant impact on cache hit performance.


No, I'm not saying that the pattern used is the same in both cases - merely that prunedtree reported that tiling improves performance in his algorithm. In fact, tiling is essential, as performance for N>1024 "crumbled".

prunedtree's tiling, for example, might account for the fact that there's a 4:1 ratio in the vertical and horizontal directions due to float4 packing, whereas the pixel shader's tiling doesn't.

0 Likes

Excuse me for being a bit off the current discussion topic.

In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong. A kernel design that works at one combination of matrix dimensions, inner/outer blocking sizes, accumulation loop order, etc may fail at another point in the kernel solution space. While everything I am doing is with OpenCL, I believe the same philosophy applies to CAL/IL (as it is also translated by a compiler) and even ISA if there is further translation done somewhere down at the device level.

I believe data validation should be the ultimate criterion for correctness rather than code inspection (even at the ISA level). If the output is good, then we know however the kernel works, the calculation is correct. We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes.
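
As a purely illustrative example of this philosophy (all names here are made up for the sketch), a host-side check in C might look like this:

#include <math.h>
#include <stdio.h>

/* Naive host-side reference: C = A*B, row-major, n x n, single precision. */
static void sgemm_ref(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] = acc;
        }
}

/* Element-wise comparison with a relative tolerance; returns 1 if OK. */
static int validate(int n, const float *C_gpu, const float *C_ref, float tol)
{
    for (int i = 0; i < n*n; ++i) {
        float denom = fabsf(C_ref[i]) > 1.0f ? fabsf(C_ref[i]) : 1.0f;
        if (fabsf(C_gpu[i] - C_ref[i]) / denom > tol) {
            printf("mismatch at %d: got %g, expected %g\n",
                   i, C_gpu[i], C_ref[i]);
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    /* Tiny self-test: the "GPU" result here is just the reference again. */
    float A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4], D[4];
    sgemm_ref(2, A, B, C);
    sgemm_ref(2, A, B, D);
    printf("valid: %d\n", validate(2, C, D, 1e-5f));
    return 0;
}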

0 Likes

I've recoded nnsan's kernel in CAL++. Here are the results for a 4770:

2 images with burst write (30 regs) - 523 GFLOP/s

2 images without burst write (26 regs) - 623 GFLOP/s (this is the closest to prunedtree's code - I think he had 28 regs used)

1 image with burst write (30 regs) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than for nnsan's kernel

1 image without burst write (26 regs) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than for nnsan's kernel

The difference in internal loop length is simply a result of the bad quality of the CAL/IL compiler. Also, a burst write shouldn't increase register usage: the registers for a burst must have consecutive indexes, so any reasonable compiler should use the correct indexes from the beginning. But this is too much for the CAL/IL compiler :/

PS. The code is available in the CAL++ svn.

0 Likes

Originally posted by: cjang In my experience, it is important to validate output data as there are cases in which a kernel seems perfectly fine except that the results are wrong.


You are totally right here.

We then know how much work is done by whatever measure of complexity we are using during the time in which the kernel executes.


This is not always a valid assumption - sometimes, by mistake, we can use a different value (I've seen this happen).

And with ATI IL, I think looking at the ISA is a basic optimisation tool. The CAL/IL compiler can make really stupid mistakes, which can be corrected by slightly changing the code.

 

0 Likes

The input data must either be padded with zeroes, to ensure that each work item never fetches out of bounds, or the kernel must do bounds checking. The latter option is generally going to produce too much computational overhead, though.

So the padding has to be adapted to the block dimensions if the implementation has variable-sized blocks.

In general, the GPU will have junk data in memory (no different from a PC's memory). Only sufficient zero-padding (done host-side, then everything copied) is going to ensure that the junk doesn't affect the result.
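
A minimal host-side padding sketch in C (illustrative only; it assumes row-major storage and a square block size, and pad_matrix is a made-up helper name):

/* Illustrative zero-padding on the host: round each dimension up to a
   multiple of the block size and copy the matrix into the top-left
   corner of a zero-filled buffer. The caller frees the result. */
#include <stdlib.h>
#include <string.h>

static float *pad_matrix(const float *src, int rows, int cols,
                         int block, int *prows, int *pcols)
{
    int pr = (rows + block - 1) / block * block;
    int pc = (cols + block - 1) / block * block;
    float *dst = calloc((size_t)pr * pc, sizeof(float)); /* zero-filled */
    if (!dst)
        return NULL;
    for (int i = 0; i < rows; ++i)
        memcpy(dst + (size_t)i * pc, src + (size_t)i * cols,
               (size_t)cols * sizeof(float));
    *prows = pr;
    *pcols = pc;
    return dst;
}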

0 Likes

Jawed, thanks for the precise numbers.

I think in PS mode it automatically changes the number of threads depending on the number of registers, and we have no control. In CS, we can specify the number of threads through the dcl_num_thread instruction (correct me if I'm wrong).

Originally posted by: hazeman I've recoded nnsan's kernel in CAL++. Here are the results for a 4770:

2 images with burst write (30 regs) - 523 GFLOP/s

2 images without burst write (26 regs) - 623 GFLOP/s (this is the closest to prunedtree's code - I think he had 28 regs used)

1 image with burst write (30 regs) - 548 GFLOP/s - the internal loop (ISA) is 2 ops longer than for nnsan's kernel

1 image without burst write (26 regs) - 660 GFLOP/s - the internal loop (ISA) is 2 ops shorter than for nnsan's kernel



Your results seem to indicate that register usage is critical, and hence so is the number of threads. As you know, without the "breakc" hack my code requires 36 registers and shows ~1500 GFLOP/s at maximum on Cypress. In my case, with the hack I have 256/25 > 10 threads and without it I have 256/36 > 7 threads. 10/7 is roughly the same as the performance ratio.

0 Likes

Originally posted by: nnsan I think in PS mode it automatically changes the number of threads depending on the number of registers, and we have no control. In CS, we can specify the number of threads through the dcl_num_thread instruction (correct me if I'm wrong).


You are correct.

Your results seem to indicate that register usage is critical, and hence so is the number of threads.


This is rather obvious - the whole kernel design is based on being able to run on 8 threads (hiding the latency of L1 cache hits).

As you know, without the "breakc" hack my code requires 36 registers and shows ~1500 GFLOP/s at maximum on Cypress. In my case, with the hack I have 256/25 > 10 threads and without it I have 256/36 > 7 threads. 10/7 is roughly the same as the performance ratio.


I've run tests with 9..16 threads. The penalty for using 9 or 10 threads instead of 8 is really huge (40-50%), and 16 is slightly better than 8. So I really doubt that your kernel is running on 10 threads.

I think what happened is that the kernel with 30 regs was running on 7 threads (or 4, which might be more efficient), even though the ISA reported that it can run 8 threads. Reducing the number of registers to 26 allowed it to run on 8 threads.

 

0 Likes

I've found a problem in the CAL++ matrix mul examples which caused increased register usage (+4 registers). The "problem" was with changing the mad ordering to trick the CAL/IL compiler into generating efficient code. I have to admit that the change was based on nnsan's IL kernel, but later, after looking at prunedtree's code, I noticed that he used it too.

Now the 2-image kernel (matrixmult) uses 26 regs, and the 1-image kernel (matrixmult2) uses 25 regs. I've also included an A*B kernel (matrixmult3).

Could someone with a 5870 card benchmark all 3 kernels? The code is available in the CAL++ svn.

0 Likes

Great work! The 2 TFLOP/s barrier is broken!

Results from a 5870, N = 1024, 2048, 3840 (GFLOP/s):

matrixmult - 1575, 1751, 1898

matrixmult2 - 1546, 1932, 2117

matrixmult3 - 1142, 1395, 1414

It looks like you didn't include matrixmult3 in the CMakeLists file, so I had to add it there myself.

 

(Edited, as I realized I made a mistake when trying to include matrixmult3 in the CMakeLists file, and added more comprehensive performance figures.)

0 Likes

I've updated matrixmult3 (with block size 8x8) - in theory it should work better (the numbers look OK). I've also corrected the CMakeLists file.

0 Likes

Just reran the numbers for the new matrixmult3:

N = 1024, 2048, 3072, 3840

GFLOP/s = 1065, 1391, 1385, 1413

Not significantly different from before.

0 Likes

Here are recent performance benchmarks for pure matrix multiply (not SGEMM) with auto-tuned OpenCL kernels.

OpenCL SDK v2.1, Catalyst 10.4, Ubuntu 9.10 x86_64
matrix multiply (not SGEMM) at N/M/K = 3840, results in GFLOP/s:
A row-major, B row-major: 1458
A col-major, B row-major: 1437
A row-major, B col-major: 917
A col-major, B col-major: 727

0 Likes

Originally posted by: mikeaclark

Results from a 5870, N = 1024, 2048, 3840:

 

matrixmult2 - 1546, 1932, 2117

This result is consistent with my result: single precision.

My code is also somehow working correctly

*beer*


0 Likes

Hi, 

I have submitted my results as a paper contribution to a workshop.

Thank you for the discussions, hazeman, Jawed and cjang.

The abstract is available here:

http://galaxy.u-aizu.ac.jp/trac/note/wiki/Fastest_GEMM_implementation_On_Cypress

 

BTW, I have recently noticed that the OpenCL users' guide includes in-depth information on the Cypress architecture. See Chapter 4. Many of the topics we discussed here are officially explained.

http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf

0 Likes

Hi, 

At last, I have made a part of our kernels public. A sample program for our DGEMM kernels is available at

http://github.com/dadeba/dgemm_cypress/

We plan to distribute the SGEMM and DDGEMM kernels as well, but not yet, since they are still a work in progress.

0 Likes

nnsan,
Congrats on your work, very impressive. It is nice to see people using the newer features of IL to make the code simpler to read.
0 Likes

I've just received an e-mail notifying me about the new forum system.

I have not posted for quite a long time, but the notification came at a good time.

Here is our new DGEMM and SGEMM performance on 7970.

http://galaxy.u-aizu.ac.jp/data/DGEMM_12.1.png

http://galaxy.u-aizu.ac.jp/data/SGEMM.png

Still, we are sticking with IL at the moment. Catalyst 12.1 works well and gives us a very nice performance bump!

0 Likes

Thanks for the update nnsan!

0 Likes

Wow!

This is really impressive. The DGEMM result is really cool - ~700 GFLOP/s is a lot.

One question about this: as far as I know, the 7970 has only a 1:4 DP:SP ratio. Is this true? And will the FirePro version have a 1:2 ratio?

If yes, this card will be really cool.

0 Likes