laobrasuca
Journeyman III

caldgemm with HD6990s

Hi, I did some test with acmlgpu1.1.2. as I run the Info.exe, it shows

 

Hi there, what's the difference between ACML-GPU and clAmdBlas? Is it that one is built on CAL and the other on OpenCL? And how do they compare in performance (at least for a single-GPU setup)?

rollyng
Journeyman III

Hi, I think you are right. clAmdBlas needs OpenCL, but I can only find an sgemm example for clAmdBlas, so I may not be able to compare the dgemm performance of the two libraries.

laobrasuca
Journeyman III

Yes, there's only the sgemm example (be aware that it contains a typo: matrix A is written to bufB; see this post: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150952&enterthread=y). Still, sgemm and dgemm perform exactly the same mathematical operations apart from the data type (double instead of float), so I believe you can reuse the same example by changing only the types, provided your card supports double-precision computation (like your 6970).

If you compare the performance results with caldgemm, please let us know.

rollyng
Journeyman III

OK, here are the acmlgpu-1.1.2 results first, for both dgemm and sgemm on a single HD6970.

rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_dgemm.exe
Matrix  Time in       Performance
Size    Seconds       in Megaflops
------  ------------  ------------
   400      0.758880          168
   600      0.030139        14333
   800      0.035727        28662
  1000      0.049771        40184
  1200      0.067966        50849
  1400      0.068995        79541
  1600      0.086494        94711
  1800      0.111565       104549
  2000      0.134369       119075
  2200      0.161584       131795
  2400      0.184988       149458
  2600      0.214080       164200
  2800      0.241470       181819
  3000      0.288899       186916
  3200      0.333995       196218
  3400      0.415724       189087
  3600      0.494598       188662
  3800      0.565289       194137
  4000      0.662003       193352
  4200      0.762732       194270
  4400      0.829545       205375
  4600      0.866307       224714
  4800      0.953990       231851
  5000      1.100728       227122
  5200      1.219832       230536
  5400      1.368855       230066
  5600      1.569290       223815
  5800      1.789924       218011
  6000      2.022892       213555
  6200      1.990262       239494
  6400      2.092950       250501
  6600      2.339395       245786
  6800      2.555408       246091
  7000      2.725500       251696
  7200      2.983605       250199
  7400      3.627176       223437
  7600      3.460915       253676
  7800      3.715710       255430
  8000      4.046330       253068


rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_sgemm.exe
Matrix  Time in       Performance
Size    Seconds       in Megaflops
------  ------------  ------------
   400      0.711834          179
   600      0.021887        19738
   800      0.029878        34273
  1000      0.030939        64643
  1200      0.035729        96728
  1400      0.042776       128295
  1600      0.050936       160829
  1800      0.061199       190590
  2000      0.071055       225176
  2200      0.083770       254220
  2400      0.092456       299038
  2600      0.108912       322755
  2800      0.122735       357714
  3000      0.132517       407495
  3200      0.158700       412954
  3400      0.190118       413468
  3600      0.211197       441824
  3800      0.244340       449143
  4000      0.273162       468585
  4200      0.352204       420711
  4400      0.388507       438519
  4600      0.404213       481607
  4800      0.450925       490511
  5000      0.492675       507434
  5200      0.560158       502030
  5400      0.652638       482546
  5600      0.721320       486929
  5800      0.721891       540558
  6000      0.828848       521205
  6200      0.966798       493025
  6400      0.994621       527123
  6600      1.134357       506888
  6800      1.240944       506762
  7000      1.281981       535109
  7200      1.296301       575866
  7400      1.446126       560427
  7600      1.502031       584510
  7800      1.653382       574038
  8000      1.928930       530864
rolly@rolly-p5q-pro:~/GPGPUexamples$
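
For anyone reading the tables above: the Megaflops column can be reproduced from the other two columns if we assume the benchmark counts 2*n^3 floating-point operations for an n x n matrix product (the usual GEMM convention; the benchmark source is not shown in this thread, so that is an assumption). A minimal sketch, with a hypothetical helper name `gemm_megaflops` and a few rows copied from the time_dgemm.exe output:

```python
def gemm_megaflops(n: int, seconds: float) -> float:
    """Megaflops for an n x n x n matrix multiply, assuming 2*n^3 operations."""
    flops = 2.0 * n ** 3
    return flops / seconds / 1e6


# (matrix size, seconds, reported Megaflops) copied from the dgemm table above
rows = [
    (400, 0.758880, 168),
    (4000, 0.662003, 193352),
    (8000, 4.046330, 253068),
]

for n, seconds, reported in rows:
    # The reported column appears to be truncated to an integer.
    print(n, int(gemm_megaflops(n, seconds)), reported)
```

Note that the first row (size 400) is far slower than the rest, which probably includes one-time initialization, so it is not representative of steady-state throughput.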

rollyng
Journeyman III

Now caldgemm:

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 4096 -n 4096
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b58989a8010, B=0x2b589a9e9010, C=0x2b589c9fa010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b1edc693010, B=0x2b1ee0714010, C=0x2b1ee4725010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.581 System Gflops 236.899
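
The caldgemm figures can be sanity-checked the same way, this time with the non-square count 2*m*n*k. The sketch below pulls m, k, n and the time out of an abridged copy of the first run's log and recomputes the Gflops. The function name `recompute_gflops` and the regular expressions are illustrative assumptions about the log format seen in this thread, not part of caldgemm.

```python
import re

# Abridged copy of the first caldgemm run above (the address fields are omitted).
log = (
    "Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000\n"
    "Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 "
    "(Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602\n"
)


def recompute_gflops(text: str):
    """Return (recomputed Gflops, reported Gflops), assuming 2*m*n*k operations."""
    dims = re.search(r"m=(\d+) k=(\d+) n=(\d+)", text)
    perf = re.search(r"System Time ([\d.]+) System Gflops ([\d.]+)", text)
    m, k, n = (int(v) for v in dims.groups())
    seconds = float(perf.group(1))
    reported = float(perf.group(2))
    return 2.0 * m * n * k / seconds / 1e9, reported


recomputed, reported = recompute_gflops(log)
print(f"recomputed {recomputed:.1f} Gflops, reported {reported:.1f} Gflops")
```

The two values agree to within about half a Gflop; the small gap is what one would expect if the printed time is rounded to three decimals.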

rollyng
Journeyman III

What I can conclude so far:

(1) These libraries can only use one HD6990, no matter how many additional cards are installed?

(2) acml-gpu seems to have better performance?

Thanks for reading!

laobrasuca
Journeyman III

(1) You mean 1 HD6970, right? Well, that seems to be the case, since the performance is roughly 1/5 of the nominal teraflop figure for the better of the two libraries. Can anyone confirm this?

(2) There's something I didn't understand. Are the matrices for acmlgpu-1.1.2 8000x8000? caldgemm computes the product with an 8192x1024 matrix A, which is a huge difference. Since the matrix size has a visible influence on the Gflops, we can hardly compare these results. Could you re-run them with comparable matrix sizes? However, I don't know whether the algorithm is optimized for square matrices or not.
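
To make the size mismatch concrete, here is a rough sketch that compares the two runs by operation count, assuming the usual 2*m*n*k count for a GEMM. The helper names (`gemm_flops`, `equivalent_square_size`) are made up for this illustration. Equal operation counts do not mean equal difficulty, since the shapes move different amounts of data, so treat this only as a first-order way to pick comparable rows.

```python
def gemm_flops(m: int, n: int, k: int) -> float:
    """Floating-point operations for C(m x n) += A(m x k) * B(k x n)."""
    return 2.0 * m * n * k


def equivalent_square_size(m: int, n: int, k: int) -> float:
    """Side length s of a square GEMM (s x s x s) with the same operation count."""
    return (m * n * k) ** (1.0 / 3.0)


# The largest acml-gpu run (8000 x 8000 x 8000) versus the largest caldgemm run.
acml = gemm_flops(8000, 8000, 8000)
cal = gemm_flops(8192, 8192, 1024)
print(f"work ratio acml/caldgemm: {acml / cal:.2f}")

# Square sizes that would do the same amount of work as the two caldgemm runs.
for m, n, k in [(4096, 4096, 1024), (8192, 8192, 1024)]:
    print(m, n, k, "->", round(equivalent_square_size(m, n, k)))
```

Under that assumption, the caldgemm runs correspond in work to square sizes of roughly 2580 and 4096, so the acml-gpu rows near 2600 and 4000-4200 would be the closest ones to look at.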

And, if you have a little more time, could you test clAmdBlasDgemm (and maybe clAmdBlasSgemm, to compare with time_sgemm.exe)?

Gametimehero
Journeyman III

I am having similar memory issues.  I am running an HD5870.  I tried following the instructions given on the Wiki.

Here are the things I did differently:

Instead of using git, I just downloaded and unzipped the latest version from the Files page.

-march=native didn't work, so I deleted it from the makefile so that it compiles without errors.

I did not use the binary patch for the Catalyst driver.

 

Here is my output from running ./dgemm_bench -g -z -v -d, as you instructed:

./dgemm_bench -g -z -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1016
Initializing CAL
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Merger Thread 0 started
Merge Thread 0, setting CPU mask 2
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM

 

Thanks
