Hi, I did some tests with acmlgpu-1.1.2. When I run Info.exe, it shows
Hi there, what's the difference between ACML-GPU and clAmdBlas? Is it that one uses CAL and the other OpenCL? And what about performance (at least for a single-GPU setup)?
Hi, I think you are right. clAmdBlas needs OpenCL, but I find that there is only an sgemm example for clAmdBlas, so I may not be able to compare the dgemm performance of the two libraries?
Yes, there's only the sgemm example (but be aware that this example has a typo - matrix A is written to bufB - check this post http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150952&enterthread=y). Still, sgemm and dgemm perform exactly the same mathematical operations except for the data type (double instead of float), so I believe you can take the exact same example, change only the types, and run it on a card that supports double precision computations (like your HD 6970); see the sketch below.
If you compare performance results with caldgemm, please let us know.
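For what it's worth, here is a minimal standalone sketch of a double-precision GEMM call with clAmdBlas, in the spirit of the shipped sgemm example, with the bufB typo corrected and the types switched to double. Treat it as an approximation rather than a patch: the buffer names, sizes, and the row-major/no-transpose flags are my own choices, and the clAmdBlasDgemm argument list is written from the 1.x header as I remember it, so check it against clAmdBlas.h before relying on it.

/* Sketch: double-precision GEMM with clAmdBlas, following the shipped sgemm
 * example (float -> double) and with the buffer-upload typo corrected. */
#include <stdio.h>
#include <stdlib.h>
#include <clAmdBlas.h>

#define M 512
#define N 512
#define K 512

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    cl_event event = NULL;

    double *A = malloc(M * K * sizeof(double));
    double *B = malloc(K * N * sizeof(double));
    double *C = malloc(M * N * sizeof(double));
    for (int i = 0; i < M * K; i++) A[i] = 1.0;
    for (int i = 0; i < K * N; i++) B[i] = 2.0;
    for (int i = 0; i < M * N; i++) C[i] = 0.0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    clAmdBlasSetup();

    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  M * K * sizeof(double), NULL, &err);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  K * N * sizeof(double), NULL, &err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(double), NULL, &err);

    /* The sgemm example mistakenly uploaded A into bufB; here A and B go to their own buffers. */
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, M * K * sizeof(double), A, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, K * N * sizeof(double), B, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0, M * N * sizeof(double), C, 0, NULL, NULL);

    /* C = alpha*A*B + beta*C in double precision (the device must support fp64). */
    err = clAmdBlasDgemm(clAmdBlasRowMajor, clAmdBlasNoTrans, clAmdBlasNoTrans,
                         M, N, K, 1.0, bufA, K, bufB, N, 0.0, bufC, N,
                         1, &queue, 0, NULL, &event);
    if (err != clAmdBlasSuccess) {
        fprintf(stderr, "clAmdBlasDgemm failed: %d\n", err);
    } else {
        clWaitForEvents(1, &event);
        clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, M * N * sizeof(double), C, 0, NULL, NULL);
        printf("C[0] = %f (expected %f)\n", C[0], 2.0 * K);
    }

    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);
    clAmdBlasTeardown();
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(A); free(B); free(C);
    return 0;
}

Keep in mind the device has to expose double precision (cl_khr_fp64 or cl_amd_fp64), otherwise the dgemm call will just return an error.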
OK, here are the acmlgpu-1.1.2 results first, for both dgemm and sgemm on a single HD6970.
rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_dgemm.exe
Matrix Size   Time in Seconds   Performance in Megaflops
-----------   ---------------   ------------------------
400 0.758880 168
600 0.030139 14333
800 0.035727 28662
1000 0.049771 40184
1200 0.067966 50849
1400 0.068995 79541
1600 0.086494 94711
1800 0.111565 104549
2000 0.134369 119075
2200 0.161584 131795
2400 0.184988 149458
2600 0.214080 164200
2800 0.241470 181819
3000 0.288899 186916
3200 0.333995 196218
3400 0.415724 189087
3600 0.494598 188662
3800 0.565289 194137
4000 0.662003 193352
4200 0.762732 194270
4400 0.829545 205375
4600 0.866307 224714
4800 0.953990 231851
5000 1.100728 227122
5200 1.219832 230536
5400 1.368855 230066
5600 1.569290 223815
5800 1.789924 218011
6000 2.022892 213555
6200 1.990262 239494
6400 2.092950 250501
6600 2.339395 245786
6800 2.555408 246091
7000 2.725500 251696
7200 2.983605 250199
7400 3.627176 223437
7600 3.460915 253676
7800 3.715710 255430
8000 4.046330 253068
rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_sgemm.exe
Matrix Size   Time in Seconds   Performance in Megaflops
-----------   ---------------   ------------------------
400 0.711834 179
600 0.021887 19738
800 0.029878 34273
1000 0.030939 64643
1200 0.035729 96728
1400 0.042776 128295
1600 0.050936 160829
1800 0.061199 190590
2000 0.071055 225176
2200 0.083770 254220
2400 0.092456 299038
2600 0.108912 322755
2800 0.122735 357714
3000 0.132517 407495
3200 0.158700 412954
3400 0.190118 413468
3600 0.211197 441824
3800 0.244340 449143
4000 0.273162 468585
4200 0.352204 420711
4400 0.388507 438519
4600 0.404213 481607
4800 0.450925 490511
5000 0.492675 507434
5200 0.560158 502030
5400 0.652638 482546
5600 0.721320 486929
5800 0.721891 540558
6000 0.828848 521205
6200 0.966798 493025
6400 0.994621 527123
6600 1.134357 506888
6800 1.240944 506762
7000 1.281981 535109
7200 1.296301 575866
7400 1.446126 560427
7600 1.502031 584510
7800 1.653382 574038
8000 1.928930 530864
rolly@rolly-p5q-pro:~/GPGPUexamples$
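For reference, the Megaflops column above is just 2*n^3 divided by the wall time (the printed times and rates are consistent with that). A minimal harness in that spirit, assuming ACML's C interface for dgemm from acml.h (the shipped time_dgemm source may well differ; the wall_time helper and the loop bounds are my own), would look roughly like this:

/* Hypothetical timing loop in the spirit of time_dgemm.exe.
 * Reports 2*n^3 / time in Megaflops, matching the table above. */
#include <acml.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    for (int n = 400; n <= 8000; n += 200) {
        double *a = malloc((size_t)n * n * sizeof(double));
        double *b = malloc((size_t)n * n * sizeof(double));
        double *c = malloc((size_t)n * n * sizeof(double));
        for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        double t0 = wall_time();
        /* C = 1.0*A*B + 0.0*C; ACML-GPU decides internally whether to offload this call. */
        dgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        double t = wall_time() - t0;

        printf("%6d %12.6f %12.0f\n", n, t, 2.0 * n * n * n / t / 1e6);
        free(a); free(b); free(c);
    }
    return 0;
}

As far as I understand, ACML-GPU intercepts the dgemm call itself, so the harness does not change between the CPU and GPU builds of the library.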
Now the caldgemm results:
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 4096 -n 4096
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b58989a8010, B=0x2b589a9e9010, C=0x2b589c9fa010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b1edc693010, B=0x2b1ee0714010, C=0x2b1ee4725010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.581 System Gflops 236.899
What I can conclude so far:
(1) Only one HD6990 can be used by these libraries, no matter how many extra cards are installed?
(2) acml-gpu looks like it has better performance?
Thanks for reading!
(1) You mean 1 HD6970, right? Well, it seems so, since the performance is roughly 1/5 of the nominal TERAFLOP number in the best of the two libraries. Can anyone confirm this?
(2) There's something I didn't understand. Is the matrix size for acmlgpu-1.1.2 equal to 8000x8000? Because caldgemm does the product with k fixed at 1024 (8192x1024 times 1024x8192), which is a huge difference. Since the size of the matrices has a visible influence on the Gflops, we can hardly compare these results. Could you re-run them for comparable matrix sizes? However, I don't know if the algorithm is optimized for square matrices or not.
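To make the size mismatch concrete, here is a quick flop count (assuming both tools report 2*m*n*k divided by the wall time, which matches the times and Gflops they print):

/* Flop counts behind the two benchmark runs above.
 * Assumption: both tools report 2*m*n*k / time. */
#include <stdio.h>

int main(void)
{
    double acml = 2.0 * 8000 * 8000 * 8000;   /* time_dgemm.exe, square 8000x8000x8000 */
    double cal  = 2.0 * 8192 * 1024 * 8192;   /* dgemm_bench -m 8192 -n 8192 (k fixed at 1024) */

    printf("acml-gpu 8000^3        : %.3e flops\n", acml);
    printf("caldgemm 8192x1024x8192: %.3e flops\n", cal);
    printf("ratio                  : %.1fx\n", acml / cal);   /* about 7.5x more work */
    return 0;
}

So the caldgemm run does roughly 7.5 times less arithmetic than the 8000x8000 acml-gpu point, which alone makes the Gflops numbers hard to compare.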
And, if you have a little more time, could you test clAmdBlasDgemm (and maybe clAmdBlasSgemm, to compare to time_sgemm.exe)?
I am having similar memory issues. I am running an HD5870. I tried following the instructions given on the Wiki.
Here are the things I did not do:
Instead of using git, I just downloaded/unzipped the latest version from the Files Page
-march=native didn't work, so I just deleted it from the makefile so that it would compile error-free.
I did not use the binary patch for the Catalyst driver.
Here is my output, following your instruction to run ./dgemm_bench -g -z -v -d:
./dgemm_bench -g -z -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1016
Initializing CAL
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Merger Thread 0 started
Merge Thread 0, setting CPU mask 2
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM
Thanks