rollyng
Journeyman III

caldgemm with HD6990s

Can I run caldgemm on HD6990s?

Hi,

I am sorry, I could not find a caldgemm forum, so I am posting here in the hope that its developers read this message.

I have 4 HD6990s and I would really like to see how they perform in GFLOPS, which is how I came across this tool: http://code.compeng.uni-frankfurt.de/projects/caldgemm/wiki

I have Ubuntu 10.10 x86_64 + AMD driver 11.5 + SDK 2.4 and I followed the instructions on the wiki. However, I had to use "make TARGET=NEHALEM NO_MEMPOLICY=1 -j" to compile GotoBLAS2, since I have two E5620 CPUs.

I compiled caldgemm and the build output looks fine.

But when I run it, it prints errors and hangs idle. Can anyone help? Thanks!

rolly@rolly-X8DTG-QF:~/caldgemm$ make
g++ -c caldgemm.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
caldgemm.cpp: In function ‘void* divide_wrapper(void*)’:
caldgemm.cpp:1661: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘int’
caldgemm.cpp: In member function ‘int caldgemm::RunCALDGEMM(double*, double*, double*, double, double, size_t, size_t, size_t, size_t, size_t, size_t, CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int)’:
caldgemm.cpp:2260: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
caldgemm.cpp:2269: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
caldgemm.cpp:2403: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
caldgemm.cpp:2421: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
caldgemm.cpp: In member function ‘int caldgemm::DGEMM_prepare(size_t, int, unsigned int)’:
caldgemm.cpp:2738: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
caldgemm.cpp:2761: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
g++ -c benchmark.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
g++ -o dgemm_bench caldgemm.o benchmark.o -lpthread -ldl -L/usr/X11R6/lib -laticalrt -laticalcl -lgfortran ../GotoBLAS2/libgoto2.a

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c
Use -? for help
Cannot use multiple devices without multithreading
Segmentation fault

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
Use -? for help
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc AERROR locking Pages
...alloc BERROR locking Pages
...alloc CERROR locking Pages
Memory allocation error allocating matrices

27 Replies
Marix
Adept II

There currently is no forum for caldgemm, however there is a (low volume) mailing list at https://compeng.uni-frankfurt.de/mailman/listinfo/caldgemm .

What I can see from your output is the following:

  • Use -z if you want to use both GPUs.
  • Memory allocation fails. I assume your ulimit for max locked memory is too low. For machines on which you want to benchmark it makes sense to set "ulimit -l unlimited". You might also want to specify that in /etc/security/limits.conf. 
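The locked-memory limit can also be checked from a script before a benchmark is started. The following is a minimal sketch using Python's standard `resource` module; the helper names `describe_limit` and `can_lock` and the 2 GiB requirement are hypothetical illustrations, not part of caldgemm.

```python
import resource

def describe_limit(value):
    """Render an rlimit value (in bytes) the way `ulimit -l` prints it (in kbytes)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kbytes"

def can_lock(soft_limit, needed_bytes):
    """Return True if the soft RLIMIT_MEMLOCK permits locking `needed_bytes` bytes."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= needed_bytes

# Query the limits of the current process (the values behind `ulimit -l`).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("max locked memory: soft =", describe_limit(soft), "| hard =", describe_limit(hard))

# Hypothetical requirement: 2 GiB of page-locked matrix data.
needed = 2 * 1024**3
if not can_lock(soft, needed):
    print("limit too low for page-locked matrices; consider `ulimit -l unlimited`")
```

With the default of 64 kbytes, `can_lock` returns False for any realistic matrix size, which would be consistent with the "ERROR locking Pages" messages.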

Hi Marix,

If I run ulimit -a, I get:

rolly@rolly-X8DTG-QF:~$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

So I made some changes to /etc/security/limits.conf, following

http://www.akadia.com/services/ora_enable_core.html

Now I can run ulimit -l unlimited, and the limits look like this:

rolly@rolly-X8DTG-QF:~/caldgemm$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Now, running the benchmark:

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b2392afc010, B=0x2b2394b3d010, C=0x2b2396b4e010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.656 System Gflops 52.459

But with the -z option, it still fails:

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
Use -? for help
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM

Any hint on the last error? Thank you!


Hi Marix,

A further update on my problem: I think it is caused by having multiple GPUs. I did the same on another system with the same software configuration but just a single HD6970, and it now produces reasonable results:

Is it true that the system performance is just 164 GFLOPS, versus 465 GFLOPS for the kernel, on a single HD6970?

On my 4x HD6990 system, the -g parameter does not work at all...

Thanks!

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -c
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ab9dea74010, B=0x2ab9e0ab5010, C=0x2ab9e2ac6010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 1.652 System Gflops 20.822

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad9ccb3c010, B=0x2ad9ceb7d010, C=0x2ad9d0b8e010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.418

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -v
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad235f2a010, B=0x2ad237f6b010, C=0x2ad239f7c010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.892
Times:  Kernel                    Divide (1,1)          Merge                 Copy To               Copy From
        0.0737 (465.7270 Gflops)  0.0296 (2.2666 GB/s)  0.0934 (1.4375 GB/s)  0.0128 (5.2401 GB/s)  0.0000 (0.0000 Gb/s)
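The gap between the system and kernel figures in the -g -v run follows from dividing the same amount of work by two different times: the kernel time covers only the GPU computation, while the system time also includes the divide, merge and copy steps listed in the Times line. A minimal sketch, assuming the conventional count of 2·m·n·k floating-point operations for a DGEMM (an assumption; caldgemm's own count may differ slightly):

```python
def dgemm_gflops(m, n, k, seconds):
    """Gflop/s for one DGEMM, assuming 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9

# Times taken from the "-g -v" run on the single HD6970.
system_time = 0.210   # "System Time": whole DGEMM call, wall clock
kernel_time = 0.0737  # "Kernel" column: GPU kernel only

system = dgemm_gflops(4096, 4096, 1024, system_time)
kernel = dgemm_gflops(4096, 4096, 1024, kernel_time)

print(f"system: {system:.1f} Gflop/s, kernel: {kernel:.1f} Gflop/s")
print(f"kernel share of the system time: {kernel_time / system_time:.0%}")
```

Both results land within about one Gflop/s of the reported 163.892 and 465.7270, so the reported system figure is plausible for a run in which the kernel accounts for only about a third of the wall-clock time.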


Regarding the low system performance: there is currently a known performance issue with all HD6000 series devices. It can be tuned around, and there is a new version with some proper workarounds in the queue. However, the copy speeds actually look quite good in your case. Your matrix size is, however, rather small. While you should get quite some performance at that size, it would be interesting to see what you can reach at 20k or even 40k for m and n (k may stay at 1024).


Originally posted by: Marix Regarding the low system performance: there is currently a known performance issue with all HD6000 series devices. It can be tuned around, and there is a new version with some proper workarounds in the queue. However, the copy speeds actually look quite good in your case. Your matrix size is, however, rather small. While you should get quite some performance at that size, it would be interesting to see what you can reach at 20k or even 40k for m and n (k may stay at 1024).

Hi Marix, thanks for the info. I reran the test, this time on the single HD6970 with 4GB of host memory, so I can only go up to m=n=16384.

Please have a look at the output; the best I get is 212 GFLOPS. Thank you.

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ae9c5ae4010, B=0x2ae9c7b25010, C=0x2ae9c9b36010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.328 System Gflops 104.980

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2af03c0d3010, B=0x2af040154010, C=0x2af044165010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 1.003 System Gflops 137.169

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 16384 -n 16384
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b02a29f4010, B=0x2b02aaaf5010, C=0x2b02b2b06010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 3.640 System Gflops 151.174

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 32768 -n 32768
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 8192 -n 8192
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b2177e92010, B=0x2b217bf13010, C=0x2b217ff24010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.748 System Gflops 184.007

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 16384 -n 16384
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2ba071595010, B=0x2ba079696010, C=0x2ba0816a7010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.587 System Gflops 212.735

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 32768 -n 32768
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -v -m 16384 -n 16384
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b344bd96010, B=0x2b3453e97010, C=0x2b345bea8010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.949 System Gflops 186.577
Times:  Kernel                    Divide (4,4)          Merge                 Copy To               Copy From
        1.2474 (440.5147 Gflops)  0.2862 (0.9380 GB/s)  0.4570 (0.0000 GB/s)  0.1072 (2.5045 GB/s)  0.0000 (0.0000 Gb/s)
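The std::bad_alloc at m=n=32768 is most likely a plain host-memory shortage. A rough estimate of the three matrices in double precision shows why; this sketch ignores the row padding visible in the LDA/LDB/LDC values and any extra buffers caldgemm allocates, so real usage is somewhat higher. The function name is a hypothetical helper.

```python
def dgemm_host_bytes(m, n, k, bytes_per_element=8):
    """Bytes for A (m x k), B (k x n) and C (m x n) in double precision, without padding."""
    return bytes_per_element * (m * k + k * n + m * n)

for size in (16384, 32768):
    gib = dgemm_host_bytes(size, size, 1024) / 2**30
    print(f"m=n={size}, k=1024: about {gib:.2f} GiB of host memory")
```

By this estimate m=n=16384 needs about 2.25 GiB and fits into 4GB of host memory, while m=n=32768 needs about 8.5 GiB, dominated by the 8 GiB matrix C, which is the allocation that fails in the log.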


Hi rollyng,

As Marix said, there is an issue with 6000 series GPUs that decreases system performance dramatically. However, the -z parameter should actually work.

To help debug this problem, can you do the following:

  • Activate the DEBUG_MSG_ALLOCATION switch in caldgemm_config.h.
  • Set the STD_OUT parameter to stderr in caldgemm_config.h.
  • Run dgemm_bench -g -z -v -d and paste the output.

Can you please also tell me exactly which version you are using?

Cheers


Hi David,

Thanks for your message. I did the following for the CPU run; please take a look at this first.

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1385
Initializing CAL
Cannot use multiple devices without multithreading
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating temporary device buffer for device 0 context 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating temporary device buffer for device 0 context 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating temporary device buffer for device 0 context 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating temporary device buffer for device 0 context 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating temporary device buffer for device 0 context 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating temporary device buffer for device 0 context 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating temporary device buffer for device 0 context 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating temporary device buffer for device 0 context 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
Allocating device buffer for device 0 obuffer 1 buffer 11
Allocating device buffer for device 0 obuffer 1 buffer 12
Allocating device buffer for device 0 obuffer 2 buffer 2
Allocating device buffer for device 0 obuffer 2 buffer 3
Allocating device buffer for device 0 obuffer 2 buffer 5
Allocating device buffer for device 0 obuffer 2 buffer 6
Allocating device buffer for device 0 obuffer 2 buffer 7
Allocating device buffer for device 0 obuffer 2 buffer 8
Allocating device buffer for device 0 obuffer 2 buffer 9
Allocating device buffer for device 0 obuffer 2 buffer 10
Allocating device buffer for device 0 obuffer 2 buffer 11
Allocating device buffer for device 0 obuffer 2 buffer 12
Allocating device buffer for device 0 obuffer 3 buffer 2
Allocating device buffer for device 0 obuffer 3 buffer 3
Allocating device buffer for device 0 obuffer 4 buffer 2
Allocating device buffer for device 0 obuffer 4 buffer 3
Allocating device buffer for device 0 obuffer 5 buffer 2
Allocating device buffer for device 0 obuffer 5 buffer 3
Allocating device buffer for device 0 obuffer 6 buffer 2
Allocating device buffer for device 0 obuffer 6 buffer 3
Allocating device buffer for device 0 obuffer 7 buffer 2
Allocating device buffer for device 0 obuffer 7 buffer 3
Allocating device buffer for device 0 obuffer 8 buffer 2
Allocating device buffer for device 0 obuffer 8 buffer 3
Allocating device buffer for device 0 obuffer 9 buffer 2
Allocating device buffer for device 0 obuffer 9 buffer 3
Allocating device buffer for device 0 obuffer 10 buffer 2
Allocating device buffer for device 0 obuffer 10 buffer 3
Allocating device buffer for device 0 obuffer 11 buffer 2
Allocating device buffer for device 0 obuffer 11 buffer 3
Allocating device buffer for device 0 obuffer 12 buffer 2
Allocating device buffer for device 0 obuffer 12 buffer 3
Allocating device buffer for device 0 obuffer 13 buffer 2
Allocating device buffer for device 0 obuffer 13 buffer 3
Allocating device buffer for device 0 obuffer 14 buffer 2
Allocating device buffer for device 0 obuffer 14 buffer 3
Allocating device buffer for device 0 obuffer 15 buffer 2
Allocating device buffer for device 0 obuffer 15 buffer 3
Allocating device buffer for device 0 obuffer 16 buffer 2
Allocating device buffer for device 0 obuffer 16 buffer 3
Allocating device buffer for device 0 obuffer 17 buffer 2
Allocating device buffer for device 0 obuffer 17 buffer 3
Allocating device buffer for device 0 obuffer 18 buffer 2
Allocating device buffer for device 0 obuffer 18 buffer 3
Allocating device buffer for device 0 obuffer 19 buffer 2
Allocating device buffer for device 0 obuffer 19 buffer 3
Allocating device buffer for device 0 obuffer 20 buffer 2
Allocating device buffer for device 0 obuffer 20 buffer 3
Was able to allocate 21 bbuffers on device 0
Was able to allocate 21 bbuffers
Using 8 CPU cores at 2401 MHz, 1 GPUs of 1536 shaders at 830 MHz
Caldgemm Init complete, setting CPU mask 80
Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2afa3f516010, B=0x2afa41557010, C=0x2afa43568010, (C-A=8430592, (C-B)/w=4104))
Running CPU only DGEMM
DGEMM Run Complete
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.542 System Gflops 63.429
Times:  Kernel               Divide (0,0)        Merge              Copy To             Copy From
        0.0000 (inf Gflops)  0.0000 (-nan GB/s)  0.0000 (inf GB/s)  0.0000 (-nan GB/s)  0.0000 (0.0000 Gb/s)
Uninitializing CALDGEMM
Uninitializing buffers for device 0 context 0
Freeing CAL Host memory, device 0 context 0 buffer 0
Freeing temporary CAL memory, device 0 context 0 buffer 0
Freeing CAL Host memory, device 0 context 0 buffer 1
Freeing temporary CAL memory, device 0 context 0 buffer 1
Freeing CAL Host memory, device 0 context 0 buffer 2
Freeing temporary CAL memory, device 0 context 0 buffer 2
Freeing CAL Host memory, device 0 context 0 buffer 3
Freeing temporary CAL memory, device 0 context 0 buffer 3
Freeing CAL Host memory, device 0 context 0 buffer 4
Freeing CAL GPU memory, device 0 context 0 buffer 0
Freeing CAL GPU memory, device 0 context 0 buffer 1
Freeing CAL GPU memory, device 0 context 0 buffer 2
Freeing CAL GPU memory, device 0 context 0 buffer 3
Freeing CAL GPU memory, device 0 context 0 buffer 4
Freeing CAL GPU memory, device 0 context 0 buffer 5
Freeing CAL GPU memory, device 0 context 0 buffer 6
Freeing CAL GPU memory, device 0 context 0 buffer 7
Freeing CAL GPU memory, device 0 context 0 buffer 8
Freeing CAL GPU memory, device 0 context 0 buffer 9
Freeing CAL GPU memory, device 0 context 0 buffer 10
Freeing CAL GPU memory, device 0 context 0 buffer 11
Freeing CAL GPU memory, device 0 context 0 buffer 12
Uninitializing buffers for device 0 context 1
Freeing CAL Host memory, device 0 context 1 buffer 0
Freeing temporary CAL memory, device 0 context 1 buffer 0
Freeing CAL Host memory, device 0 context 1 buffer 1
Freeing temporary CAL memory, device 0 context 1 buffer 1
Freeing CAL Host memory, device 0 context 1 buffer 2
Freeing temporary CAL memory, device 0 context 1 buffer 2
Freeing CAL Host memory, device 0 context 1 buffer 3
Freeing temporary CAL memory, device 0 context 1 buffer 3
Freeing CAL GPU memory, device 0 context 1 buffer 0
Freeing CAL GPU memory, device 0 context 1 buffer 1
Freeing CAL GPU memory, device 0 context 1 buffer 2
Freeing CAL GPU memory, device 0 context 1 buffer 3
Freeing CAL GPU memory, device 0 context 1 buffer 5
Freeing CAL GPU memory, device 0 context 1 buffer 6
Freeing CAL GPU memory, device 0 context 1 buffer 7
Freeing CAL GPU memory, device 0 context 1 buffer 8
Freeing CAL GPU memory, device 0 context 1 buffer 9
Freeing CAL GPU memory, device 0 context 1 buffer 10
Freeing CAL GPU memory, device 0 context 1 buffer 11
Freeing CAL GPU memory, device 0 context 1 buffer 12
Uninitializing buffers for device 0 context 2
Freeing CAL GPU memory, device 0 context 2 buffer 2
Freeing CAL GPU memory, device 0 context 2 buffer 3
Freeing CAL GPU memory, device 0 context 2 buffer 5
Freeing CAL GPU memory, device 0 context 2 buffer 6
Freeing CAL GPU memory, device 0 context 2 buffer 7
Freeing CAL GPU memory, device 0 context 2 buffer 8
Freeing CAL GPU memory, device 0 context 2 buffer 9
Freeing CAL GPU memory, device 0 context 2 buffer 10
Freeing CAL GPU memory, device 0 context 2 buffer 11
Freeing CAL GPU memory, device 0 context 2 buffer 12
Uninitializing buffers for device 0 context 3
Freeing CAL GPU memory, device 0 context 3 buffer 2
Freeing CAL GPU memory, device 0 context 3 buffer 3
Uninitializing buffers for device 0 context 4
Freeing CAL GPU memory, device 0 context 4 buffer 2
Freeing CAL GPU memory, device 0 context 4 buffer 3
Uninitializing buffers for device 0 context 5
Freeing CAL GPU memory, device 0 context 5 buffer 2
Freeing CAL GPU memory, device 0 context 5 buffer 3
Uninitializing buffers for device 0 context 6
Freeing CAL GPU memory, device 0 context 6 buffer 2
Freeing CAL GPU memory, device 0 context 6 buffer 3
Uninitializing buffers for device 0 context 7
Freeing CAL GPU memory, device 0 context 7 buffer 2
Freeing CAL GPU memory, device 0 context 7 buffer 3
Uninitializing buffers for device 0 context 8
Freeing CAL GPU memory, device 0 context 8 buffer 2
Freeing CAL GPU memory, device 0 context 8 buffer 3
Uninitializing buffers for device 0 context 9
Freeing CAL GPU memory, device 0 context 9 buffer 2
Freeing CAL GPU memory, device 0 context 9 buffer 3
Uninitializing buffers for device 0 context 10
Freeing CAL GPU memory, device 0 context 10 buffer 2
Freeing CAL GPU memory, device 0 context 10 buffer 3
Uninitializing buffers for device 0 context 11
Freeing CAL GPU memory, device 0 context 11 buffer 2
Freeing CAL GPU memory, device 0 context 11 buffer 3
Uninitializing buffers for device 0 context 12
Freeing CAL GPU memory, device 0 context 12 buffer 2
Freeing CAL GPU memory, device 0 context 12 buffer 3
Uninitializing buffers for device 0 context 13
Freeing CAL GPU memory, device 0 context 13 buffer 2
Freeing CAL GPU memory, device 0 context 13 buffer 3
Uninitializing buffers for device 0 context 14
Freeing CAL GPU memory, device 0 context 14 buffer 2
Freeing CAL GPU memory, device 0 context 14 buffer 3
Uninitializing buffers for device 0 context 15
Freeing CAL GPU memory, device 0 context 15 buffer 2
Freeing CAL GPU memory, device 0 context 15 buffer 3
Uninitializing buffers for device 0 context 16
Freeing CAL GPU memory, device 0 context 16 buffer 2
Freeing CAL GPU memory, device 0 context 16 buffer 3
Uninitializing buffers for device 0 context 17
Freeing CAL GPU memory, device 0 context 17 buffer 2
Freeing CAL GPU memory, device 0 context 17 buffer 3
Uninitializing buffers for device 0 context 18
Freeing CAL GPU memory, device 0 context 18 buffer 2
Freeing CAL GPU memory, device 0 context 18 buffer 3
Uninitializing buffers for device 0 context 19
Freeing CAL GPU memory, device 0 context 19 buffer 2
Freeing CAL GPU memory, device 0 context 19 buffer 3
Uninitializing buffers for device 0 context 20
Freeing CAL GPU memory, device 0 context 20 buffer 2
Freeing CAL GPU memory, device 0 context 20 buffer 3
Uninitializing context for device 0
Uninitializing CAL runtime
rolly@rolly-X8DTG-QF:~/caldgemm$
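The debug output reports "1 GPUs of 1536 shaders at 830 MHz", which is enough for a rough upper bound on what a single GPU could reach in double precision. The sketch below rests on two assumptions that are not stated in this thread: each shader retires one fused multiply-add (2 flops) per clock in single precision, and double precision runs at one quarter of that rate on this GPU generation. The function name and the `dp_ratio` parameter are hypothetical.

```python
def peak_dp_gflops(shaders, mhz, gpus=1, dp_ratio=0.25):
    """Theoretical double-precision peak in Gflop/s.

    Assumes 2 single-precision flops per shader per clock (one FMA) and a
    double-precision rate of `dp_ratio` times the single-precision rate.
    """
    single_precision = gpus * shaders * 2 * (mhz / 1000.0)
    return single_precision * dp_ratio

one_gpu = peak_dp_gflops(1536, 830)            # one GPU of an HD6990
all_gpus = peak_dp_gflops(1536, 830, gpus=8)   # four dual-GPU HD6990 cards

print(f"one GPU:    about {one_gpu:.0f} Gflop/s peak")
print(f"eight GPUs: about {all_gpus:.0f} Gflop/s peak")
```

Measured DGEMM kernel rates stay below such a bound, and system rates lower still, so the number is only useful as a ceiling when judging benchmark results.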


Here is the single GPU run on one of these HD6990s.

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1385
Initializing CAL
Cannot use multiple devices without multithreading
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating temporary device buffer for device 0 context 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating temporary device buffer for device 0 context 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating temporary device buffer for device 0 context 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating temporary device buffer for device 0 context 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating temporary device buffer for device 0 context 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating temporary device buffer for device 0 context 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating temporary device buffer for device 0 context 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating temporary device buffer for device 0 context 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
Allocating device buffer for device 0 obuffer 1 buffer 11
Allocating device buffer for device 0 obuffer 1 buffer 12
Allocating device buffer for device 0 obuffer 2 buffer 2
Allocating device buffer for device 0 obuffer 2 buffer 3
Allocating device buffer for device 0 obuffer 2 buffer 5
Allocating device buffer for device 0 obuffer 2 buffer 6
Allocating device buffer for device 0 obuffer 2 buffer 7
Allocating device buffer
for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating 
device buffer for device 0 obuffer 18 buffer 2
Allocating device buffer for device 0 obuffer 18 buffer 3
Allocating device buffer for device 0 obuffer 19 buffer 2
Allocating device buffer for device 0 obuffer 19 buffer 3
Allocating device buffer for device 0 obuffer 20 buffer 2
Allocating device buffer for device 0 obuffer 20 buffer 3
Was able to allocate 21 bbuffers on device 0
Was able to allocate 21 bbuffers
Using 8 CPU cores at 1600 MHz, 1 GPUs of 1536 shaders at 830 MHz
Caldgemm Init complete, setting CPU mask 80
Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad1b34c6010, B=0x2ad1b5507010, C=0x2ad1b7518010, (C-A=8430592, (C-B)/w=4104))
Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024)
Caldgemm Main Thread, setting CPU mask 1
Initiliazing GPU Constant Buffers...0 Done
GPU Curve Ration: 0.70, CPUScale 0.18, GPUScale 1.17
GPURatio automatically set to 0.94
Favoring m direction, 1 blocks
Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0)
Running Preprocessing device = 0 k = 0
Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2ad1b34c6010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0)
Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2ad1b5507010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1)
Copying part of A to GPU (k = 0, m = 0, n = 0)
Starting conversion kernel
Total Kernel Time: 0.0006
Copying part of B to GPU (k = 0, m = 0, n = 0)
Starting conversion kernel
Total Kernel Time: 0.0194
Waiting for event from device 0 obuffer 0...
Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0)
Total Kernel Time: 0.6100
Processing Output (Iteration 1) for device 0 tile 0 (m = 0, n = 0)
Waiting for event from device 0 obuffer 0...
Merging buffer (device 0, obuffer 0, k = 0, main thread)
Main thread unlocking obuffer mutex devuce 0 obuffer 0
Processing Output (Iteration 2) for device 0 tile 1 (m = 1, n = 0)
Waiting for event from device 0 obuffer 1...
Caldgemm Main Thread, setting CPU mask 80
DGEMM Run Complete
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF)
System Time 0.731 System Gflops 47.081
Times:  Kernel                   Divide (1,1)          Merge                 Copy To               Copy From
        0.6100 (56.2969 Gflops)  0.0191 (3.5123 GB/s)  0.0696 (1.9273 GB/s)  0.0301 (2.2296 GB/s)  0.0000 (0.0000 Gb/s)
Uninitializing CALDGEMM
Uninitializing buffers for device 0 context 0
Freeing CAL Host memory, device 0 context 0 buffer 0
Freeing temporary CAL memory, device 0 context 0 buffer 0
Freeing CAL Host memory, device 0 context 0 buffer 1
Freeing temporary CAL memory, device 0 context 0 buffer 1
Freeing CAL Host memory, device 0 context 0 buffer 2
Freeing temporary CAL memory, device 0 context 0 buffer 2
Freeing CAL Host memory, device 0 context 0 buffer 3
Freeing temporary CAL memory, device 0 context 0 buffer 3
Freeing CAL Host memory, device 0 context 0 buffer 4
Freeing CAL Host memory, device 0 context 0 buffer 5
Freeing CAL Host memory, device 0 context 0 buffer 6
Freeing CAL Host memory, device 0 context 0 buffer 7
Freeing CAL Host memory, device 0 context 0 buffer 8
Freeing CAL Host memory, device 0 context 0 buffer 9
Freeing CAL Host memory, device 0 context 0 buffer 10
Freeing CAL Host memory, device 0 context 0 buffer 11
Freeing CAL Host memory, device 0 context 0 buffer 12
Freeing CAL GPU memory, device 0 context 0 buffer 0
Freeing CAL GPU memory, device 0 context 0 buffer 1
Freeing CAL GPU memory, device 0 context 0 buffer 2
Freeing CAL GPU memory, device 0 context 0 buffer 3
Freeing CAL GPU memory, device 0 context 0 buffer 4
Freeing CAL GPU memory, device 0 context 0 buffer 5
Freeing CAL GPU memory, device 0 context 0 buffer 6
Freeing CAL GPU memory, device 0 context 0 buffer 7
Freeing CAL GPU
memory, device 0 context 0 buffer 8 Freeing CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 Freeing temporary CAL memory, device 0 context 1 buffer 0 Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing temporary CAL memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing temporary CAL memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing temporary CAL memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU 
memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 
buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing context for device 0 Uninitializing CAL runtime rolly@rolly-X8DTG-QF:~/caldgemm$
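
As a sanity check on the Gflops figures in the log above, here is a minimal sketch that recomputes them from the matrix sizes and the printed times. It assumes the conventional DGEMM operation count of 2*m*n*k floating-point operations; the helper name dgemm_gflops is made up for this example, and I have not checked which exact formula caldgemm uses internally.

```python
def dgemm_gflops(m, n, k, seconds):
    """Return GFLOP/s for one DGEMM, assuming 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


# Sizes and times copied from the log above (m=4096 k=1024 n=4096).
m, k, n = 4096, 1024, 4096
kernel_gflops = dgemm_gflops(m, n, k, 0.6100)  # "Total Kernel Time: 0.6100"
system_gflops = dgemm_gflops(m, n, k, 0.731)   # "System Time 0.731"

print("kernel: %.1f GFLOP/s, system: %.1f GFLOP/s" % (kernel_gflops, system_gflops))
```

This gives roughly 56.3 and 47.0 GFLOP/s, within a fraction of a percent of the 56.2969 and 47.081 that dgemm_bench prints. The small gap is presumably down to the printed times being rounded, or to a slightly different operation count in the tool.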


Now I run it with -z, for CPU only:

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 
buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name 
o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating 
device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 
2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name 
for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 
context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, 
setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for 
device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 
4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting 
module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for device 2 obuffer 1 buffer 8 Allocating 
device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$


Finally, I tried -z for the GPUs. For me the -z option does not seem to work at all?

By the way, I just ran "git clone git://code.compeng.uni-frankfurt.de/caldgemm". Do I have the latest version of caldgemm?

Thanks!

 

 

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 
buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name 
o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating 
device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 
2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name 
for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 
context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, 
setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for 
device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 
4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting 
module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for device 2 obuffer 1 buffer 8 Allocating 
device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$


Hi rollyng,

I tried to look into this. I plugged three 6970 GPUs into a node, but I cannot reproduce the issue you see.

The log you posted tells me that the AMD runtime is unable to allocate host memory, i.e. I issue a malloc call for a page-locked buffer but get an error back.

Could you please update to the current git revision or apply the attached patch? The debug message will then include the error code returned by the API, which is needed to analyze this further.

Since you said your system has only 4GB of memory, you might be running out of page-locked memory.

You can try using two GPUs and see whether that works: ./dgemm_bench -z -v -d -Y 2

 

Regards

--- a/caldgemm.cpp
+++ b/caldgemm.cpp
@@ -3383,7 +3383,7 @@ int caldgemm::SetupData(CALmodule *module, CALresource* &_Res, BufferProperties*
             calResFree(_Res);
         }
-    if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory\n");
+    if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory (Error code %d)\n", r);
     else if (Config->Debug) fprintf(STD_OUT, "No more memory available for bbuffers\n");
     return(1);
 }


Hi, I recompiled the latest version after a git pull.

With -c -z -d it now gives this output:

 

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1385
Initializing CAL
Was able to allocate 21 bbuffers
Waiting for cblas slave to start
Cblas helper thread started
Cblas thread Thread, setting CPU mask 80
Waiting for linpack slave to start
Using 8 CPU cores at 1600 MHz, 0 GPUs of 0 shaders at 0 MHz
Caldgemm Init complete, setting CPU mask 80
Linpack helper thread started
Linpack Thread, setting CPU mask 8
Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2aedb2bae010, B=0x2aedb4bef010, C=0x2aedb6c00010, (C-A=8430592, (C-B)/w=4104))
Running CPU only DGEMM
DGEMM Run Complete
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.558 System Gflops 61.684
Uninitializing CALDGEMM
Uninitializing CAL runtime
Trying to terminate linpack slave
Waiting for linpack slave to terminate
Waiting for merge threads to terminate
linpack slave terminating
rolly@rolly-X8DTG-QF:~/caldgemm$


With -g -z -d it still ends with an error:

 

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b94b3c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b94b4c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b94b5c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b94b6c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3c43120, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for 
device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 
context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b94b7e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b94b8e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b94b9e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b94bae96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 
started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 
Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b94bc097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b94bd097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b94be097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b94bf097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3c70ff0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device 
buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for 
device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b94c0298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b94c1298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b94c2298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b94c3298000, Width = 1024, Height = 1024, components = 2, 
type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 
obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Clearing Memory at 0x2b94c4499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Clearing Memory at 0x2b94c5499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Clearing Memory at 0x2b94c6499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Clearing Memory at 0x2b94c7499000, Width = 1024, 
Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Clearing Memory at 0x3c9e810, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer 
name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Clearing Memory at 0x2b94c869a000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 1 buffer 0 
Allocating Host buffer for device 2 obuffer 1 buffer 1
Clearing Memory at 0x2b94c969a000, Width = 1024, Height = 1024, components = 2, type=double
Allocating device buffer for device 2 obuffer 1 buffer 1
Allocating Host buffer for device 2 obuffer 1 buffer 2
Error 'Operational error' while allocattion of remote memory
Error initializing CALDGEMM
rolly@rolly-X8DTG-QF:~/caldgemm$
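For anyone reading the log above, here is a rough tally of how much host ("remote") memory had been allocated when the error appeared. It is only a sketch: it assumes `components = 2, type=double` means two 8-byte doubles per element, which matches the 0x1000000 step between consecutive "Clearing Memory at" addresses. The variable names are made up for illustration and are not taken from caldgemm.

```python
# Size of one large host buffer, taken from the logged dimensions
# (assumption: 2 components of 8-byte doubles per element).
WIDTH, HEIGHT, COMPONENTS = 1024, 1024, 2
BYTES_PER_DOUBLE = 8
buffer_bytes = WIDTH * HEIGHT * COMPONENTS * BYTES_PER_DOUBLE

# Cross-check against two consecutive "Clearing Memory at" addresses in the log.
stride = 0x2b94b4c95000 - 0x2b94b3c95000

# Large host buffers allocated successfully before the error:
#   devices 0 and 1: obuffer 0 and obuffer 1, buffers 0-3 each -> 8 per device
#   device 2: obuffer 0 buffers 0-3, obuffer 1 buffers 0-1     -> 6
allocated = 8 + 8 + 6

MIB = 1024 * 1024
total_mib = allocated * buffer_bytes // MIB

print("one buffer:", buffer_bytes // MIB, "MiB")
print("stride matches buffer size:", stride == buffer_bytes)
print("allocated before failure:", allocated, "buffers,", total_mib, "MiB")
```

By this count the failing request was the 23rd 16 MiB host buffer, so the error may be tied to the total amount of remote memory rather than to a particular GPU. The log alone does not confirm that.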


With 2 GPUs (-Y 2) it finishes! Does that mean the current version of caldgemm cannot run on 4x HD6990s (8 GPUs)? Thanks!

 

rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z -v -d -Y 2 Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 2 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b40865a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b40875a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b40885a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b40895a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3105020, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for 
device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 
context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b408a7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b408b7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b408c7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b408d7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 
started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 
Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b408e9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b408f9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b40909aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b40919aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3132ef0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device 
buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for 
device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b4092bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b4093bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b4094bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b4095bab000, Width = 1024, Height = 1024, components = 2, 
type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 
obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Was able to allocate 21 bbuffers Waiting for linpack slave to start Using 8 CPU cores at 1600 MHz, 2 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Linpack helper thread started Linpack Thread, setting CPU mask 20 Initializing Data... 
...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b4096fad010, B=0x2b4098fee010, C=0x2b409afff010, (C-A=8430592, (C-B)/w=4104)) Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024) Caldgemm Main Thread, setting CPU mask 1 Initiliazing GPU Constant Buffers...01 Done GPU Curve Ration: 0.70, CPUScale 0.12, GPUScale 2.34 GPURatio automatically set to 0.98 Favoring m direction, 1 blocks Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0) Running Preprocessing device = 0 k = 0 Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2b4096fad010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0) Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2b4098fee010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1) Copying part of A to GPU (k = 0, m = 0, n = 0) Copying part of B to GPU (k = 0, m = 0, n = 0) Locking obuffer mutex 0/0 Waiting for event from device 0 obuffer 0... Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0) Total Kernel Time: 0.5996 Processing Output (Iteration 2) for device 0 tile 0 (m = 0, n = 0) Waiting for event from device 0 obuffer 0... Unlocking outputthread mutex 0 to process device 0 obuffer 0 Processing Output (Iteration 3) for device 1 tile 1 (m = 1, n = 0) Waiting for event from device 1 obuffer 0... Processing Output (Iteration 4) for device 0 tile 2 (m = 2, n = 0) Waiting for event from device 0 obuffer 1... 
Waiting to finish merge process for device 0 obuffer 0 Slave thread 0 (device 0) starting merge process for obuffer 0 (k = 0) Merge time: 0.080 Unlocking mutex device 0 obuffer 0 (Slavethread 0) Waiting to finish merge process for device 1 obuffer 0 Waiting to finish merge process for device 1 obuffer 1 Waiting to finish merge process for device 0 obuffer 2 Waiting to finish merge process for device 1 obuffer 2 Caldgemm Main Thread, setting CPU mask 80 DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.733 System Gflops 46.938 Times: Kernel Divide (1,1) Merge Copy To Copy From 0.5996 (57.2786 Gflops) 0.0213 (3.1549 GB/s) 0.0803 (0.0000 GB/s) 0.0309 (2.1699 GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL Host memory, device 0 context 0 buffer 5 Freeing CAL Host memory, device 0 context 0 buffer 6 Freeing CAL Host memory, device 0 context 0 buffer 7 Freeing CAL Host memory, device 0 context 0 buffer 8 Freeing CAL Host memory, device 0 context 0 buffer 9 Freeing CAL Host memory, device 0 context 0 buffer 10 Freeing CAL Host memory, device 0 context 0 buffer 11 Freeing CAL Host memory, device 0 context 0 buffer 12 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing CAL GPU memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU memory, device 0 context 0 buffer 8 Freeing 
CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, 
device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 
2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing buffers for device 1 context 0 Freeing CAL Host memory, device 1 context 0 buffer 0 Freeing CAL Host memory, device 1 context 0 buffer 1 Freeing CAL Host memory, device 1 context 0 buffer 2 Freeing CAL Host memory, device 1 context 0 buffer 3 Freeing CAL Host memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 0 Freeing CAL GPU memory, device 1 context 0 buffer 1 Freeing CAL GPU memory, device 1 context 0 buffer 2 Freeing CAL GPU memory, device 1 context 0 buffer 3 Freeing CAL GPU memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 5 Freeing CAL GPU memory, device 1 context 0 buffer 6 Freeing CAL GPU memory, device 1 context 0 buffer 7 Freeing CAL GPU memory, device 1 context 0 buffer 8 Freeing CAL GPU memory, device 1 context 0 buffer 9 Freeing CAL GPU memory, device 1 context 0 buffer 10 Freeing CAL GPU memory, device 1 context 0 buffer 11 Freeing CAL GPU memory, device 1 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 1 context 1 Freeing CAL Host memory, device 1 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 1 context 1 buffer 1 Freeing CAL Host memory, device 1 context 1 buffer 2 Freeing CAL Host memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 0 Freeing CAL GPU memory, device 1 context 1 buffer 1 Freeing CAL GPU memory, device 1 context 1 buffer 2 Freeing CAL GPU 
memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 5 Freeing CAL GPU memory, device 1 context 1 buffer 6 Freeing CAL GPU memory, device 1 context 1 buffer 7 Freeing CAL GPU memory, device 1 context 1 buffer 8 Freeing CAL GPU memory, device 1 context 1 buffer 9 Freeing CAL GPU memory, device 1 context 1 buffer 10 Freeing CAL GPU memory, device 1 context 1 buffer 11 Freeing CAL GPU memory, device 1 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 1 context 2 Freeing CAL GPU memory, device 1 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 1 context 2 buffer 3 Freeing CAL GPU memory, device 1 context 2 buffer 5 Freeing CAL GPU memory, device 1 context 2 buffer 6 Freeing CAL GPU memory, device 1 context 2 buffer 7 Freeing CAL GPU memory, device 1 context 2 buffer 8 Freeing CAL GPU memory, device 1 context 2 buffer 9 Freeing CAL GPU memory, device 1 context 2 buffer 10 Freeing CAL GPU memory, device 1 context 2 buffer 11 Freeing CAL GPU memory, device 1 context 2 buffer 12 Uninitializing buffers for device 1 context 3 Freeing CAL GPU memory, device 1 context 3 buffer 2 Freeing CAL GPU memory, device 1 context 3 buffer 3 Uninitializing buffers for device 1 context 4 Freeing CAL GPU memory, device 1 context 4 buffer 2 Freeing CAL GPU memory, device 1 context 4 buffer 3 Uninitializing buffers for device 1 context 5 Freeing CAL GPU memory, device 1 context 5 buffer 2 Freeing CAL GPU memory, device 1 context 5 buffer 3 Uninitializing buffers for device 1 context 6 Freeing CAL GPU memory, device 1 context 6 buffer 2 Freeing CAL GPU memory, device 1 context 6 buffer 3 Uninitializing buffers for device 1 context 7 Freeing CAL GPU memory, device 1 context 7 buffer 2 Freeing CAL GPU memory, device 1 context 7 buffer 3 Uninitializing buffers for device 1 context 8 Freeing CAL GPU memory, device 1 context 8 buffer 2 Freeing CAL GPU memory, device 1 context 8 buffer 3 Uninitializing 
buffers for device 1 context 9 Freeing CAL GPU memory, device 1 context 9 buffer 2 Freeing CAL GPU memory, device 1 context 9 buffer 3 Uninitializing buffers for device 1 context 10 Freeing CAL GPU memory, device 1 context 10 buffer 2 Freeing CAL GPU memory, device 1 context 10 buffer 3 Uninitializing buffers for device 1 context 11 Freeing CAL GPU memory, device 1 context 11 buffer 2 Freeing CAL GPU memory, device 1 context 11 buffer 3 Uninitializing buffers for device 1 context 12 Freeing CAL GPU memory, device 1 context 12 buffer 2 Freeing CAL GPU memory, device 1 context 12 buffer 3 Uninitializing buffers for device 1 context 13 Freeing CAL GPU memory, device 1 context 13 buffer 2 Freeing CAL GPU memory, device 1 context 13 buffer 3 Uninitializing buffers for device 1 context 14 Freeing CAL GPU memory, device 1 context 14 buffer 2 Freeing CAL GPU memory, device 1 context 14 buffer 3 Uninitializing buffers for device 1 context 15 Freeing CAL GPU memory, device 1 context 15 buffer 2 Freeing CAL GPU memory, device 1 context 15 buffer 3 Uninitializing buffers for device 1 context 16 Freeing CAL GPU memory, device 1 context 16 buffer 2 Freeing CAL GPU memory, device 1 context 16 buffer 3 Uninitializing buffers for device 1 context 17 Freeing CAL GPU memory, device 1 context 17 buffer 2 Freeing CAL GPU memory, device 1 context 17 buffer 3 Uninitializing buffers for device 1 context 18 Freeing CAL GPU memory, device 1 context 18 buffer 2 Freeing CAL GPU memory, device 1 context 18 buffer 3 Uninitializing buffers for device 1 context 19 Freeing CAL GPU memory, device 1 context 19 buffer 2 Freeing CAL GPU memory, device 1 context 19 buffer 3 Uninitializing buffers for device 1 context 20 Freeing CAL GPU memory, device 1 context 20 buffer 2 Freeing CAL GPU memory, device 1 context 20 buffer 3 Uninitializing context for device 0 Uninitializing context for device 1 Uninitializing CAL runtime Trying to terminate linpack slave Waiting for linpack slave to terminate Waiting 
for merge threads to terminate linpack slave terminating rolly@rolly-X8DTG-QF:~/caldgemm$

0 Likes

I think caldgemm currently requires 2 to 3 CPU-Cores per GPU (would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.

This is to some extent owed to the fact that we currently use Magny-Cours CPUs -> plenty of cores.

0 Likes

Is there any place where I can see a performance comparison between caldgemm and clAmdBlasDgemm?

0 Likes

Originally posted by: laobrasuca Is there any place where I can see a performance comparison between caldgemm and clAmdBlasDgemm?

 

Hi, I did some tests with acmlgpu1.1.2. When I run Info.exe, it shows:

 

rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./Info.exe CPUID: function (0) Vendor: GenuineIntel function (1) Family-Model-Stepping: 6-44-2 Feature flags (EDX): BFEBFBFFh Feature flags (ECX): 009EE3FDh MMX (EDX bit 13): yes SSE1 (EDX bit 25): yes SSE2 (EDX bit 26): yes SSE3 (ECX bit 0): yes SSSE3 (ECX bit 9): yes SSE4.1 (ECX bit 19): yes SSE4.2 (ECX bit 20): yes AVX (ECX bit 28): no function (8000_0004) Processor Brand: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz > uname -a Linux rolly-X8DTG-QF 2.6.35-28-generic #50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 x86_64 GNU/Linux > powersave -c sh: powersave: not found CAL RT version: 1.4.1385 CAL CL version: 1.4.1385 gpu0: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu1: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements 
calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu2: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu3: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu4: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global 
data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu5: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu6: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu7: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 
megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported GPUs found: 8

0 Likes

However, when I run time_dgemm.exe, it looks like I am hitting the same wall: it can only make use of 3 out of 8 GPUs, even though I have 32GB of host memory?

rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./time_dgemm.exe
Matrix  Time in       Performance
Size    Seconds       in Megaflops
------  ------------  ------------
ERROR: gpu3 - unable to allocate minimum cached system (GART) memory
gpu3            Total    Available    Last Request
Local:        2048 MB       196 MB    1845493760 (1760 MB)  ok
Remote (NC):  1787 MB      1720 MB             0 (   0 MB)  FAILED
Remote (C):    508 MB       463 MB       5242880 (   5 MB)  FAILED
ERROR: gpu4 - unable to allocate minimum cached system (GART) memory
gpu4            Total    Available    Last Request
Local:        2048 MB       196 MB    1845493760 (1760 MB)  ok
Remote (NC):  1787 MB      1720 MB             0 (   0 MB)  FAILED
Remote (C):    508 MB       463 MB       5242880 (   5 MB)  FAILED
ERROR: gpu5 - unable to allocate minimum cached system (GART) memory
gpu5            Total    Available    Last Request
Local:        2048 MB       196 MB    1845493760 (1760 MB)  ok
Remote (NC):  1787 MB      1720 MB             0 (   0 MB)  FAILED
Remote (C):    508 MB       463 MB       5242880 (   5 MB)  FAILED
ERROR: gpu6 - unable to allocate minimum cached system (GART) memory
gpu6            Total    Available    Last Request
Local:        2048 MB       196 MB    1845493760 (1760 MB)  ok
Remote (NC):  1787 MB      1720 MB             0 (   0 MB)  FAILED
Remote (C):    508 MB       463 MB       5242880 (   5 MB)  FAILED
ERROR: gpu7 - unable to allocate minimum cached system (GART) memory
gpu7            Total    Available    Last Request
Local:        2048 MB       196 MB    1845493760 (1760 MB)  ok
Remote (NC):  1787 MB      1728 MB             0 (   0 MB)  FAILED
Remote (C):    508 MB       472 MB       5242880 (   5 MB)  FAILED
WARNING: 5 out of 8 GPUs failed to initialize; proceeding with other(s).
   400      2.250818           56
   600      0.045632         9467
   800      0.049524        20676
  1000      0.068471        29209
  1200      0.086970        39737
  1400      0.109446        50143
  1600      0.141187        58022
  1800      0.177105        65859
  2000      0.206845        77352
  2200      0.234911        90655
  2400      0.259695       106463
  2600      0.290227       121118
  2800      0.331030       132628
  3000      0.377459       143061
  3200      0.361680       181198
  3400      0.395542       198735
  3600      0.431999       216000
  3800      0.467440       234776
  4000      0.520821       245765
  4200      0.566723       261460
  4400      0.618775       275331
  4600      0.671366       289963
  4800      0.736608       300273
  5000      0.801185       312037
  5200      0.888577       316479
  5400      0.937255       336011
  5600      1.007444       348636
  5800      1.065766       366144
  6000      1.155561       373844
  6200      1.212021       393273
  6400      1.287458       407227
  6600      1.331005       431998
  6800      1.375383       457228
  7000      1.467186       467561
  7200      1.515511       492570
  7400      1.626788       498189
  7600      1.744087       503387
  7800      1.843866       514735
  8000      1.918230       533825

0 Likes

Hi, I did some tests with acmlgpu1.1.2. When I run Info.exe, it shows:

 

Hi there, what's the difference between ACML-GPU and clAmdBlas? Is it that one is built on CAL and the other on OpenCL? And what about performance (at least for a single-GPU setup)?

0 Likes

Hi, I think you are right. clAmdBlas needs OpenCL, but I find that there is only an sgemm example for clAmdBlas, so I may not be able to compare the dgemm performance of the two libraries?

0 Likes

Yes, there's only the sgemm example (but be aware that this example has a typo: matrix A is written to bufB; see this post http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150952&enterthread=y). Still, sgemm and dgemm perform exactly the same mathematical operations except for the data type (double instead of float), so I believe you can use the same example, changing only the types, as long as the card supports double-precision computation (like the 6790 of yours).
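To illustrate the point that the two routines differ only in element type, here is a minimal pure-Python sketch. It is not the clAmdBlas API: the function name `gemm` and its argument order are hypothetical, and the standard-library `array` typecodes `'f'` and `'d'` stand in for float and double storage.

```python
from array import array

def gemm(typecode, m, n, k, alpha, a, b, beta, c):
    """Row-major C = alpha*A*B + beta*C.

    typecode 'f' stores elements as 32-bit floats, 'd' as 64-bit doubles.
    Note: Python does the arithmetic in double precision either way, so
    only the storage rounding differs here (a real sgemm also accumulates
    in single precision).
    """
    A = array(typecode, a)
    B = array(typecode, b)
    C = array(typecode, c)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i * k + p] * B[p * n + j]
            C[i * n + j] = alpha * acc + beta * C[i * n + j]
    return C

# The same call, with only the element type changed.
single = gemm('f', 2, 2, 2, 1.0, [1, 2, 3, 4], [5, 6, 7, 8], 0.0, [0, 0, 0, 0])
double = gemm('d', 2, 2, 2, 1.0, [1, 2, 3, 4], [5, 6, 7, 8], 0.0, [0, 0, 0, 0])
print(list(single), list(double))
```

In the actual OpenCL example the equivalent change would be the host buffer types and the library call; the loop above only shows that the mathematical operation stays the same.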

If you compare performance results with caldgemm, please let us know.

0 Likes

OK, let's look at acmlgpu-1.1.2 first, for both dgemm and sgemm on a single HD6970.

rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_dgemm.exe
Matrix  Time in       Performance
Size    Seconds       in Megaflops
------  ------------  ------------
   400      0.758880          168
   600      0.030139        14333
   800      0.035727        28662
  1000      0.049771        40184
  1200      0.067966        50849
  1400      0.068995        79541
  1600      0.086494        94711
  1800      0.111565       104549
  2000      0.134369       119075
  2200      0.161584       131795
  2400      0.184988       149458
  2600      0.214080       164200
  2800      0.241470       181819
  3000      0.288899       186916
  3200      0.333995       196218
  3400      0.415724       189087
  3600      0.494598       188662
  3800      0.565289       194137
  4000      0.662003       193352
  4200      0.762732       194270
  4400      0.829545       205375
  4600      0.866307       224714
  4800      0.953990       231851
  5000      1.100728       227122
  5200      1.219832       230536
  5400      1.368855       230066
  5600      1.569290       223815
  5800      1.789924       218011
  6000      2.022892       213555
  6200      1.990262       239494
  6400      2.092950       250501
  6600      2.339395       245786
  6800      2.555408       246091
  7000      2.725500       251696
  7200      2.983605       250199
  7400      3.627176       223437
  7600      3.460915       253676
  7800      3.715710       255430
  8000      4.046330       253068


rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_sgemm.exe
Matrix  Time in       Performance
Size    Seconds       in Megaflops
------  ------------  ------------
   400      0.711834          179
   600      0.021887        19738
   800      0.029878        34273
  1000      0.030939        64643
  1200      0.035729        96728
  1400      0.042776       128295
  1600      0.050936       160829
  1800      0.061199       190590
  2000      0.071055       225176
  2200      0.083770       254220
  2400      0.092456       299038
  2600      0.108912       322755
  2800      0.122735       357714
  3000      0.132517       407495
  3200      0.158700       412954
  3400      0.190118       413468
  3600      0.211197       441824
  3800      0.244340       449143
  4000      0.273162       468585
  4200      0.352204       420711
  4400      0.388507       438519
  4600      0.404213       481607
  4800      0.450925       490511
  5000      0.492675       507434
  5200      0.560158       502030
  5400      0.652638       482546
  5600      0.721320       486929
  5800      0.721891       540558
  6000      0.828848       521205
  6200      0.966798       493025
  6400      0.994621       527123
  6600      1.134357       506888
  6800      1.240944       506762
  7000      1.281981       535109
  7200      1.296301       575866
  7400      1.446126       560427
  7600      1.502031       584510
  7800      1.653382       574038
  8000      1.928930       530864
rolly@rolly-p5q-pro:~/GPGPUexamples$
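The Megaflops column in these tables appears to follow the usual 2*N^3 operation count for a square N x N multiply divided by the elapsed time. That formula is an assumption inferred from the numbers, not something the tool states. A small sketch to check a few rows (the function name `gemm_mflops` is made up for illustration):

```python
def gemm_mflops(n, seconds):
    """Megaflops for an n x n x n matrix multiply, counting 2*n**3 operations."""
    return 2.0 * n ** 3 / seconds / 1e6

# (size, seconds, reported Megaflops) copied from the output above:
# three time_dgemm.exe rows, then one time_sgemm.exe row.
rows = [
    (4000, 0.662003, 193352),
    (7800, 3.715710, 255430),
    (8000, 4.046330, 253068),
    (7600, 1.502031, 584510),
]

for n, t, reported in rows:
    computed = gemm_mflops(n, t)
    # Print computed vs. reported, plus the computed value in Gflops.
    print(n, computed, reported, computed / 1000.0)
```

Dividing by 1000 gives Gflops, which makes these rows directly comparable with the "System Gflops" figure that caldgemm prints.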

0 Likes

Now caldgemm:

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 4096 -n 4096
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b58989a8010, B=0x2b589a9e9010, C=0x2b589c9fa010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602

rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b1edc693010, B=0x2b1ee0714010, C=0x2b1ee4725010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.581 System Gflops 236.899


What I can conclude so far:

(1) Only one HD6990 is used by these libraries, no matter how many more are installed?

(2) acml-gpu seems to have better performance?

Thanks for reading!


(1) You mean 1 HD6970, right? That seems to be the case, since the performance is roughly 1/5 of the nominal TERAFLOP number for the better of the two libraries. Can anyone confirm this?

(2) There's something I didn't understand. Is the matrix size for acmlgpu-1.1.2 equal to 8000x8000? caldgemm computes the product for 8192x1024, which is a huge difference. Since the matrix size has a visible influence on the Gflops, we can hardly compare these results. Could you re-run them with comparable matrix sizes? I don't know, however, whether the algorithm is optimized for square matrices or not.

And, if you have a little more time, could you test clAmdBlasDgemm (and maybe clAmdBlasSgemm, to compare with time_sgemm.exe)?
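To put numbers on the size difference, here is a rough sketch comparing the amount of work in the two runs. It assumes the usual 2*m*n*k operation count for DGEMM and reads the last row of the table above as a square 8000 x 8000 x 8000 product with the time in seconds in the second column and MFLOP/s in the third. The column meanings are an assumption, since the headers are not shown here.

```python
def dgemm_flops(m, n, k):
    """Floating-point operations for C = alpha*A*B + beta*C, counted as 2*m*n*k."""
    return 2 * m * n * k


# caldgemm run from the log: m = n = 8192, k = 1024.
rect = dgemm_flops(8192, 8192, 1024)
# Square run, assuming the last table row means 8000 x 8000 x 8000.
square = dgemm_flops(8000, 8000, 8000)

print(f"caldgemm run: {rect / 1e9:.1f} GFLOP")
print(f"square run:   {square / 1e9:.1f} GFLOP")
print(f"ratio:        {square / rect:.2f}x")

# Cross-check of the table's last row under that assumption:
# N = 8000, 1.928930 s, 530864 in the third column (read here as MFLOP/s).
mflops = square / 1.928930 / 1e6
print(f"implied MFLOP/s for N=8000: {mflops:.0f}")
```

Under these assumptions the square run does several times more arithmetic per call than the caldgemm run, so the two Gflops figures come from quite different workloads.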


Originally posted by: Marix I think caldgemm currently requires 2 to 3 CPU-Cores per GPU (would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.

 

This is to some extent due to the fact that we currently use Magny-Cours CPUs -> plenty of cores.

 

Hi Marix, thanks for the clarification. I have 2 E5620s in my system with Hyper-Threading enabled, so the system monitor shows 16 CPUs and I should have 2 CPUs per Cayman GPU. Is this still insufficient for caldgemm's requirement? I believe the maximum number of CPU cores per node is 24 with Intel 1366-pin processors, so that makes 3 CPUs per Cayman...
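The core budget discussed here can be sketched as simple arithmetic. The hardware counts come from the posts above; the 2 to 3 cores per GPU requirement is Marix's unverified estimate. Whether Hyper-Threading siblings count toward that requirement is an open question, so both counts are shown.

```python
# Hardware figures taken from the posts above; the 2-3 cores-per-GPU figure
# is Marix's estimate, not a checked value.
cards = 4                 # HD6990 boards
gpus_per_card = 2         # each HD6990 carries two Cayman GPUs
sockets = 2               # two Xeon E5620
cores_per_socket = 4      # E5620: 4 physical cores
threads_per_core = 2      # Hyper-Threading enabled

gpus = cards * gpus_per_card
physical = sockets * cores_per_socket
logical = physical * threads_per_core

print(f"{gpus} GPUs, {physical} physical cores, {logical} logical CPUs")
print(f"logical CPUs per GPU:   {logical / gpus:.1f}")
print(f"physical cores per GPU: {physical / gpus:.1f}")

# How many GPUs the physical cores would cover at 2 or 3 cores per GPU.
for need in (2, 3):
    print(f"at {need} cores/GPU: {physical // need} GPUs covered")
```

If only physical cores count, 8 cores at 2 per GPU cover 4 GPUs, i.e. two HD6990s, which would be consistent with Marix's remark.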


I am having similar memory issues.  I am running an HD5870.  I tried following the instructions given on the Wiki.

Here is where I deviated from the instructions:

Instead of using git, I just downloaded/unzipped the latest version from the Files Page

-march=native didn't work, so I just deleted it from the makefile so that it would compile without errors.

I did not use the binary patch for the Catalyst driver.

 

Here is my output from running ./dgemm_bench -g -z -v -d, as you instructed:

./dgemm_bench -g -z -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1016
Initializing CAL
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Merger Thread 0 started
Merge Thread 0, setting CPU mask 2
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM

 

Thanks
