Hi,
Sorry, I could not find a caldgemm forum, so I am posting here in the hope that its developers read this message.
I have 4 HD6990s and I would really like to see how they perform in GFLOPS, which is how I came across this tool: http://code.compeng.uni-frankfurt.de/projects/caldgemm/wiki
I have Ubuntu 10.10 x86_64 + AMD driver 11.5 + SDK 2.4 and I followed the instructions on the wiki. However, I had to use "make TARGET=NEHALEM NO_MEMPOLICY=1 -j" to compile GotoBLAS2, since I have two E5620 CPUs.
caldgemm compiles, and the build output looks fine.
But when I run it, it reports errors and then hangs idle. Can anyone help? Thanks!
rolly@rolly-X8DTG-QF:~/caldgemm$ make
g++ -c caldgemm.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
caldgemm.cpp: In function ‘void* divide_wrapper(void*)’:
caldgemm.cpp:1661: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘int’
caldgemm.cpp: In member function ‘int caldgemm::RunCALDGEMM(double*, double*, double*, double, double, size_t, size_t, size_t, size_t, size_t, size_t, CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int)’:
caldgemm.cpp:2260: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
caldgemm.cpp:2269: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
caldgemm.cpp:2403: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
caldgemm.cpp:2421: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
caldgemm.cpp: In member function ‘int caldgemm::DGEMM_prepare(size_t, int, unsigned int)’:
caldgemm.cpp:2738: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
caldgemm.cpp:2761: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
g++ -c benchmark.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
g++ -o dgemm_bench caldgemm.o benchmark.o -lpthread -ldl -L/usr/X11R6/lib -laticalrt -laticalcl -lgfortran ../GotoBLAS2/libgoto2.a
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c
Use -? for help
Cannot use multiple devices without multithreading
Segmentation fault
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
Use -? for help
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc AERROR locking Pages ...alloc BERROR locking Pages ...alloc CERROR locking Pages
Memory allocation error allocating matrices
There is currently no forum for caldgemm; however, there is a (low-volume) mailing list at https://compeng.uni-frankfurt.de/mailman/listinfo/caldgemm .
What I can see from your output is the following:
Hi Marix,
If I run
rolly@rolly-X8DTG-QF:~$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
So I made some changes to /etc/security/limits.conf, following
http://www.akadia.com/services/ora_enable_core.html
Now I can set ulimit -l unlimited, and the output looks like this:
rolly@rolly-X8DTG-QF:~/caldgemm$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
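The "ERROR locking Pages" messages from the first run line up with the 64 kB "max locked memory" limit shown in the first ulimit listing. Below is a minimal Python sketch of that comparison. It assumes caldgemm pins its host matrices with an mlock-style call that is subject to RLIMIT_MEMLOCK; the thread does not confirm this, and the helper names are hypothetical.

```python
import resource

def memlock_limit_bytes():
    """Return the (soft, hard) RLIMIT_MEMLOCK values of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fits_in_memlock(nbytes, soft_limit):
    """True if nbytes could be locked under the given soft limit.

    resource.RLIM_INFINITY means "unlimited" and must be checked first,
    because it is not a meaningful byte count.
    """
    return soft_limit == resource.RLIM_INFINITY or nbytes <= soft_limit

# Matrix C of the default run: 4096 x 4096 doubles, padding ignored.
c_bytes = 4096 * 4096 * 8

# "ulimit -l" reports kbytes, so 64 corresponds to 64 * 1024 bytes.
print("fits under 64 kB limit:  ", fits_in_memlock(c_bytes, 64 * 1024))
print("fits under unlimited:    ", fits_in_memlock(c_bytes, resource.RLIM_INFINITY))
print("current (soft, hard):    ", memlock_limit_bytes())
```

Under that assumption, even the smallest benchmark matrix is far larger than 64 kB, which would explain why the run only gets past the allocation step after the limit is raised.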
Now running the benchmark:
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b2392afc010, B=0x2b2394b3d010, C=0x2b2396b4e010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.656 System Gflops 52.459
But with the -z option, it still fails:
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
Use -? for help
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM
Any hint on the last error? Thank you!
Hi Marix,
A further update on my problem: I think it is due to a multiple-GPU issue. I did the same on another system with the same software configuration but just a single HD6970, and it now produces reasonable results.
Is it correct that the system performance is just 164 GFLOPS versus a kernel performance of 465 GFLOPS for a single HD6970?
For my 4x HD6990s, the -g parameter does not work at all...
Thanks!
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -c
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ab9dea74010, B=0x2ab9e0ab5010, C=0x2ab9e2ac6010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 1.652 System Gflops 20.822
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad9ccb3c010, B=0x2ad9ceb7d010, C=0x2ad9d0b8e010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.418
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -v
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad235f2a010, B=0x2ad237f6b010, C=0x2ad239f7c010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.892
Times: Kernel Divide (1,1) Merge Copy To Copy From
0.0737 (465.7270 Gflops) 0.0296 (2.2666 GB/s) 0.0934 (1.4375 GB/s) 0.0128 (5.2401 GB/s) 0.0000 (0.0000 Gb/s)
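As a rough cross-check of the 164 vs 465 GFLOPS question, the sketch below recomputes both figures from the printed times. It assumes the benchmark counts the usual 2*m*n*k floating-point operations for a DGEMM; that is an assumption about caldgemm, although it reproduces the reported numbers to within the rounding of the printed times. The function name is hypothetical.

```python
def dgemm_gflops(m, n, k, seconds):
    """GFLOPS for C = alpha*A*B + beta*C, counting 2*m*n*k operations (assumed)."""
    return 2.0 * m * n * k / seconds / 1e9

# Values taken from the "-g -v" run above (m = n = 4096, k = 1024).
system = dgemm_gflops(4096, 4096, 1024, 0.210)   # whole run ("System Time")
kernel = dgemm_gflops(4096, 4096, 1024, 0.0737)  # GPU kernel only ("Kernel")

print(f"system: {system:.1f} GFLOPS")
print(f"kernel: {kernel:.1f} GFLOPS")
print(f"fraction of run spent in the kernel: {0.0737 / 0.210:.2f}")
```

Both figures use the same operation count; only the time in the denominator differs. On this reading, the gap between them is the time spent outside the kernel, which the Times line attributes to the divide, merge, and copy steps.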
Regarding the low system performance: there is currently a known performance issue with all HD6000-series devices. It can be tuned around, and a new version with proper workarounds is in the queue. However, the copy speeds actually look quite good in your case. Your matrix size is rather small, though. While you should get quite some performance at that size, it would be interesting to see what you can reach at 20k or even 40k for m and n (k may stay at 1024).
Hi Marix, thanks for your info. I reran the test, this time on the single HD6970 with 4 GB of host memory, so I can only run up to m=n=16384.
Please have a look at the output; the best I get is 212 GFLOPS. Thank you.
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ae9c5ae4010, B=0x2ae9c7b25010, C=0x2ae9c9b36010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.328 System Gflops 104.980
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2af03c0d3010, B=0x2af040154010, C=0x2af044165010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 1.003 System Gflops 137.169
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 16384 -n 16384
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b02a29f4010, B=0x2b02aaaf5010, C=0x2b02b2b06010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 3.640 System Gflops 151.174
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 32768 -n 32768
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 8192 -n 8192
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b2177e92010, B=0x2b217bf13010, C=0x2b217ff24010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.748 System Gflops 184.007
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 16384 -n 16384
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2ba071595010, B=0x2ba079696010, C=0x2ba0816a7010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.587 System Gflops 212.735
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 32768 -n 32768
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -v -m 16384 -n 16384
Use -? for help
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b344bd96010, B=0x2b3453e97010, C=0x2b345bea8010, (C-A=33694720, (C-B)/w=16392))
Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.949 System Gflops 186.577
Times: Kernel Divide (4,4) Merge Copy To Copy From
1.2474 (440.5147 Gflops) 0.2862 (0.9380 GB/s) 0.4570 (0.0000 GB/s) 0.1072 (2.5045 GB/s) 0.0000 (0.0000 Gb/s)
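The std::bad_alloc at m=n=32768 is consistent with simple arithmetic on the matrix sizes. The sketch below estimates the host memory for A, B, and C in double precision. It ignores the row padding visible in the LDA/LDB/LDC values and any extra buffers caldgemm allocates, so it is a lower bound rather than the tool's actual footprint; the function name is hypothetical.

```python
GIB = 1024 ** 3

def dgemm_host_bytes(m, n, k, elem_size=8):
    """Unpadded sizes in bytes of A (m x k), B (k x n) and C (m x n)."""
    return m * k * elem_size, k * n * elem_size, m * n * elem_size

for size in (16384, 32768):
    a, b, c = dgemm_host_bytes(size, size, 1024)
    total = a + b + c
    print(f"m=n={size}: C alone {c / GIB:.2f} GiB, A+B+C {total / GIB:.2f} GiB")
```

At m=n=16384 the three matrices need about 2.25 GiB, which fits in a 4 GB host. At m=n=32768, C alone is 8 GiB, so the allocation of C cannot succeed on that machine, matching the point in the log where the abort occurs.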
Hi rollyng,
As Marix said, there is an issue related to 6000-series GPUs that decreases system performance dramatically. However, the -z parameter should actually work.
To help debug this problem, can you do the following:
activate the DEBUG_MSG_ALLOCATION switch in caldgemm_config.h
set the STD_OUT parameter to stderr in caldgemm_config.h
run dgemm_bench -g -z -v -d and paste the output.
Can you please also tell me exactly which version you are using?
Cheers
Hi David,
Thanks for your message. I did the following for the CPU run; please take a look at it first.
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Cannot use multiple devices without multithreading Initializing CALDGEMM for 1 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting 
module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer 
name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer 
for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating 
device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Was able to allocate 21 bbuffers Using 8 CPU cores at 2401 MHz, 1 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2afa3f516010, B=0x2afa41557010, C=0x2afa43568010, (C-A=8430592, (C-B)/w=4104)) Running CPU only DGEMM DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.542 System Gflops 63.429 Times: Kernel Divide (0,0) Merge Copy To Copy From 0.0000 (inf Gflops) 0.0000 (-nan GB/s) 0.0000 (inf GB/s) 0.0000 (-nan GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing temporary CAL memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing temporary CAL memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing temporary CAL memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing temporary CAL memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing CAL GPU memory, device 0 
context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU memory, device 0 context 0 buffer 8 Freeing CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 Freeing temporary CAL memory, device 0 context 1 buffer 0 Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing temporary CAL memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing temporary CAL memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing temporary CAL memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 
context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL 
GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing context for device 0 Uninitializing CAL runtime rolly@rolly-X8DTG-QF:~/caldgemm$
Here is the single-GPU run on one of these HD6990s.
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Cannot use multiple devices without multithreading Initializing CALDGEMM for 1 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting 
module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer 
name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer 
for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating 
device buffer for device 0 obuffer 18 buffer 2
Allocating device buffer for device 0 obuffer 18 buffer 3
Allocating device buffer for device 0 obuffer 19 buffer 2
Allocating device buffer for device 0 obuffer 19 buffer 3
Allocating device buffer for device 0 obuffer 20 buffer 2
Allocating device buffer for device 0 obuffer 20 buffer 3
Was able to allocate 21 bbuffers on device 0
Was able to allocate 21 bbuffers
Using 8 CPU cores at 1600 MHz, 1 GPUs of 1536 shaders at 830 MHz
Caldgemm Init complete, setting CPU mask 80
Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad1b34c6010, B=0x2ad1b5507010, C=0x2ad1b7518010, (C-A=8430592, (C-B)/w=4104))
Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024)
Caldgemm Main Thread, setting CPU mask 1
Initiliazing GPU Constant Buffers...0 Done
GPU Curve Ration: 0.70, CPUScale 0.18, GPUScale 1.17
GPURatio automatically set to 0.94
Favoring m direction, 1 blocks
Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0)
Running Preprocessing device = 0 k = 0
Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2ad1b34c6010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0)
Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2ad1b5507010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1)
Copying part of A to GPU (k = 0, m = 0, n = 0)
Starting conversion kernel
Total Kernel Time: 0.0006
Copying part of B to GPU (k = 0, m = 0, n = 0)
Starting conversion kernel
Total Kernel Time: 0.0194
Waiting for event from device 0 obuffer 0...
Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0)
Total Kernel Time: 0.6100
Processing Output (Iteration 1) for device 0 tile 0 (m = 0, n = 0)
Waiting for event from device 0 obuffer 0...
Merging buffer (device 0, obuffer 0, k = 0, main thread)
Main thread unlocking obuffer mutex devuce 0 obuffer 0
Processing Output (Iteration 2) for device 0 tile 1 (m = 1, n = 0)
Waiting for event from device 0 obuffer 1...
Caldgemm Main Thread, setting CPU mask 80
DGEMM Run Complete
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.731 System Gflops 47.081
Times:  Kernel                    Divide (1,1)          Merge                 Copy To               Copy From
        0.6100 (56.2969 Gflops)   0.0191 (3.5123 GB/s)  0.0696 (1.9273 GB/s)  0.0301 (2.2296 GB/s)  0.0000 (0.0000 Gb/s)
Uninitializing CALDGEMM
Uninitializing buffers for device 0 context 0
Freeing CAL Host memory, device 0 context 0 buffer 0
Freeing temporary CAL memory, device 0 context 0 buffer 0
Freeing CAL Host memory, device 0 context 0 buffer 1
Freeing temporary CAL memory, device 0 context 0 buffer 1
Freeing CAL Host memory, device 0 context 0 buffer 2
Freeing temporary CAL memory, device 0 context 0 buffer 2
Freeing CAL Host memory, device 0 context 0 buffer 3
Freeing temporary CAL memory, device 0 context 0 buffer 3
Freeing CAL Host memory, device 0 context 0 buffer 4
Freeing CAL Host memory, device 0 context 0 buffer 5
Freeing CAL Host memory, device 0 context 0 buffer 6
Freeing CAL Host memory, device 0 context 0 buffer 7
Freeing CAL Host memory, device 0 context 0 buffer 8
Freeing CAL Host memory, device 0 context 0 buffer 9
Freeing CAL Host memory, device 0 context 0 buffer 10
Freeing CAL Host memory, device 0 context 0 buffer 11
Freeing CAL Host memory, device 0 context 0 buffer 12
Freeing CAL GPU memory, device 0 context 0 buffer 0
Freeing CAL GPU memory, device 0 context 0 buffer 1
Freeing CAL GPU memory, device 0 context 0 buffer 2
Freeing CAL GPU memory, device 0 context 0 buffer 3
Freeing CAL GPU memory, device 0 context 0 buffer 4
Freeing CAL GPU memory, device 0 context 0 buffer 5
Freeing CAL GPU memory, device 0 context 0 buffer 6
Freeing CAL GPU memory, device 0 context 0 buffer 7
Freeing CAL GPU
memory, device 0 context 0 buffer 8 Freeing CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 Freeing temporary CAL memory, device 0 context 1 buffer 0 Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing temporary CAL memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing temporary CAL memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing temporary CAL memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU 
memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 
buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing context for device 0 Uninitializing CAL runtime rolly@rolly-X8DTG-QF:~/caldgemm$
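For reference, here is a quick sanity check of the Gflops figures from the single-GPU run above (m=4096, k=1024, n=4096, kernel time 0.6100 s, system time 0.731 s). It is only a sketch: it assumes the usual 2*m*n*k operation count for DGEMM, and I have not checked that caldgemm computes its numbers this way. The function name `dgemm_gflops` is my own, not part of caldgemm.

```python
def dgemm_gflops(m, n, k, seconds):
    """Estimate DGEMM throughput in Gflops.

    Assumes 2*m*n*k floating-point operations (one multiply and one add
    per inner-product term); this count is an assumption, not taken
    from the caldgemm source.
    """
    return 2.0 * m * n * k / seconds / 1e9


# Sizes and times copied from the log output above.
kernel_gflops = dgemm_gflops(4096, 4096, 1024, 0.6100)
system_gflops = dgemm_gflops(4096, 4096, 1024, 0.731)

print("kernel: %.2f Gflops (log reports 56.2969)" % kernel_gflops)
print("system: %.2f Gflops (log reports 47.081)" % system_gflops)
```

Both estimates come out close to the logged values; the small differences are presumably because the times printed in the log are rounded.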
Now I run it with -z for CPU only:
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 
buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name 
o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating 
device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 
2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name 
for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 
context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, 
setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for 
device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 
4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting 
module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for device 2 obuffer 1 buffer 8 Allocating 
device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$
Finally, -z for GPUs: for me the -z option does not work at all.
By the way, I just ran "git clone git://code.compeng.uni-frankfurt.de/caldgemm". Do I have the latest version of caldgemm?
Thanks!
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 
buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name 
o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating 
device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 
2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name 
for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 
context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, 
setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for 
device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 
4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting 
module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for device 2 obuffer 1 buffer 8 Allocating 
device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$
Hi rollyng,
I tried to look into this. I plugged three 6970 GPUs into a node, but I cannot reproduce the issue you see.
The log you posted tells me that the AMD runtime is unable to allocate host memory, i.e. I issue a malloc call for a page-locked buffer but get an error message.
Could you please update to the current git revision or apply the attached patch? The debug message will then include the error code of the API, which is needed to analyze this further.
Since you said your system has only 4 GB of memory, you might be running out of page-locked memory.
You can try to use two GPUs and see whether that works with: ./dgemm_bench -z -v -d -Y 2
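To get a feeling for how much host memory is involved, here is a rough back-of-the-envelope sketch. It assumes the sizes printed by the debug log later in this thread (Width = 1024, Height = 1024, components = 2, type=double) and counts only the eight "Allocating Host buffer" lines per device visible in the verbose log (obuffer 0 and 1, buffers 0 to 3). The function names are made up for illustration and are not part of caldgemm, and the real page-locked footprint of the CAL runtime may well differ.

```cpp
#include <cstdint>

// Bytes of one host buffer: width x height elements,
// each element holding `components` doubles.
constexpr std::uint64_t hostBufferBytes(std::uint64_t width,
                                        std::uint64_t height,
                                        std::uint64_t components)
{
    return width * height * components * sizeof(double);
}

// Rough total for `devices` GPUs with `buffersPerDevice` host buffers each.
// Assumption: every device allocates the same number of equally sized buffers.
constexpr std::uint64_t totalHostBytes(std::uint64_t devices,
                                       std::uint64_t buffersPerDevice,
                                       std::uint64_t bytesPerBuffer)
{
    return devices * buffersPerDevice * bytesPerBuffer;
}

// With 8-byte doubles, one 1024 x 1024 x 2 buffer is 16 MiB.
static_assert(hostBufferBytes(1024, 1024, 2) == 16ull * 1024 * 1024,
              "one host buffer should be 16 MiB");
```

Under these assumptions, totalHostBytes(8, 8, hostBufferBytes(1024, 1024, 2)) comes to 1 GiB for all 8 devices, while the same call with 2 devices (as with -Y 2) comes to 256 MiB.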
Regards
--- a/caldgemm.cpp
+++ b/caldgemm.cpp
@@ -3383,7 +3383,7 @@ int caldgemm::SetupData(CALmodule *module, CALresource* &_Res, BufferProperties*
 calResFree(_Res
 );
 }
-if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory\n");
+if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory (Error code %d)\n", r);
 else if (Config->Debug) fprintf(STD_OUT, "No more memory available for bbuffers\n");
 return(1);
 }
Hi, I recompiled the latest version after a git pull.
With -c -z -d it now gives this output:
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1385
Initializing CAL
Was able to allocate 21 bbuffers
Waiting for cblas slave to start
Cblas helper thread started
Cblas thread Thread, setting CPU mask 80
Waiting for linpack slave to start
Using 8 CPU cores at 1600 MHz, 0 GPUs of 0 shaders at 0 MHz
Caldgemm Init complete, setting CPU mask 80
Linpack helper thread started
Linpack Thread, setting CPU mask 8
Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2aedb2bae010, B=0x2aedb4bef010, C=0x2aedb6c00010, (C-A=8430592, (C-B)/w=4104))
Running CPU only DGEMM
DGEMM Run Complete
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.558 System Gflops 61.684
Uninitializing CALDGEMM
Uninitializing CAL runtime
Trying to terminate linpack slave
Waiting for linpack slave to terminate
Waiting for merge threads to terminate
linpack slave terminating
rolly@rolly-X8DTG-QF:~/caldgemm$
With -g -z -d it still ends with an error:
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b94b3c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b94b4c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b94b5c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b94b6c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3c43120, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for 
device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 
context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b94b7e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b94b8e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b94b9e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b94bae96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 
started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 
Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b94bc097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b94bd097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b94be097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b94bf097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3c70ff0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device 
buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for 
device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b94c0298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b94c1298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b94c2298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b94c3298000, Width = 1024, Height = 1024, components = 2, 
type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 
obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Clearing Memory at 0x2b94c4499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Clearing Memory at 0x2b94c5499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Clearing Memory at 0x2b94c6499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Clearing Memory at 0x2b94c7499000, Width = 1024, 
Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Clearing Memory at 0x3c9e810, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer 
name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Clearing Memory at 0x2b94c869a000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 1 buffer 0 
Allocating Host buffer for device 2 obuffer 1 buffer 1 Clearing Memory at 0x2b94c969a000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Error 'Operational error' while allocattion of remote memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$
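As a rough check, the log above can be used to estimate how much host ("remote") memory had already been allocated when the error appeared. This is only a sketch under assumptions read off the log: each "Allocating Host buffer" line is taken to be one allocation of Width x Height x components x sizeof(double), and each device is taken to get 8 such buffers (obuffer 0 and obuffer 1, buffers 0 to 3). The tiny "Host memory" buffer 4 is ignored. The function names are made up for illustration and are not part of caldgemm.

```python
# Back-of-the-envelope estimate of host ("remote") memory requested by
# caldgemm before the failure, based only on figures printed in the log.

# From the log: "Width = 1024, Height = 1024, components = 2, type=double"
WIDTH = 1024
HEIGHT = 1024
COMPONENTS = 2
SIZEOF_DOUBLE = 8  # bytes

# Assumption from the log pattern: obuffer 0 and obuffer 1 each get
# host buffers 0..3, i.e. 8 host buffers per device.
HOST_BUFFERS_PER_DEVICE = 8

MIB = 1024 * 1024


def host_buffer_bytes():
    """Size in bytes of one host buffer as described in the log."""
    return WIDTH * HEIGHT * COMPONENTS * SIZEOF_DOUBLE


def host_bytes_allocated(full_devices, extra_buffers=0):
    """Bytes allocated for `full_devices` fully initialized devices plus
    `extra_buffers` host buffers on a partially initialized device."""
    count = full_devices * HOST_BUFFERS_PER_DEVICE + extra_buffers
    return count * host_buffer_bytes()


if __name__ == "__main__":
    # Devices 0 and 1 completed; device 2 got 6 host buffers
    # (obuffer 0 buffers 0-3, obuffer 1 buffers 0-1) before failing.
    print("one host buffer (MiB):", host_buffer_bytes() // MIB)
    print("allocated before failure (MiB):", host_bytes_allocated(2, 6) // MIB)
    # Hypothetical: what a full 8-GPU initialization would ask for.
    print("needed for 8 devices (MiB):", host_bytes_allocated(8) // MIB)
```

Under these assumptions one host buffer is 16 MiB (which matches the 0x1000000 spacing of the "Clearing Memory at" addresses), about 352 MiB had been allocated when the error occurred, and 8 GPUs would need about 1024 MiB. This does not prove that a memory limit is the cause, it only shows where in the allocation sequence the failure happens.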
With 2 GPUs (-Y 2) it finishes! So does this mean the current version of caldgemm cannot run on 4x HD6990s (8 GPUs)? Thanks!
rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z -v -d -Y 2 Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 2 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b40865a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b40875a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b40885a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b40895a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3105020, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for 
device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 
context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b408a7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b408b7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b408c7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b408d7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 
started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 
Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b408e9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b408f9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b40909aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b40919aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3132ef0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device 
buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for 
device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b4092bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b4093bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b4094bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b4095bab000, Width = 1024, Height = 1024, components = 2, 
type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 
obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Was able to allocate 21 bbuffers Waiting for linpack slave to start Using 8 CPU cores at 1600 MHz, 2 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Linpack helper thread started Linpack Thread, setting CPU mask 20 Initializing Data... 
...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b4096fad010, B=0x2b4098fee010, C=0x2b409afff010, (C-A=8430592, (C-B)/w=4104)) Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024) Caldgemm Main Thread, setting CPU mask 1 Initiliazing GPU Constant Buffers...01 Done GPU Curve Ration: 0.70, CPUScale 0.12, GPUScale 2.34 GPURatio automatically set to 0.98 Favoring m direction, 1 blocks Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0) Running Preprocessing device = 0 k = 0 Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2b4096fad010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0) Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2b4098fee010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1) Copying part of A to GPU (k = 0, m = 0, n = 0) Copying part of B to GPU (k = 0, m = 0, n = 0) Locking obuffer mutex 0/0 Waiting for event from device 0 obuffer 0... Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0) Total Kernel Time: 0.5996 Processing Output (Iteration 2) for device 0 tile 0 (m = 0, n = 0) Waiting for event from device 0 obuffer 0... Unlocking outputthread mutex 0 to process device 0 obuffer 0 Processing Output (Iteration 3) for device 1 tile 1 (m = 1, n = 0) Waiting for event from device 1 obuffer 0... Processing Output (Iteration 4) for device 0 tile 2 (m = 2, n = 0) Waiting for event from device 0 obuffer 1... 
Waiting to finish merge process for device 0 obuffer 0 Slave thread 0 (device 0) starting merge process for obuffer 0 (k = 0) Merge time: 0.080 Unlocking mutex device 0 obuffer 0 (Slavethread 0) Waiting to finish merge process for device 1 obuffer 0 Waiting to finish merge process for device 1 obuffer 1 Waiting to finish merge process for device 0 obuffer 2 Waiting to finish merge process for device 1 obuffer 2 Caldgemm Main Thread, setting CPU mask 80 DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.733 System Gflops 46.938 Times: Kernel Divide (1,1) Merge Copy To Copy From 0.5996 (57.2786 Gflops) 0.0213 (3.1549 GB/s) 0.0803 (0.0000 GB/s) 0.0309 (2.1699 GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL Host memory, device 0 context 0 buffer 5 Freeing CAL Host memory, device 0 context 0 buffer 6 Freeing CAL Host memory, device 0 context 0 buffer 7 Freeing CAL Host memory, device 0 context 0 buffer 8 Freeing CAL Host memory, device 0 context 0 buffer 9 Freeing CAL Host memory, device 0 context 0 buffer 10 Freeing CAL Host memory, device 0 context 0 buffer 11 Freeing CAL Host memory, device 0 context 0 buffer 12 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing CAL GPU memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU memory, device 0 context 0 buffer 8 Freeing 
CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, 
device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 
2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing buffers for device 1 context 0 Freeing CAL Host memory, device 1 context 0 buffer 0 Freeing CAL Host memory, device 1 context 0 buffer 1 Freeing CAL Host memory, device 1 context 0 buffer 2 Freeing CAL Host memory, device 1 context 0 buffer 3 Freeing CAL Host memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 0 Freeing CAL GPU memory, device 1 context 0 buffer 1 Freeing CAL GPU memory, device 1 context 0 buffer 2 Freeing CAL GPU memory, device 1 context 0 buffer 3 Freeing CAL GPU memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 5 Freeing CAL GPU memory, device 1 context 0 buffer 6 Freeing CAL GPU memory, device 1 context 0 buffer 7 Freeing CAL GPU memory, device 1 context 0 buffer 8 Freeing CAL GPU memory, device 1 context 0 buffer 9 Freeing CAL GPU memory, device 1 context 0 buffer 10 Freeing CAL GPU memory, device 1 context 0 buffer 11 Freeing CAL GPU memory, device 1 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 1 context 1 Freeing CAL Host memory, device 1 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 1 context 1 buffer 1 Freeing CAL Host memory, device 1 context 1 buffer 2 Freeing CAL Host memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 0 Freeing CAL GPU memory, device 1 context 1 buffer 1 Freeing CAL GPU memory, device 1 context 1 buffer 2 Freeing CAL GPU 
memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 5 Freeing CAL GPU memory, device 1 context 1 buffer 6 Freeing CAL GPU memory, device 1 context 1 buffer 7 Freeing CAL GPU memory, device 1 context 1 buffer 8 Freeing CAL GPU memory, device 1 context 1 buffer 9 Freeing CAL GPU memory, device 1 context 1 buffer 10 Freeing CAL GPU memory, device 1 context 1 buffer 11 Freeing CAL GPU memory, device 1 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 1 context 2 Freeing CAL GPU memory, device 1 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 1 context 2 buffer 3 Freeing CAL GPU memory, device 1 context 2 buffer 5 Freeing CAL GPU memory, device 1 context 2 buffer 6 Freeing CAL GPU memory, device 1 context 2 buffer 7 Freeing CAL GPU memory, device 1 context 2 buffer 8 Freeing CAL GPU memory, device 1 context 2 buffer 9 Freeing CAL GPU memory, device 1 context 2 buffer 10 Freeing CAL GPU memory, device 1 context 2 buffer 11 Freeing CAL GPU memory, device 1 context 2 buffer 12 Uninitializing buffers for device 1 context 3 Freeing CAL GPU memory, device 1 context 3 buffer 2 Freeing CAL GPU memory, device 1 context 3 buffer 3 Uninitializing buffers for device 1 context 4 Freeing CAL GPU memory, device 1 context 4 buffer 2 Freeing CAL GPU memory, device 1 context 4 buffer 3 Uninitializing buffers for device 1 context 5 Freeing CAL GPU memory, device 1 context 5 buffer 2 Freeing CAL GPU memory, device 1 context 5 buffer 3 Uninitializing buffers for device 1 context 6 Freeing CAL GPU memory, device 1 context 6 buffer 2 Freeing CAL GPU memory, device 1 context 6 buffer 3 Uninitializing buffers for device 1 context 7 Freeing CAL GPU memory, device 1 context 7 buffer 2 Freeing CAL GPU memory, device 1 context 7 buffer 3 Uninitializing buffers for device 1 context 8 Freeing CAL GPU memory, device 1 context 8 buffer 2 Freeing CAL GPU memory, device 1 context 8 buffer 3 Uninitializing 
buffers for device 1 context 9 Freeing CAL GPU memory, device 1 context 9 buffer 2 Freeing CAL GPU memory, device 1 context 9 buffer 3 Uninitializing buffers for device 1 context 10 Freeing CAL GPU memory, device 1 context 10 buffer 2 Freeing CAL GPU memory, device 1 context 10 buffer 3 Uninitializing buffers for device 1 context 11 Freeing CAL GPU memory, device 1 context 11 buffer 2 Freeing CAL GPU memory, device 1 context 11 buffer 3 Uninitializing buffers for device 1 context 12 Freeing CAL GPU memory, device 1 context 12 buffer 2 Freeing CAL GPU memory, device 1 context 12 buffer 3 Uninitializing buffers for device 1 context 13 Freeing CAL GPU memory, device 1 context 13 buffer 2 Freeing CAL GPU memory, device 1 context 13 buffer 3 Uninitializing buffers for device 1 context 14 Freeing CAL GPU memory, device 1 context 14 buffer 2 Freeing CAL GPU memory, device 1 context 14 buffer 3 Uninitializing buffers for device 1 context 15 Freeing CAL GPU memory, device 1 context 15 buffer 2 Freeing CAL GPU memory, device 1 context 15 buffer 3 Uninitializing buffers for device 1 context 16 Freeing CAL GPU memory, device 1 context 16 buffer 2 Freeing CAL GPU memory, device 1 context 16 buffer 3 Uninitializing buffers for device 1 context 17 Freeing CAL GPU memory, device 1 context 17 buffer 2 Freeing CAL GPU memory, device 1 context 17 buffer 3 Uninitializing buffers for device 1 context 18 Freeing CAL GPU memory, device 1 context 18 buffer 2 Freeing CAL GPU memory, device 1 context 18 buffer 3 Uninitializing buffers for device 1 context 19 Freeing CAL GPU memory, device 1 context 19 buffer 2 Freeing CAL GPU memory, device 1 context 19 buffer 3 Uninitializing buffers for device 1 context 20 Freeing CAL GPU memory, device 1 context 20 buffer 2 Freeing CAL GPU memory, device 1 context 20 buffer 3 Uninitializing context for device 0 Uninitializing context for device 1 Uninitializing CAL runtime Trying to terminate linpack slave Waiting for linpack slave to terminate Waiting 
for merge threads to terminate linpack slave terminating rolly@rolly-X8DTG-QF:~/caldgemm$
I think caldgemm currently requires 2 to 3 CPU-Cores per GPU (would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.
This is to some extent owed to the fact that we currently use Magny-Cours-CPUs -> Plenty of cores.
is there any place where i can see a performance comparison between caldgemm and clAmdBlasDgemm ?
Originally posted by: laobrasuca is there any place where i can see a performance comparison between caldgemm and clAmdBlasDgemm ?
Hi, I did some tests with acmlgpu1.1.2. When I run Info.exe, it shows:
rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./Info.exe CPUID: function (0) Vendor: GenuineIntel function (1) Family-Model-Stepping: 6-44-2 Feature flags (EDX): BFEBFBFFh Feature flags (ECX): 009EE3FDh MMX (EDX bit 13): yes SSE1 (EDX bit 25): yes SSE2 (EDX bit 26): yes SSE3 (ECX bit 0): yes SSSE3 (ECX bit 9): yes SSE4.1 (ECX bit 19): yes SSE4.2 (ECX bit 20): yes AVX (ECX bit 28): no function (8000_0004) Processor Brand: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz > uname -a Linux rolly-X8DTG-QF 2.6.35-28-generic #50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 x86_64 GNU/Linux > powersave -c sh: powersave: not found CAL RT version: 1.4.1385 CAL CL version: 1.4.1385 gpu0: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu1: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements 
calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu2: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu3: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu4: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global 
data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu5: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu6: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported gpu7: Type: CALtarget(15) (unknown type) Revision: 1 Maximum resource 1D width: 16384 Maximum resource 2D width: 16384 Maximum resource 2D height: 16384 Local GPU RAM: 2048 megabytes Uncached remote GPU memory: 1787 megabytes Cached remote GPU memory: 508 megabytes GPU device clock rate: 830 megahertz GPU memory clock rate: 1250 
megahertz Wavefront size: 64 Number of SIMDs: 24 Number of shader engines: 2 double precision: Supported local data share: Supported global data share: Supported global GPR: Supported compute shader: Supported memexport: Supported calResCreate pitch alignment: 256 data elements calResCreate address alignment: 256 bytes Unaligned Access Views (UAVs): 12 3D program grid: Supported GPUs found: 8
However, when I run time_dgemm.exe, it looks like I am hitting the same wall: it can only make use of 3 out of 8 GPUs, even though I have 32GB of host memory.
rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./time_dgemm.exe
Matrix Time in Performance
Size Seconds in Megaflops
------ ------------ ------------
ERROR: gpu3 - unable to allocate minimum cached system (GART) memory
gpu3 Total Available Last Request
Local: 2048 MB 196 MB 1845493760 (1760 MB) ok
Remote (NC): 1787 MB 1720 MB 0 ( 0 MB) FAILED
Remote (C): 508 MB 463 MB 5242880 ( 5 MB) FAILED
ERROR: gpu4 - unable to allocate minimum cached system (GART) memory
gpu4 Total Available Last Request
Local: 2048 MB 196 MB 1845493760 (1760 MB) ok
Remote (NC): 1787 MB 1720 MB 0 ( 0 MB) FAILED
Remote (C): 508 MB 463 MB 5242880 ( 5 MB) FAILED
ERROR: gpu5 - unable to allocate minimum cached system (GART) memory
gpu5 Total Available Last Request
Local: 2048 MB 196 MB 1845493760 (1760 MB) ok
Remote (NC): 1787 MB 1720 MB 0 ( 0 MB) FAILED
Remote (C): 508 MB 463 MB 5242880 ( 5 MB) FAILED
ERROR: gpu6 - unable to allocate minimum cached system (GART) memory
gpu6 Total Available Last Request
Local: 2048 MB 196 MB 1845493760 (1760 MB) ok
Remote (NC): 1787 MB 1720 MB 0 ( 0 MB) FAILED
Remote (C): 508 MB 463 MB 5242880 ( 5 MB) FAILED
ERROR: gpu7 - unable to allocate minimum cached system (GART) memory
gpu7 Total Available Last Request
Local: 2048 MB 196 MB 1845493760 (1760 MB) ok
Remote (NC): 1787 MB 1728 MB 0 ( 0 MB) FAILED
Remote (C): 508 MB 472 MB 5242880 ( 5 MB) FAILED
WARNING: 5 out of 8 GPUs failed to initialize; proceeding with other(s).
400 2.250818 56
600 0.045632 9467
800 0.049524 20676
1000 0.068471 29209
1200 0.086970 39737
1400 0.109446 50143
1600 0.141187 58022
1800 0.177105 65859
2000 0.206845 77352
2200 0.234911 90655
2400 0.259695 106463
2600 0.290227 121118
2800 0.331030 132628
3000 0.377459 143061
3200 0.361680 181198
3400 0.395542 198735
3600 0.431999 216000
3800 0.467440 234776
4000 0.520821 245765
4200 0.566723 261460
4400 0.618775 275331
4600 0.671366 289963
4800 0.736608 300273
5000 0.801185 312037
5200 0.888577 316479
5400 0.937255 336011
5600 1.007444 348636
5800 1.065766 366144
6000 1.155561 373844
6200 1.212021 393273
6400 1.287458 407227
6600 1.331005 431998
6800 1.375383 457228
7000 1.467186 467561
7200 1.515511 492570
7400 1.626788 498189
7600 1.744087 503387
7800 1.843866 514735
8000 1.918230 533825
hi there, what's the difference between ACML-GPU and clAmdBlas? Would it be that one is CAL and the other OpenCL? And what about performance (at least for single GPU setup)?
Hi, I think you are right. clAmdBlas needs OpenCL, but I find that there is only an sgemm example for clAmdBlas, so I may not be able to compare the dgemm performance of the two libraries?
yes, there's only the sgemm example (but be aware that this example has a typo: matrix A is written to bufB, see this post http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150952&enterthread=y). Still, sgemm and dgemm perform exactly the same mathematical operations except for the data type (double instead of float), so I believe you can use the exact same example, changing only the types, as long as you have a card that supports double precision computations (like the 6790 of yours).
if you compare performance results with caldgemm, please let us know.
OK, let's look at acmlgpu-1.1.2 first, for both dgemm and sgemm on a single HD6970.
rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_dgemm.exe
Matrix Time in Performance
Size Seconds in Megaflops
------ ------------ ------------
400 0.758880 168
600 0.030139 14333
800 0.035727 28662
1000 0.049771 40184
1200 0.067966 50849
1400 0.068995 79541
1600 0.086494 94711
1800 0.111565 104549
2000 0.134369 119075
2200 0.161584 131795
2400 0.184988 149458
2600 0.214080 164200
2800 0.241470 181819
3000 0.288899 186916
3200 0.333995 196218
3400 0.415724 189087
3600 0.494598 188662
3800 0.565289 194137
4000 0.662003 193352
4200 0.762732 194270
4400 0.829545 205375
4600 0.866307 224714
4800 0.953990 231851
5000 1.100728 227122
5200 1.219832 230536
5400 1.368855 230066
5600 1.569290 223815
5800 1.789924 218011
6000 2.022892 213555
6200 1.990262 239494
6400 2.092950 250501
6600 2.339395 245786
6800 2.555408 246091
7000 2.725500 251696
7200 2.983605 250199
7400 3.627176 223437
7600 3.460915 253676
7800 3.715710 255430
8000 4.046330 253068
rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_sgemm.exe
Matrix Time in Performance
Size Seconds in Megaflops
------ ------------ ------------
400 0.711834 179
600 0.021887 19738
800 0.029878 34273
1000 0.030939 64643
1200 0.035729 96728
1400 0.042776 128295
1600 0.050936 160829
1800 0.061199 190590
2000 0.071055 225176
2200 0.083770 254220
2400 0.092456 299038
2600 0.108912 322755
2800 0.122735 357714
3000 0.132517 407495
3200 0.158700 412954
3400 0.190118 413468
3600 0.211197 441824
3800 0.244340 449143
4000 0.273162 468585
4200 0.352204 420711
4400 0.388507 438519
4600 0.404213 481607
4800 0.450925 490511
5000 0.492675 507434
5200 0.560158 502030
5400 0.652638 482546
5600 0.721320 486929
5800 0.721891 540558
6000 0.828848 521205
6200 0.966798 493025
6400 0.994621 527123
6600 1.134357 506888
6800 1.240944 506762
7000 1.281981 535109
7200 1.296301 575866
7400 1.446126 560427
7600 1.502031 584510
7800 1.653382 574038
8000 1.928930 530864
rolly@rolly-p5q-pro:~/GPGPUexamples$
Now caldgemm:
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 4096 -n 4096
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b58989a8010, B=0x2b589a9e9010, C=0x2b589c9fa010, (C-A=8430592, (C-B)/w=4104))
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602
rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 8192 -n 8192
Use -? for help
Cannot use multiple devices without multithreading
Was able to allocate 21 bbuffers
Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b1edc693010, B=0x2b1ee0714010, C=0x2b1ee4725010, (C-A=16851968, (C-B)/w=8200))
Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.581 System Gflops 236.899
What I can conclude so far:
(1) Only 1 HD6990 can run with these libraries, no matter how many additional cards are installed?
(2) acml-gpu appears to have better performance?
Thanks for reading!
(1) You mean 1 HD6970, right? Well, that seems to be the case, since the performance is roughly 1/5 of the nominal TERAFLOP number in the better of the two libraries. Can anyone confirm this?
(2) There's something I didn't understand. Is the matrix size for acmlgpu-1.1.2 equal to 8000x8000? Because caldgemm does the product for 8192x1024, which is a huge difference. Since the size of the matrix has a visible influence on the Gflops, we can hardly compare these results. Could you re-run them for comparable matrix sizes? However, I don't know whether the algorithm is optimized for square matrices or not.
And, if you have a little more time, could you test clAmdBlasDgemm (and maybe clAmdBlasSgemm, to compare with time_sgemm.exe)?
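To make the size mismatch concrete, here is a rough sketch that recomputes both headline numbers from the posted timings. It assumes the usual convention of counting a GEMM as 2*m*n*k floating-point operations; the thread does not say which convention either benchmark uses, but the acml-gpu figure for size 8000 comes out consistent with it. The function names are made up for illustration.

```python
def gemm_flops(m, n, k):
    # Assumed convention: one multiply and one add per inner-product term.
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    # Throughput in Gflops for one GEMM call that took `seconds`.
    return gemm_flops(m, n, k) / seconds / 1e9


# acml-gpu time_dgemm.exe, last row: size 8000, 4.046330 s (assumed square, m = n = k).
acml = gflops(8000, 8000, 8000, 4.046330)

# caldgemm dgemm_bench -m 8192 -n 8192: k stays at 1024, System Time 0.581 s.
cal = gflops(8192, 8192, 1024, 0.581)

print(f"acml-gpu 8000^3:        {acml:.1f} Gflops")
print(f"caldgemm 8192x1024x8192: {cal:.1f} Gflops")
print(f"work ratio (acml/caldgemm): {gemm_flops(8000, 8000, 8000) / gemm_flops(8192, 8192, 1024):.2f}")
```

The caldgemm value recomputed this way differs slightly from the printed 236.899 because the System Time is only shown to three decimals. The last line shows that the acml-gpu run does several times more arithmetic per call, which is why matching m, n and k would give a fairer comparison.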
Originally posted by: Marix I think caldgemm currently requires 2 to 3 CPU-Cores per GPU (would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.
This is to some extent owed to the fact that we currently use Magny-Cours-CPUs -> Plenty of cores.
Hi Marix, thanks for your clarification. I have 2 E5620s in my system with Hyper-Threading enabled, so the system monitor shows 16 CPUs and I should have 2 CPUs per Cayman GPU. Is this still insufficient for the caldgemm requirement? I believe the maximum number of CPU cores per node is 24 with Intel 1366-pin processors, so that makes 3 CPUs per Cayman...
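The arithmetic behind this question can be sketched as follows. This is only a back-of-the-envelope estimate: the 2 to 3 cores per GPU figure is Marix's unverified guess, and whether caldgemm counts Hyper-Threading siblings as usable cores is not stated in the thread (the log above reports "Using 8 CPU cores"). The helper name is hypothetical.

```python
def max_gpus(cpu_count, cores_per_gpu):
    # Hypothetical helper: how many GPUs fit if each one needs `cores_per_gpu` CPUs.
    return cpu_count // cores_per_gpu


physical_cores = 2 * 4   # two quad-core E5620s
logical_cpus = 2 * 8     # the same CPUs with Hyper-Threading enabled

for cores_per_gpu in (2, 3):
    print(
        f"{cores_per_gpu} cores/GPU: "
        f"{max_gpus(physical_cores, cores_per_gpu)} GPUs on physical cores, "
        f"{max_gpus(logical_cpus, cores_per_gpu)} GPUs on logical CPUs"
    )
```

If only physical cores count, 2 cores per GPU gives 4 GPUs, i.e. two HD6990s, which matches Marix's estimate; all 8 GPUs would only fit if logical CPUs count and 2 per GPU is enough.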
I am having similar memory issues. I am running an HD5870. I tried following the instructions given on the Wiki.
Here are the things I did not do:
Instead of using git, I just downloaded/unzipped the latest version from the Files Page
-march=native didn't work so I just deleted it from makefile so that it can compile error-free.
I did not use the binary patch for the Catalyst driver.
Here is my output given your instructions of ./dgemm_bench -g -z -v -d
./dgemm_bench -g -z -v -d
Use -? for help
Init Caldgemm, setting CPU mask 1
CAL Runtime Version:1.4.1016
Initializing CAL
Initializing CALDGEMM for 1 devices
Allocating Host buffer for device 0 obuffer 0 buffer 0
Allocating device buffer for device 0 obuffer 0 buffer 0
Allocating Host buffer for device 0 obuffer 0 buffer 1
Allocating device buffer for device 0 obuffer 0 buffer 1
Allocating Host buffer for device 0 obuffer 0 buffer 2
Allocating device buffer for device 0 obuffer 0 buffer 2
Allocating Host buffer for device 0 obuffer 0 buffer 3
Allocating device buffer for device 0 obuffer 0 buffer 3
Allocating Host memory for device 0 obuffer 0 buffer 4
Allocating device buffer for device 0 obuffer 0 buffer 5
Allocating device buffer for device 0 obuffer 0 buffer 6
Allocating device buffer for device 0 obuffer 0 buffer 7
Allocating device buffer for device 0 obuffer 0 buffer 8
Allocating device buffer for device 0 obuffer 0 buffer 9
Allocating device buffer for device 0 obuffer 0 buffer 10
Allocating device buffer for device 0 obuffer 0 buffer 11
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
Merger Thread 0 started
Merge Thread 0, setting CPU mask 2
Allocating Host buffer for device 0 obuffer 1 buffer 0
Allocating device buffer for device 0 obuffer 1 buffer 0
Allocating Host buffer for device 0 obuffer 1 buffer 1
Allocating device buffer for device 0 obuffer 1 buffer 1
Allocating Host buffer for device 0 obuffer 1 buffer 2
Allocating device buffer for device 0 obuffer 1 buffer 2
Allocating Host buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 3
Allocating device buffer for device 0 obuffer 1 buffer 5
Allocating device buffer for device 0 obuffer 1 buffer 6
Allocating device buffer for device 0 obuffer 1 buffer 7
Allocating device buffer for device 0 obuffer 1 buffer 8
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM
Thanks
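When comparing logs like the one above, it can help to pull out the last device buffer that was allocated before the error. The following is a minimal sketch, assuming the message format seen in this output. The function name `last_device_allocation_before_error` is hypothetical and not part of caldgemm, and the script only locates the failing step; it does not diagnose the cause.

```python
import re

# A few lines copied from the log above (shortened for the example).
LOG_EXCERPT = """\
Allocating device buffer for device 0 obuffer 0 buffer 12
Allocating Host Constant buffer device 0 context 0 buffer 4
Merger Thread 0 started
Allocating device buffer for device 0 obuffer 1 buffer 9
Allocating device buffer for device 0 obuffer 1 buffer 10
There was an error in allocating resources and binding them to memory
Error initializing CALDGEMM
"""

# Matches the "Allocating device buffer" lines as they appear in this output.
ALLOC = re.compile(
    r"Allocating device buffer for device (\d+) obuffer (\d+) buffer (\d+)"
)


def last_device_allocation_before_error(lines):
    """Return (device, obuffer, buffer) of the last device buffer
    allocation logged before the first line mentioning an error,
    or None if the log contains no error line."""
    last = None
    for line in lines:
        if "error" in line.lower():
            return last
        match = ALLOC.search(line)
        if match:
            last = tuple(int(group) for group in match.groups())
    return None


result = last_device_allocation_before_error(LOG_EXCERPT.splitlines())
print(result)  # (device, obuffer, buffer) of the last logged allocation
```

On the full log this points at device 0, obuffer 1, buffer 10. Since obuffer 0 went on to buffers 11 and 12, the failure appears to happen while allocating buffer 11 of the second output buffer set.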