27 Replies Latest reply on Jul 28, 2011 5:19 PM by Gametimehero

    caldgemm with HD6990s

    rollyng
      Can I run caldgemm on HD6990s?

      Hi,

      I could not find a caldgemm forum, so I am posting here in the hope that its developers read this message.

      I have four HD6990s and would really like to see how they perform in GFLOPS, which is how I came across this tool: http://code.compeng.uni-frankfurt.de/projects/caldgemm/wiki

      I am running Ubuntu 10.10 x86_64 with AMD driver 11.5 and SDK 2.4, and I followed the instructions on the wiki. However, I had to use "make TARGET=NEHALEM NO_MEMPOLICY=1 -j" to compile GotoBLAS2, since I have two E5620 CPUs.

      caldgemm compiles, and the build output looks fine apart from some warnings.

      But when I run it, it prints an error and hangs idle. Can anyone help? Thanks!

      rolly@rolly-X8DTG-QF:~/caldgemm$ make
      g++ -c caldgemm.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
      caldgemm.cpp: In function ‘void* divide_wrapper(void*)’:
      caldgemm.cpp:1661: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘int’
      caldgemm.cpp: In member function ‘int caldgemm::RunCALDGEMM(double*, double*, double*, double, double, size_t, size_t, size_t, size_t, size_t, size_t, CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int)’:
      caldgemm.cpp:2260: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
      caldgemm.cpp:2269: warning: format ‘%lld’ expects type ‘long long int’, but argument 3 has type ‘size_t’
      caldgemm.cpp:2403: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
      caldgemm.cpp:2421: warning: format ‘%lld’ expects type ‘long long int’, but argument 4 has type ‘size_t’
      caldgemm.cpp: In member function ‘int caldgemm::DGEMM_prepare(size_t, int, unsigned int)’:
      caldgemm.cpp:2738: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
      caldgemm.cpp:2761: warning: format ‘%d’ expects type ‘int’, but argument 4 has type ‘size_t’
      g++ -c benchmark.cpp -Wfloat-equal -Wpointer-arith -DATI_OS_LINUX -g3 -ffor-scope -O3 -march=core2 -ftree-vectorize -msse3 -fkeep-inline-functions -fweb -frename-registers -minline-all-stringops -funit-at-a-time -mfpmath=sse -ftracer -finline-limit=1200 -fpeel-loops -D_NO_AMD_CPU -I ../GotoBLAS2 -I /home/rolly/AMD-APP-SDK-v2.4-lnx64/include/CAL
      g++ -o dgemm_bench caldgemm.o benchmark.o -lpthread -ldl -L/usr/X11R6/lib -laticalrt -laticalcl -lgfortran ../GotoBLAS2/libgoto2.a
      rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c
      Use -? for help
      Cannot use multiple devices without multithreading
      Segmentation fault
      rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
      Use -? for help
      There was an error in allocating resources and binding them to memory
      Error initializing CALDGEMM
      rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g
      Use -? for help
      Cannot use multiple devices without multithreading
      Was able to allocate 21 bbuffers
      Initializing Data... ...alloc AERROR locking Pages
      ...alloc BERROR locking Pages
      ...alloc CERROR locking Pages
      Memory allocation error allocating matrices

        • caldgemm with HD6990s
          Marix

          There is currently no forum for caldgemm; however, there is a (low-volume) mailing list at https://compeng.uni-frankfurt.de/mailman/listinfo/caldgemm .

          What I can see from your output is the following:

          • Use -z if you want to use both GPUs.
          • Memory allocation fails. I assume your ulimit for max locked memory is too low. On machines you want to benchmark, it makes sense to set "ulimit -l unlimited". You might also want to specify that in /etc/security/limits.conf.
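The locked-memory limit mentioned above can also be checked from a script before a benchmark is started. The following is a minimal sketch in Python and is not part of caldgemm; the helper names `format_limit` and `memlock_limits` are made up for illustration. It relies on the standard `resource` module, which reports `RLIMIT_MEMLOCK` in bytes, whereas `ulimit -l` prints kbytes.

```python
import resource


def format_limit(value_bytes):
    """Render a limit the way `ulimit -l` does (kbytes or 'unlimited')."""
    if value_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value_bytes // 1024} kbytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory: soft={soft}, hard={hard}")
```

A soft limit of 64 kbytes is far too small for page-locked matrix buffers, so a check like this would flag the problem before the benchmark reports "ERROR locking Pages".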
            • caldgemm with HD6990s
              rollyng

              Hi Marix,

              These were my limits before the change:

              rolly@rolly-X8DTG-QF:~$ ulimit -a
              core file size          (blocks, -c) unlimited
              data seg size           (kbytes, -d) unlimited
              scheduling priority             (-e) 20
              file size               (blocks, -f) unlimited
              pending signals                 (-i) 16382
              max locked memory       (kbytes, -l) 64
              max memory size         (kbytes, -m) unlimited
              open files                      (-n) 1024
              pipe size            (512 bytes, -p) 8
              POSIX message queues     (bytes, -q) 819200
              real-time priority              (-r) 0
              stack size              (kbytes, -s) unlimited
              cpu time               (seconds, -t) unlimited
              max user processes              (-u) unlimited
              virtual memory          (kbytes, -v) unlimited
              file locks                      (-x) unlimited

              So I made some changes to /etc/security/limits.conf, following

              http://www.akadia.com/services/ora_enable_core.html

              Now I can set "ulimit -l unlimited", and the limits look like this:

              rolly@rolly-X8DTG-QF:~/caldgemm$ ulimit -a
              core file size          (blocks, -c) unlimited
              data seg size           (kbytes, -d) unlimited
              scheduling priority             (-e) 20
              file size               (blocks, -f) unlimited
              pending signals                 (-i) 16382
              max locked memory       (kbytes, -l) unlimited
              max memory size         (kbytes, -m) unlimited
              open files                      (-n) 1024
              pipe size            (512 bytes, -p) 8
              POSIX message queues     (bytes, -q) 819200
              real-time priority              (-r) 0
              stack size              (kbytes, -s) unlimited
              cpu time               (seconds, -t) unlimited
              max user processes              (-u) unlimited
              virtual memory          (kbytes, -v) unlimited
              file locks                      (-x) unlimited

              Now running the benchmark:

              rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench
              Use -? for help
              Cannot use multiple devices without multithreading
              Was able to allocate 21 bbuffers
              Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
              Doing initial run... Done
              Initializing Matrix C
              Running Benchmark
              Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b2392afc010, B=0x2b2394b3d010, C=0x2b2396b4e010, (C-A=8430592, (C-B)/w=4104))
              Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.656 System Gflops 52.459

              But with the -z option, it still fails:

              rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z
              Use -? for help
              There was an error in allocating resources and binding them to memory
              Error initializing CALDGEMM

              Any hint on the last error? Thank you!

                • caldgemm with HD6990s
                  rollyng

                  Hi Marix,

                  A further update on my problem: I think it is a multi-GPU issue. I ran the same test on another system with the same software configuration but just a single HD6970, and it now produces reasonable results (see the output at the end of this post).

                  Is it expected that the system performance is just 164 GFLOPS versus a kernel performance of 465 GFLOPS for a single HD6970?

                  For my 4x HD6990s, the -g parameter does not work at all...

                  Thanks!

                  rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -c
                  Use -? for help
                  Cannot use multiple devices without multithreading
                  Was able to allocate 21 bbuffers
                  Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                  Doing initial run... Done
                  Initializing Matrix C
                  Running Benchmark
                  Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ab9dea74010, B=0x2ab9e0ab5010, C=0x2ab9e2ac6010, (C-A=8430592, (C-B)/w=4104))
                  Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 1.652 System Gflops 20.822
                  rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
                  Use -? for help
                  Cannot use multiple devices without multithreading
                  Was able to allocate 21 bbuffers
                  Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                  Doing initial run... Done
                  Initializing Matrix C
                  Running Benchmark
                  Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad9ccb3c010, B=0x2ad9ceb7d010, C=0x2ad9d0b8e010, (C-A=8430592, (C-B)/w=4104))
                  Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.418
                  rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -v
                  Use -? for help
                  Cannot use multiple devices without multithreading
                  Was able to allocate 21 bbuffers
                  Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                  Doing initial run... Done
                  Initializing Matrix C
                  Running Benchmark
                  Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad235f2a010, B=0x2ad237f6b010, C=0x2ad239f7c010, (C-A=8430592, (C-B)/w=4104))
                  Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.210 System Gflops 163.892
                  Times:  Kernel                    Divide (1,1)          Merge                 Copy To               Copy From
                          0.0737 (465.7270 Gflops)  0.0296 (2.2666 GB/s)  0.0934 (1.4375 GB/s)  0.0128 (5.2401 GB/s)  0.0000 (0.0000 Gb/s)

                    • caldgemm with HD6990s
                      Marix

                      Regarding the low system performance: there is currently a known performance issue with all HD6000 series devices. It can be tuned around, and there is a new version with some proper workarounds in the queue. However, the copy speeds actually look quite good in your case. Your matrix size is rather small, though. While you should still get decent performance at that size, it would be interesting to see what you can reach at 20k or even 40k for m and n (k may stay at 1024).
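As a rough cross-check of the figures in the benchmark output quoted earlier in the thread, the GFLOPS values can be recomputed from the matrix sizes and the printed times. This is only a sketch: it assumes the benchmark counts 2·m·n·k floating-point operations per DGEMM call, which is the conventional count and matches the reported numbers to within the rounding of the printed times. The function name `dgemm_gflops` is made up for illustration.

```python
def dgemm_gflops(m, n, k, seconds):
    """GFLOPS of one DGEMM call, assuming 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    # Times taken from the single-HD6970 run with m=n=4096, k=1024.
    system = dgemm_gflops(4096, 4096, 1024, 0.210)   # reported: 163.892
    kernel = dgemm_gflops(4096, 4096, 1024, 0.0737)  # reported: 465.7270
    print(f"system: {system:.1f} GFLOPS, kernel: {kernel:.1f} GFLOPS")
```

The gap between the two numbers comes entirely from the time base: the kernel figure divides by the GPU kernel time alone, while the system figure divides by the wall-clock time of the whole call, including the divide, merge and copy steps.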

                        • caldgemm with HD6990s
                          rollyng

                           


                          Hi Marix, thanks for the info. I reran the test, this time on the single-HD6970 system, which has 4 GB of host memory, so I can only go up to m=n=16384.

                          Please have a look at the output; the best I get is 212 GFLOPS. Thank you.

                           

                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g
                          Use -? for help
                          Cannot use multiple devices without multithreading
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ae9c5ae4010, B=0x2ae9c7b25010, C=0x2ae9c9b36010, (C-A=8430592, (C-B)/w=4104))
                          Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.328 System Gflops 104.980
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 8192 -n 8192
                          Use -? for help
                          Cannot use multiple devices without multithreading
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2af03c0d3010, B=0x2af040154010, C=0x2af044165010, (C-A=16851968, (C-B)/w=8200))
                          Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 1.003 System Gflops 137.169
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 16384 -n 16384
                          Use -? for help
                          Cannot use multiple devices without multithreading
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b02a29f4010, B=0x2b02aaaf5010, C=0x2b02b2b06010, (C-A=33694720, (C-B)/w=16392))
                          Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 3.640 System Gflops 151.174
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -m 32768 -n 32768
                          Use -? for help
                          Cannot use multiple devices without multithreading
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
                          what(): std::bad_alloc
                          Aborted (core dumped)
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 8192 -n 8192
                          Use -? for help
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b2177e92010, B=0x2b217bf13010, C=0x2b217ff24010, (C-A=16851968, (C-B)/w=8200))
                          Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.748 System Gflops 184.007
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 16384 -n 16384
                          Use -? for help
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2ba071595010, B=0x2ba079696010, C=0x2ba0816a7010, (C-A=33694720, (C-B)/w=16392))
                          Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.587 System Gflops 212.735
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -m 32768 -n 32768
                          Use -? for help
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc Cterminate called after throwing an instance of 'std::bad_alloc'
                          what(): std::bad_alloc
                          Aborted (core dumped)
                          rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -g -z -v -m 16384 -n 16384
                          Use -? for help
                          Was able to allocate 21 bbuffers
                          Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                          Doing initial run... Done
                          Initializing Matrix C
                          Running Benchmark
                          Starting DGEMM Run m=16384 k=1024 n=16384 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x4008 LDC=0x4008 At=0 Bt=0 ColMajor=0 (A=0x2b344bd96010, B=0x2b3453e97010, C=0x2b345bea8010, (C-A=33694720, (C-B)/w=16392))
                          Program: caldgemm Sizes - A: 16384x1024 B: 1024x16384 C:16384x16384 (Host: rolly-p5q-pro) System Time 2.949 System Gflops 186.577
                          Times:  Kernel                    Divide (4,4)          Merge                 Copy To               Copy From
                                  1.2474 (440.5147 Gflops)  0.2862 (0.9380 GB/s)  0.4570 (0.0000 GB/s)  0.1072 (2.5045 GB/s)  0.0000 (0.0000 Gb/s)
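The std::bad_alloc at m=n=32768 in the runs above is consistent with a simple estimate of the host memory that the three matrices need. The sketch below is not taken from caldgemm. It assumes double-precision (8-byte) elements and a row pitch of "width + 8" elements, which is only an observation from the LDA/LDB/LDC values printed in these logs (0x408 for a width of 1024, 0x4008 for a width of 16384). The function names are made up for illustration.

```python
PAD = 8             # assumed row padding in elements, inferred from the logged LDA/LDB/LDC
BYTES_PER_ELEM = 8  # double precision


def matrix_bytes(rows, cols, pad=PAD):
    """Bytes used by a rows x cols matrix stored with a padded row pitch."""
    return rows * (cols + pad) * BYTES_PER_ELEM


def dgemm_host_bytes(m, n, k):
    """Bytes for A (m x k), B (k x n) and C (m x n) together."""
    return matrix_bytes(m, k) + matrix_bytes(k, n) + matrix_bytes(m, n)


if __name__ == "__main__":
    for size in (16384, 32768):
        gib = dgemm_host_bytes(size, size, 1024) / 2**30
        print(f"m=n={size}, k=1024: about {gib:.2f} GiB of host memory")
```

Under these assumptions the m=n=16384 case needs roughly 2.25 GiB, which fits into 4 GB of host memory, while matrix C alone takes about 8 GiB at m=n=32768, so the allocation cannot succeed on that machine.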

                        • caldgemm with HD6990s
                          drohr

                          Hi rollyng,

                          As Marix said, there is an issue related to 6000-series GPUs that decreases system performance dramatically. However, the -z parameter should actually work.

                          To help debug this problem, can you do the following:

                          Activate the DEBUG_MSG_ALLOCATION switch in caldgemm_config.h.

                          Set the STD_OUT parameter to stderr in caldgemm_config.h.

                          Run dgemm_bench -g -z -v -d and paste the output.

                          Can you please also tell me exactly which version you are using?

                          Cheers

                            • caldgemm with HD6990s
                              rollyng

                              Hi David,

                              Thanks for your message. I did the following for the CPU-only run; please take a look at this first.

                              rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Cannot use multiple devices without multithreading Initializing CALDGEMM for 1 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 
buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name 
o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 
Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 
obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Was able to allocate 21 bbuffers Using 8 CPU cores at 2401 MHz, 1 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2afa3f516010, B=0x2afa41557010, C=0x2afa43568010, (C-A=8430592, (C-B)/w=4104)) Running CPU only DGEMM DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.542 System Gflops 63.429 Times: Kernel Divide (0,0) Merge Copy To Copy From 0.0000 (inf Gflops) 0.0000 (-nan GB/s) 0.0000 (inf GB/s) 0.0000 (-nan GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing temporary CAL memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing temporary CAL memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing temporary CAL memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing temporary CAL memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing 
CAL GPU memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU memory, device 0 context 0 buffer 8 Freeing CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 Freeing temporary CAL memory, device 0 context 1 buffer 0 Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing temporary CAL memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing temporary CAL memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing temporary CAL memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing 
CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 
0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing context for device 0 Uninitializing CAL runtime rolly@rolly-X8DTG-QF:~/caldgemm$

                              • caldgemm with HD6990s
                                rollyng

                                Here is the single GPU run on one of these HD6990s.

                                rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Cannot use multiple devices without multithreading Initializing CALDGEMM for 1 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 
0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name 
o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 
Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 
obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Was able to allocate 21 bbuffers Using 8 CPU cores at 1600 MHz, 1 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2ad1b34c6010, B=0x2ad1b5507010, C=0x2ad1b7518010, (C-A=8430592, (C-B)/w=4104)) Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024) Caldgemm Main Thread, setting CPU mask 1 Initiliazing GPU Constant Buffers...0 Done GPU Curve Ration: 0.70, CPUScale 0.18, GPUScale 1.17 GPURatio automatically set to 0.94 Favoring m direction, 1 blocks Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0) Running Preprocessing device = 0 k = 0 Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2ad1b34c6010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0) Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2ad1b5507010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1) Copying part of A to GPU (k = 0, m = 0, n = 0) Starting conversion kernel Total Kernel Time: 0.0006 Copying part of B to GPU (k = 0, m = 0, n = 0) Starting conversion kernel Total Kernel Time: 0.0194 Waiting for event from device 0 obuffer 0... Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0) Total Kernel Time: 0.6100 Processing Output (Iteration 1) for device 0 tile 0 (m = 0, n = 0) Waiting for event from device 0 obuffer 0... 
Merging buffer (device 0, obuffer 0, k = 0, main thread) Main thread unlocking obuffer mutex devuce 0 obuffer 0 Processing Output (Iteration 2) for device 0 tile 1 (m = 1, n = 0) Waiting for event from device 0 obuffer 1... Caldgemm Main Thread, setting CPU mask 80 DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.731 System Gflops 47.081 Times: Kernel Divide (1,1) Merge Copy To Copy From 0.6100 (56.2969 Gflops) 0.0191 (3.5123 GB/s) 0.0696 (1.9273 GB/s) 0.0301 (2.2296 GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing temporary CAL memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing temporary CAL memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing temporary CAL memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing temporary CAL memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL Host memory, device 0 context 0 buffer 5 Freeing CAL Host memory, device 0 context 0 buffer 6 Freeing CAL Host memory, device 0 context 0 buffer 7 Freeing CAL Host memory, device 0 context 0 buffer 8 Freeing CAL Host memory, device 0 context 0 buffer 9 Freeing CAL Host memory, device 0 context 0 buffer 10 Freeing CAL Host memory, device 0 context 0 buffer 11 Freeing CAL Host memory, device 0 context 0 buffer 12 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing CAL GPU memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU 
memory, device 0 context 0 buffer 8 Freeing CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 Freeing temporary CAL memory, device 0 context 1 buffer 0 Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing temporary CAL memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing temporary CAL memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing temporary CAL memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU 
memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 
buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing context for device 0 Uninitializing CAL runtime rolly@rolly-X8DTG-QF:~/caldgemm$
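
As a sanity check on the figures in the log above, the reported Gflops can be recomputed from the matrix sizes and the printed times. This is only a sketch: it assumes the benchmark uses the conventional DGEMM operation count of 2*m*n*k floating-point operations, and the helper name `dgemm_gflops` is made up for illustration and is not part of caldgemm.

```python
def dgemm_gflops(m, n, k, seconds):
    """Return GFLOP/s for an m x k by k x n matrix product.

    Assumes the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Values copied from the single-GPU log: m=4096 k=1024 n=4096
m, k, n = 4096, 1024, 4096
kernel_time = 0.6100   # "Kernel" time printed in the log, in seconds
system_time = 0.731    # "System Time" printed in the log, in seconds

kernel_gflops = dgemm_gflops(m, n, k, kernel_time)
system_gflops = dgemm_gflops(m, n, k, system_time)

print("kernel: %.1f Gflops" % kernel_gflops)
print("system: %.1f Gflops" % system_gflops)
```

This gives roughly 56.3 and 47.0 Gflops, close to the 56.2969 and 47.081 in the log; the small differences are what one would expect from the times being rounded when printed.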

                                • caldgemm with HD6990s
                                  rollyng

                                  Now I run it with -z, for CPU only:

                                  rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name 
for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 
context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 
obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer 
for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name 
i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting 
module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger 
Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 
buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host 
Constant buffer device 2 context 0 buffer 4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 
0 kernel 1 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for 
device 2 obuffer 1 buffer 8 Allocating device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$

                                  • caldgemm with HD6990s
                                    rollyng

                                    Finally, I tried -z for the GPUs. For me the -z option does not seem to work at all.

                                    By the way, I just ran "git clone git://code.compeng.uni-frankfurt.de/caldgemm". Do I have the latest version of caldgemm?

                                    Thanks!

                                     

                                     

                                    rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -v -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating temporary device buffer for device 0 context 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating temporary device buffer for device 0 context 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating temporary device buffer for device 0 context 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating temporary device buffer for device 0 context 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer 
name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 
0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating temporary device buffer for device 0 context 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating temporary device buffer for device 0 context 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating temporary device buffer for device 0 context 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating temporary device buffer for device 0 context 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 
obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer 
for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating temporary device buffer for device 1 context 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating temporary device buffer for device 1 context 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating temporary device buffer for device 1 context 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating temporary device buffer for device 1 context 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name 
i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting 
module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating temporary device buffer for device 1 context 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating temporary device buffer for device 1 context 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating temporary device buffer for device 1 context 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating temporary device buffer for device 1 context 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger 
Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 
buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating temporary device buffer for device 2 context 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating temporary device buffer for device 2 context 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating temporary device buffer for device 2 context 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating temporary device buffer for device 2 context 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host 
Constant buffer device 2 context 0 buffer 4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 
0 kernel 1 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Allocating device buffer for device 2 obuffer 1 buffer 0 Allocating temporary device buffer for device 2 context 1 buffer 0 Allocating Host buffer for device 2 obuffer 1 buffer 1 Allocating device buffer for device 2 obuffer 1 buffer 1 Allocating temporary device buffer for device 2 context 1 buffer 1 Allocating Host buffer for device 2 obuffer 1 buffer 2 Allocating device buffer for device 2 obuffer 1 buffer 2 Allocating temporary device buffer for device 2 context 1 buffer 2 Allocating Host buffer for device 2 obuffer 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 3 Allocating temporary device buffer for device 2 context 1 buffer 3 Allocating device buffer for device 2 obuffer 1 buffer 5 Allocating device buffer for device 2 obuffer 1 buffer 6 Allocating device buffer for device 2 obuffer 1 buffer 7 Allocating device buffer for 
device 2 obuffer 1 buffer 8 Allocating device buffer for device 2 obuffer 1 buffer 9 Allocating device buffer for device 2 obuffer 1 buffer 10 Allocating device buffer for device 2 obuffer 1 buffer 11 Allocating device buffer for device 2 obuffer 1 buffer 12 There was an error in allocating resources and binding them to memory Error initializing CALDGEMM rolly@rolly-X8DTG-QF:~/caldgemm$

                                      • caldgemm with HD6990s
                                        drohr

                                        Hi rollyng,

                                        I tried to look into this. I plugged three 6970 GPUs into a node, but I cannot reproduce the issue you see.

                                        The log you posted tells me that the AMD runtime is unable to allocate host memory, i.e. I issue a malloc call for a page-locked buffer but get an error back.

                                        Could you please update to the current git revision or apply the attached patch? The debug message will then include the error code returned by the API, which is needed to analyze this further.

                                        As you said your system has only 4 GB of memory, you might be running out of page-locked memory.

                                        You can try using two GPUs and see whether that works with: ./dgemm_bench -z -v -d -Y 2

                                         

                                        Regards

                                        --- a/caldgemm.cpp
                                        +++ b/caldgemm.cpp
                                        @@ -3383,7 +3383,7 @@ int caldgemm::SetupData(CALmodule *module, CALresource* &_Res, BufferProperties*
                                             calResFree(_Res[j]);
                                         }
                                        - if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory\n");
                                        + if (nContext < obuffercount) fprintf(STD_OUT, "There was an error in allocating resources and binding them to memory (Error code %d)\n", r);
                                         else if (Config->Debug) fprintf(STD_OUT, "No more memory available for bbuffers\n");
                                         return(1);
                                         }
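
                                        To put a rough number on the page-locked memory concern, here is a small back-of-the-envelope sketch. It is not part of caldgemm, and the helper names are made up for illustration. It assumes each host buffer is 1024 x 1024 elements with 2 double components (the dimensions the debug output prints with -d), and that each device gets 8 such host buffers (obuffer 0 and obuffer 1, buffers 0 to 3 each, as in the posted log). The small constant buffer and any extra memory the runtime pins internally are ignored, so treat the result as a lower bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for one host buffer of width x height elements,
// where every element holds `components` doubles.
// (Hypothetical helper, not a caldgemm function.)
constexpr std::uint64_t host_buffer_bytes(std::uint64_t width,
                                          std::uint64_t height,
                                          std::uint64_t components)
{
    return width * height * components * sizeof(double);
}

// Total page-locked host memory if each of `devices` GPUs gets
// `buffers_per_device` host buffers of `bytes_per_buffer` bytes.
// (Hypothetical helper, not a caldgemm function.)
constexpr std::uint64_t total_pinned_bytes(std::uint64_t devices,
                                           std::uint64_t buffers_per_device,
                                           std::uint64_t bytes_per_buffer)
{
    return devices * buffers_per_device * bytes_per_buffer;
}
```

                                        Under these assumptions one buffer is 16 MiB, so 8 devices need about 1 GiB of page-locked host memory before counting anything else, which is a large share of a 4 GB system. With -Y 2 the same estimate drops to 256 MiB.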

                                          • caldgemm with HD6990s
                                            rollyng

                                            Hi, I recompiled the latest version after a git pull.

                                            With -c -z -d it now gives this output:

                                             

                                            rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -c -z -d
                                            Use -? for help
                                            Init Caldgemm, setting CPU mask 1
                                            CAL Runtime Version:1.4.1385
                                            Initializing CAL
                                            Was able to allocate 21 bbuffers
                                            Waiting for cblas slave to start
                                            Cblas helper thread started
                                            Cblas thread Thread, setting CPU mask 80
                                            Waiting for linpack slave to start
                                            Using 8 CPU cores at 1600 MHz, 0 GPUs of 0 shaders at 0 MHz
                                            Caldgemm Init complete, setting CPU mask 80
                                            Linpack helper thread started
                                            Linpack Thread, setting CPU mask 8
                                            Initializing Data... ...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done
                                            Initializing Matrix C
                                            Running Benchmark
                                            Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2aedb2bae010, B=0x2aedb4bef010, C=0x2aedb6c00010, (C-A=8430592, (C-B)/w=4104))
                                            Running CPU only DGEMM
                                            DGEMM Run Complete
                                            Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.558 System Gflops 61.684
                                            Uninitializing CALDGEMM
                                            Uninitializing CAL runtime
                                            Trying to terminate linpack slave
                                            Waiting for linpack slave to terminate
                                            Waiting for merge threads to terminate
                                            linpack slave terminating
                                            rolly@rolly-X8DTG-QF:~/caldgemm$

                                            • caldgemm with HD6990s
                                              rollyng

                                              Now with -g -z -d it still ends with an error:

                                               

                                              rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -g -z -d Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 8 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b94b3c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b94b4c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b94b5c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b94b6c95000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3c43120, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 buffer 
3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3 
Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b94b7e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b94b8e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b94b9e96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b94bae96000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for device 
0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device buffer 
for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b94bc097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b94bd097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b94be097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b94bf097000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3c70ff0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 obuffer 
0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 name o4 
Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b94c0298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b94c1298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b94c2298000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b94c3298000, Width = 1024, 
Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 
Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Allocating Host buffer for device 2 obuffer 0 buffer 0 Clearing Memory at 0x2b94c4499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 0 Allocating Host buffer for device 2 obuffer 0 buffer 1 Clearing Memory at 0x2b94c5499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 1 Allocating Host buffer for device 2 obuffer 0 buffer 2 Clearing Memory at 0x2b94c6499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 2 Allocating Host buffer for device 2 obuffer 0 buffer 3 Clearing 
Memory at 0x2b94c7499000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 3 Allocating Host memory for device 2 obuffer 0 buffer 4 Clearing Memory at 0x3c9e810, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 2 obuffer 0 buffer 5 Allocating device buffer for device 2 obuffer 0 buffer 6 Allocating device buffer for device 2 obuffer 0 buffer 7 Allocating device buffer for device 2 obuffer 0 buffer 8 Allocating device buffer for device 2 obuffer 0 buffer 9 Allocating device buffer for device 2 obuffer 0 buffer 10 Allocating device buffer for device 2 obuffer 0 buffer 11 Allocating device buffer for device 2 obuffer 0 buffer 12 Allocating Host Constant buffer device 2 context 0 buffer 4 Getting module buffer name for device 2 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 1 
buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 2 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 2 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 2 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 2 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 2 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 2 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 2 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 2 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 2 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 2 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 2 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 2 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 2 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 20 Allocating Host buffer for device 2 obuffer 1 buffer 0 Clearing Memory at 0x2b94c869a000, Width = 1024, Height = 1024, components = 2, type=double Allocating device 
buffer for device 2 obuffer 1 buffer 0
Allocating Host buffer for device 2 obuffer 1 buffer 1
Clearing Memory at 0x2b94c969a000, Width = 1024, Height = 1024, components = 2, type=double
Allocating device buffer for device 2 obuffer 1 buffer 1
Allocating Host buffer for device 2 obuffer 1 buffer 2
Error 'Operational error' while allocattion of remote memory
Error initializing CALDGEMM
rolly@rolly-X8DTG-QF:~/caldgemm$
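
A rough tally of the large host buffers in this log may help narrow things down. Each "Clearing Memory" line for them reports Width = 1024, Height = 1024, components = 2, type=double, which is 16 MiB if a double is 8 bytes, and the printed addresses step by 0x1000000, which matches. Devices 0 and 1 each got 8 such buffers, device 2 got 6, and the next request failed. The sketch below only does that arithmetic. The helper names are hypothetical, and whether a limit on this kind of memory is the actual cause is an assumption that the log does not confirm:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size in bytes of one buffer as described by a "Clearing Memory" log line.
std::uint64_t buffer_bytes(std::uint64_t width, std::uint64_t height,
                           std::uint64_t components, std::uint64_t element_size) {
    return width * height * components * element_size;
}

// Number of large host buffers that were allocated before the failure:
// `full_devices` devices got all `per_device` buffers, and the next
// device got only `partial` of them.
std::uint64_t host_buffers_before_failure(std::uint64_t full_devices,
                                          std::uint64_t per_device,
                                          std::uint64_t partial) {
    assert(partial < per_device);  // otherwise that device was not partial
    return full_devices * per_device + partial;
}

// Total bytes held by `count` buffers of `bytes_each` bytes.
std::uint64_t total_bytes(std::uint64_t count, std::uint64_t bytes_each) {
    return count * bytes_each;
}
```

For this log, host_buffers_before_failure(2, 8, 6) is 22, and total_bytes(22, buffer_bytes(1024, 1024, 2, sizeof(double))) is 352 MiB, so the failing request would have been the 23rd buffer of this size. The small 8x1 "Host memory" buffers are ignored here because they are only 128 bytes each.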

                                              • caldgemm with HD6990s
                                                rollyng

                                                With 2 GPUs it finishes! Does this mean the current version of caldgemm cannot run on 4x HD6990s (8 GPUs)? Thanks!

                                                 

                                                rolly@rolly-X8DTG-QF:~/caldgemm$ ./dgemm_bench -z -v -d -Y 2 Use -? for help Init Caldgemm, setting CPU mask 1 CAL Runtime Version:1.4.1385 Initializing CAL Initializing CALDGEMM for 2 devices Allocating Host buffer for device 0 obuffer 0 buffer 0 Clearing Memory at 0x2b40865a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 0 Allocating Host buffer for device 0 obuffer 0 buffer 1 Clearing Memory at 0x2b40875a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 1 Allocating Host buffer for device 0 obuffer 0 buffer 2 Clearing Memory at 0x2b40885a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 2 Allocating Host buffer for device 0 obuffer 0 buffer 3 Clearing Memory at 0x2b40895a8000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 3 Allocating Host memory for device 0 obuffer 0 buffer 4 Clearing Memory at 0x3105020, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 0 obuffer 0 buffer 5 Allocating device buffer for device 0 obuffer 0 buffer 6 Allocating device buffer for device 0 obuffer 0 buffer 7 Allocating device buffer for device 0 obuffer 0 buffer 8 Allocating device buffer for device 0 obuffer 0 buffer 9 Allocating device buffer for device 0 obuffer 0 buffer 10 Allocating device buffer for device 0 obuffer 0 buffer 11 Allocating device buffer for device 0 obuffer 0 buffer 12 Allocating Host Constant buffer device 0 context 0 buffer 4 Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 0 
buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name 
i3 Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 2 Allocating Host buffer for device 0 obuffer 1 buffer 0 Clearing Memory at 0x2b408a7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 0 Allocating Host buffer for device 0 obuffer 1 buffer 1 Clearing Memory at 0x2b408b7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 1 Allocating Host buffer for device 0 obuffer 1 buffer 2 Clearing Memory at 0x2b408c7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 2 Allocating Host buffer for device 0 obuffer 1 buffer 3 Clearing Memory at 0x2b408d7a9000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 0 obuffer 1 buffer 3 Allocating device buffer for device 0 obuffer 1 buffer 5 Allocating device buffer for device 0 obuffer 1 buffer 6 Allocating device buffer for device 0 obuffer 1 buffer 7 Allocating device buffer for device 0 obuffer 1 buffer 8 Allocating device buffer for device 0 obuffer 1 buffer 9 Allocating device buffer for device 0 obuffer 1 buffer 10 Allocating device buffer for device 0 obuffer 1 buffer 11 Allocating device buffer for 
device 0 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 4 Allocating device buffer for device 0 obuffer 2 buffer 2 Allocating device buffer for device 0 obuffer 2 buffer 3 Allocating device buffer for device 0 obuffer 2 buffer 5 Allocating device buffer for device 0 obuffer 2 buffer 6 Allocating device buffer for device 0 obuffer 2 buffer 7 Allocating device buffer for device 0 obuffer 2 buffer 8 Allocating device buffer for device 0 obuffer 2 buffer 9 Allocating device buffer for device 0 obuffer 2 buffer 10 Allocating device buffer for device 0 obuffer 2 buffer 11 Allocating device buffer for device 0 obuffer 2 buffer 12 Allocating device buffer for device 0 obuffer 3 buffer 2 Allocating device buffer for device 0 obuffer 3 buffer 3 Allocating device buffer for device 0 obuffer 4 buffer 2 Allocating device buffer for device 0 obuffer 4 buffer 3 Allocating device buffer for device 0 obuffer 5 buffer 2 Allocating device buffer for device 0 obuffer 5 buffer 3 Allocating device buffer for device 0 obuffer 6 buffer 2 Allocating device buffer for device 0 obuffer 6 buffer 3 Allocating device buffer for device 0 obuffer 7 buffer 2 Allocating device buffer for device 0 obuffer 7 buffer 3 Allocating device buffer for device 0 obuffer 8 buffer 2 Allocating device buffer for device 0 obuffer 8 buffer 3 Allocating device buffer for device 0 obuffer 9 buffer 2 Allocating device buffer for device 0 obuffer 9 buffer 3 Allocating device buffer for device 0 obuffer 10 buffer 2 Allocating device buffer for device 0 obuffer 10 buffer 3 Allocating device buffer for device 0 obuffer 11 buffer 2 Allocating device buffer for device 0 obuffer 11 buffer 3 Allocating device buffer for device 0 obuffer 12 buffer 2 Allocating device buffer for device 0 obuffer 12 buffer 3 Allocating device buffer for device 0 obuffer 13 buffer 2 Allocating device buffer for device 0 obuffer 13 buffer 3 Allocating device buffer for device 0 obuffer 14 buffer 2 Allocating device 
buffer for device 0 obuffer 14 buffer 3 Allocating device buffer for device 0 obuffer 15 buffer 2 Allocating device buffer for device 0 obuffer 15 buffer 3 Allocating device buffer for device 0 obuffer 16 buffer 2 Allocating device buffer for device 0 obuffer 16 buffer 3 Allocating device buffer for device 0 obuffer 17 buffer 2 Allocating device buffer for device 0 obuffer 17 buffer 3 Allocating device buffer for device 0 obuffer 18 buffer 2 Allocating device buffer for device 0 obuffer 18 buffer 3 Allocating device buffer for device 0 obuffer 19 buffer 2 Allocating device buffer for device 0 obuffer 19 buffer 3 Allocating device buffer for device 0 obuffer 20 buffer 2 Allocating device buffer for device 0 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 0 Allocating Host buffer for device 1 obuffer 0 buffer 0 Clearing Memory at 0x2b408e9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 0 Allocating Host buffer for device 1 obuffer 0 buffer 1 Clearing Memory at 0x2b408f9aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 1 Allocating Host buffer for device 1 obuffer 0 buffer 2 Clearing Memory at 0x2b40909aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 2 Allocating Host buffer for device 1 obuffer 0 buffer 3 Clearing Memory at 0x2b40919aa000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 3 Allocating Host memory for device 1 obuffer 0 buffer 4 Clearing Memory at 0x3132ef0, Width = 8, Height = 1, components = 2, type=double Allocating device buffer for device 1 obuffer 0 buffer 5 Allocating device buffer for device 1 obuffer 0 buffer 6 Allocating device buffer for device 1 obuffer 0 buffer 7 Allocating device buffer for device 1 obuffer 0 buffer 8 Allocating device buffer for device 1 
obuffer 0 buffer 9 Allocating device buffer for device 1 obuffer 0 buffer 10 Allocating device buffer for device 1 obuffer 0 buffer 11 Allocating device buffer for device 1 obuffer 0 buffer 12 Allocating Host Constant buffer device 1 context 0 buffer 4 Getting module buffer name for device 1 context 0 kernel 0 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 0 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 0 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 0 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 0 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 0 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 0 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 0 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 0 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 0 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 0 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 0 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 0 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 1 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 1 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 1 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 1 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 1 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 1 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 1 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 1 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 1 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 1 buffer 9 
name o4 Getting module buffer name for device 1 context 0 kernel 1 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 1 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 1 buffer 12 name o7 Getting module buffer name for device 1 context 0 kernel 2 buffer 0 name i0 Getting module buffer name for device 1 context 0 kernel 2 buffer 1 name i1 Getting module buffer name for device 1 context 0 kernel 2 buffer 2 name i2 Getting module buffer name for device 1 context 0 kernel 2 buffer 3 name i3 Getting module buffer name for device 1 context 0 kernel 2 buffer 4 name cb0 Getting module buffer name for device 1 context 0 kernel 2 buffer 5 name o0 Getting module buffer name for device 1 context 0 kernel 2 buffer 6 name o1 Getting module buffer name for device 1 context 0 kernel 2 buffer 7 name o2 Getting module buffer name for device 1 context 0 kernel 2 buffer 8 name o3 Getting module buffer name for device 1 context 0 kernel 2 buffer 9 name o4 Getting module buffer name for device 1 context 0 kernel 2 buffer 10 name o5 Getting module buffer name for device 1 context 0 kernel 2 buffer 11 name o6 Getting module buffer name for device 1 context 0 kernel 2 buffer 12 name o7 Merger Thread 0 started Merge Thread 0, setting CPU mask 8 Allocating Host buffer for device 1 obuffer 1 buffer 0 Clearing Memory at 0x2b4092bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 0 Allocating Host buffer for device 1 obuffer 1 buffer 1 Clearing Memory at 0x2b4093bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 1 Allocating Host buffer for device 1 obuffer 1 buffer 2 Clearing Memory at 0x2b4094bab000, Width = 1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 2 Allocating Host buffer for device 1 obuffer 1 buffer 3 Clearing Memory at 0x2b4095bab000, Width = 
1024, Height = 1024, components = 2, type=double Allocating device buffer for device 1 obuffer 1 buffer 3 Allocating device buffer for device 1 obuffer 1 buffer 5 Allocating device buffer for device 1 obuffer 1 buffer 6 Allocating device buffer for device 1 obuffer 1 buffer 7 Allocating device buffer for device 1 obuffer 1 buffer 8 Allocating device buffer for device 1 obuffer 1 buffer 9 Allocating device buffer for device 1 obuffer 1 buffer 10 Allocating device buffer for device 1 obuffer 1 buffer 11 Allocating device buffer for device 1 obuffer 1 buffer 12 Merger Thread 1 started Merge Thread 1, setting CPU mask 10 Allocating device buffer for device 1 obuffer 2 buffer 2 Allocating device buffer for device 1 obuffer 2 buffer 3 Allocating device buffer for device 1 obuffer 2 buffer 5 Allocating device buffer for device 1 obuffer 2 buffer 6 Allocating device buffer for device 1 obuffer 2 buffer 7 Allocating device buffer for device 1 obuffer 2 buffer 8 Allocating device buffer for device 1 obuffer 2 buffer 9 Allocating device buffer for device 1 obuffer 2 buffer 10 Allocating device buffer for device 1 obuffer 2 buffer 11 Allocating device buffer for device 1 obuffer 2 buffer 12 Allocating device buffer for device 1 obuffer 3 buffer 2 Allocating device buffer for device 1 obuffer 3 buffer 3 Allocating device buffer for device 1 obuffer 4 buffer 2 Allocating device buffer for device 1 obuffer 4 buffer 3 Allocating device buffer for device 1 obuffer 5 buffer 2 Allocating device buffer for device 1 obuffer 5 buffer 3 Allocating device buffer for device 1 obuffer 6 buffer 2 Allocating device buffer for device 1 obuffer 6 buffer 3 Allocating device buffer for device 1 obuffer 7 buffer 2 Allocating device buffer for device 1 obuffer 7 buffer 3 Allocating device buffer for device 1 obuffer 8 buffer 2 Allocating device buffer for device 1 obuffer 8 buffer 3 Allocating device buffer for device 1 obuffer 9 buffer 2 Allocating device buffer for device 1 obuffer 9 buffer 3 
Allocating device buffer for device 1 obuffer 10 buffer 2 Allocating device buffer for device 1 obuffer 10 buffer 3 Allocating device buffer for device 1 obuffer 11 buffer 2 Allocating device buffer for device 1 obuffer 11 buffer 3 Allocating device buffer for device 1 obuffer 12 buffer 2 Allocating device buffer for device 1 obuffer 12 buffer 3 Allocating device buffer for device 1 obuffer 13 buffer 2 Allocating device buffer for device 1 obuffer 13 buffer 3 Allocating device buffer for device 1 obuffer 14 buffer 2 Allocating device buffer for device 1 obuffer 14 buffer 3 Allocating device buffer for device 1 obuffer 15 buffer 2 Allocating device buffer for device 1 obuffer 15 buffer 3 Allocating device buffer for device 1 obuffer 16 buffer 2 Allocating device buffer for device 1 obuffer 16 buffer 3 Allocating device buffer for device 1 obuffer 17 buffer 2 Allocating device buffer for device 1 obuffer 17 buffer 3 Allocating device buffer for device 1 obuffer 18 buffer 2 Allocating device buffer for device 1 obuffer 18 buffer 3 Allocating device buffer for device 1 obuffer 19 buffer 2 Allocating device buffer for device 1 obuffer 19 buffer 3 Allocating device buffer for device 1 obuffer 20 buffer 2 Allocating device buffer for device 1 obuffer 20 buffer 3 Was able to allocate 21 bbuffers on device 1 Was able to allocate 21 bbuffers Waiting for linpack slave to start Using 8 CPU cores at 1600 MHz, 2 GPUs of 1536 shaders at 830 MHz Caldgemm Init complete, setting CPU mask 80 Linpack helper thread started Linpack Thread, setting CPU mask 20 Initializing Data... 
...alloc A...alloc B...alloc C...init A...init BUser Data Initialized ...Done Initializing Matrix C Running Benchmark Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b4096fad010, B=0x2b4098fee010, C=0x2b409afff010, (C-A=8430592, (C-B)/w=4104)) Using Kernel 2 (alpha=0xBFF0000000000000 (-1.000), width = 1024) Caldgemm Main Thread, setting CPU mask 1 Initiliazing GPU Constant Buffers...01 Done GPU Curve Ration: 0.70, CPUScale 0.12, GPUScale 2.34 GPURatio automatically set to 0.98 Favoring m direction, 1 blocks Iteration k = 0, m = 0, n = 0 (device 0 obuffer 0) Running Preprocessing device = 0 k = 0 Dividing Buffer A (device = 0, k = 0, buffer = 0) SRC=0x2b4096fad010, w: 1024, h: 4096, pitch: 1032 (gpuw: 1024, gpuh: 4096, transpose: 0) Dividing Buffer B (device = 0, k = 0, buffer = 0) SRC=0x2b4098fee010, w: 1024, h: 4096, pitch: 4104 (gpuw: 1024, gpuh: 4096, transpose: 1) Copying part of A to GPU (k = 0, m = 0, n = 0) Copying part of B to GPU (k = 0, m = 0, n = 0) Locking obuffer mutex 0/0 Waiting for event from device 0 obuffer 0... Executing MM kernel (device 0 obuffer 0, k=0 m=0 n=0) Total Kernel Time: 0.5996 Processing Output (Iteration 2) for device 0 tile 0 (m = 0, n = 0) Waiting for event from device 0 obuffer 0... Unlocking outputthread mutex 0 to process device 0 obuffer 0 Processing Output (Iteration 3) for device 1 tile 1 (m = 1, n = 0) Waiting for event from device 1 obuffer 0... Processing Output (Iteration 4) for device 0 tile 2 (m = 2, n = 0) Waiting for event from device 0 obuffer 1... 
Waiting to finish merge process for device 0 obuffer 0 Slave thread 0 (device 0) starting merge process for obuffer 0 (k = 0) Merge time: 0.080 Unlocking mutex device 0 obuffer 0 (Slavethread 0) Waiting to finish merge process for device 1 obuffer 0 Waiting to finish merge process for device 1 obuffer 1 Waiting to finish merge process for device 0 obuffer 2 Waiting to finish merge process for device 1 obuffer 2 Caldgemm Main Thread, setting CPU mask 80 DGEMM Run Complete Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-X8DTG-QF) System Time 0.733 System Gflops 46.938 Times: Kernel Divide (1,1) Merge Copy To Copy From 0.5996 (57.2786 Gflops) 0.0213 (3.1549 GB/s) 0.0803 (0.0000 GB/s) 0.0309 (2.1699 GB/s) 0.0000 (0.0000 Gb/s) Uninitializing CALDGEMM Uninitializing buffers for device 0 context 0 Freeing CAL Host memory, device 0 context 0 buffer 0 Freeing CAL Host memory, device 0 context 0 buffer 1 Freeing CAL Host memory, device 0 context 0 buffer 2 Freeing CAL Host memory, device 0 context 0 buffer 3 Freeing CAL Host memory, device 0 context 0 buffer 4 Freeing CAL Host memory, device 0 context 0 buffer 5 Freeing CAL Host memory, device 0 context 0 buffer 6 Freeing CAL Host memory, device 0 context 0 buffer 7 Freeing CAL Host memory, device 0 context 0 buffer 8 Freeing CAL Host memory, device 0 context 0 buffer 9 Freeing CAL Host memory, device 0 context 0 buffer 10 Freeing CAL Host memory, device 0 context 0 buffer 11 Freeing CAL Host memory, device 0 context 0 buffer 12 Freeing CAL GPU memory, device 0 context 0 buffer 0 Freeing CAL GPU memory, device 0 context 0 buffer 1 Freeing CAL GPU memory, device 0 context 0 buffer 2 Freeing CAL GPU memory, device 0 context 0 buffer 3 Freeing CAL GPU memory, device 0 context 0 buffer 4 Freeing CAL GPU memory, device 0 context 0 buffer 5 Freeing CAL GPU memory, device 0 context 0 buffer 6 Freeing CAL GPU memory, device 0 context 0 buffer 7 Freeing CAL GPU memory, device 0 context 0 buffer 8 Freeing 
CAL GPU memory, device 0 context 0 buffer 9 Freeing CAL GPU memory, device 0 context 0 buffer 10 Freeing CAL GPU memory, device 0 context 0 buffer 11 Freeing CAL GPU memory, device 0 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 0 context 1 Freeing CAL Host memory, device 0 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 0 context 1 buffer 1 Freeing CAL Host memory, device 0 context 1 buffer 2 Freeing CAL Host memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 0 Freeing CAL GPU memory, device 0 context 1 buffer 1 Freeing CAL GPU memory, device 0 context 1 buffer 2 Freeing CAL GPU memory, device 0 context 1 buffer 3 Freeing CAL GPU memory, device 0 context 1 buffer 5 Freeing CAL GPU memory, device 0 context 1 buffer 6 Freeing CAL GPU memory, device 0 context 1 buffer 7 Freeing CAL GPU memory, device 0 context 1 buffer 8 Freeing CAL GPU memory, device 0 context 1 buffer 9 Freeing CAL GPU memory, device 0 context 1 buffer 10 Freeing CAL GPU memory, device 0 context 1 buffer 11 Freeing CAL GPU memory, device 0 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 0 context 2 Freeing CAL GPU memory, device 0 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 0 context 2 buffer 3 Freeing CAL GPU memory, device 0 context 2 buffer 5 Freeing CAL GPU memory, device 0 context 2 buffer 6 Freeing CAL GPU memory, device 0 context 2 buffer 7 Freeing CAL GPU memory, device 0 context 2 buffer 8 Freeing CAL GPU memory, device 0 context 2 buffer 9 Freeing CAL GPU memory, device 0 context 2 buffer 10 Freeing CAL GPU memory, device 0 context 2 buffer 11 Freeing CAL GPU memory, device 0 context 2 buffer 12 Uninitializing buffers for device 0 context 3 Freeing CAL GPU memory, device 0 context 3 buffer 2 Freeing CAL GPU memory, device 0 context 3 buffer 3 Uninitializing buffers for device 0 context 4 Freeing CAL GPU memory, 
device 0 context 4 buffer 2 Freeing CAL GPU memory, device 0 context 4 buffer 3 Uninitializing buffers for device 0 context 5 Freeing CAL GPU memory, device 0 context 5 buffer 2 Freeing CAL GPU memory, device 0 context 5 buffer 3 Uninitializing buffers for device 0 context 6 Freeing CAL GPU memory, device 0 context 6 buffer 2 Freeing CAL GPU memory, device 0 context 6 buffer 3 Uninitializing buffers for device 0 context 7 Freeing CAL GPU memory, device 0 context 7 buffer 2 Freeing CAL GPU memory, device 0 context 7 buffer 3 Uninitializing buffers for device 0 context 8 Freeing CAL GPU memory, device 0 context 8 buffer 2 Freeing CAL GPU memory, device 0 context 8 buffer 3 Uninitializing buffers for device 0 context 9 Freeing CAL GPU memory, device 0 context 9 buffer 2 Freeing CAL GPU memory, device 0 context 9 buffer 3 Uninitializing buffers for device 0 context 10 Freeing CAL GPU memory, device 0 context 10 buffer 2 Freeing CAL GPU memory, device 0 context 10 buffer 3 Uninitializing buffers for device 0 context 11 Freeing CAL GPU memory, device 0 context 11 buffer 2 Freeing CAL GPU memory, device 0 context 11 buffer 3 Uninitializing buffers for device 0 context 12 Freeing CAL GPU memory, device 0 context 12 buffer 2 Freeing CAL GPU memory, device 0 context 12 buffer 3 Uninitializing buffers for device 0 context 13 Freeing CAL GPU memory, device 0 context 13 buffer 2 Freeing CAL GPU memory, device 0 context 13 buffer 3 Uninitializing buffers for device 0 context 14 Freeing CAL GPU memory, device 0 context 14 buffer 2 Freeing CAL GPU memory, device 0 context 14 buffer 3 Uninitializing buffers for device 0 context 15 Freeing CAL GPU memory, device 0 context 15 buffer 2 Freeing CAL GPU memory, device 0 context 15 buffer 3 Uninitializing buffers for device 0 context 16 Freeing CAL GPU memory, device 0 context 16 buffer 2 Freeing CAL GPU memory, device 0 context 16 buffer 3 Uninitializing buffers for device 0 context 17 Freeing CAL GPU memory, device 0 context 17 buffer 
2 Freeing CAL GPU memory, device 0 context 17 buffer 3 Uninitializing buffers for device 0 context 18 Freeing CAL GPU memory, device 0 context 18 buffer 2 Freeing CAL GPU memory, device 0 context 18 buffer 3 Uninitializing buffers for device 0 context 19 Freeing CAL GPU memory, device 0 context 19 buffer 2 Freeing CAL GPU memory, device 0 context 19 buffer 3 Uninitializing buffers for device 0 context 20 Freeing CAL GPU memory, device 0 context 20 buffer 2 Freeing CAL GPU memory, device 0 context 20 buffer 3 Uninitializing buffers for device 1 context 0 Freeing CAL Host memory, device 1 context 0 buffer 0 Freeing CAL Host memory, device 1 context 0 buffer 1 Freeing CAL Host memory, device 1 context 0 buffer 2 Freeing CAL Host memory, device 1 context 0 buffer 3 Freeing CAL Host memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 0 Freeing CAL GPU memory, device 1 context 0 buffer 1 Freeing CAL GPU memory, device 1 context 0 buffer 2 Freeing CAL GPU memory, device 1 context 0 buffer 3 Freeing CAL GPU memory, device 1 context 0 buffer 4 Freeing CAL GPU memory, device 1 context 0 buffer 5 Freeing CAL GPU memory, device 1 context 0 buffer 6 Freeing CAL GPU memory, device 1 context 0 buffer 7 Freeing CAL GPU memory, device 1 context 0 buffer 8 Freeing CAL GPU memory, device 1 context 0 buffer 9 Freeing CAL GPU memory, device 1 context 0 buffer 10 Freeing CAL GPU memory, device 1 context 0 buffer 11 Freeing CAL GPU memory, device 1 context 0 buffer 12 Trying to terminate merge slave 0 Uninitializing buffers for device 1 context 1 Freeing CAL Host memory, device 1 context 1 buffer 0 merge slave 0 terminating Freeing CAL Host memory, device 1 context 1 buffer 1 Freeing CAL Host memory, device 1 context 1 buffer 2 Freeing CAL Host memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 0 Freeing CAL GPU memory, device 1 context 1 buffer 1 Freeing CAL GPU memory, device 1 context 1 buffer 2 Freeing CAL GPU 
memory, device 1 context 1 buffer 3 Freeing CAL GPU memory, device 1 context 1 buffer 5 Freeing CAL GPU memory, device 1 context 1 buffer 6 Freeing CAL GPU memory, device 1 context 1 buffer 7 Freeing CAL GPU memory, device 1 context 1 buffer 8 Freeing CAL GPU memory, device 1 context 1 buffer 9 Freeing CAL GPU memory, device 1 context 1 buffer 10 Freeing CAL GPU memory, device 1 context 1 buffer 11 Freeing CAL GPU memory, device 1 context 1 buffer 12 Trying to terminate merge slave 1 Uninitializing buffers for device 1 context 2 Freeing CAL GPU memory, device 1 context 2 buffer 2 merge slave 1 terminating Freeing CAL GPU memory, device 1 context 2 buffer 3 Freeing CAL GPU memory, device 1 context 2 buffer 5 Freeing CAL GPU memory, device 1 context 2 buffer 6 Freeing CAL GPU memory, device 1 context 2 buffer 7 Freeing CAL GPU memory, device 1 context 2 buffer 8 Freeing CAL GPU memory, device 1 context 2 buffer 9 Freeing CAL GPU memory, device 1 context 2 buffer 10 Freeing CAL GPU memory, device 1 context 2 buffer 11 Freeing CAL GPU memory, device 1 context 2 buffer 12 Uninitializing buffers for device 1 context 3 Freeing CAL GPU memory, device 1 context 3 buffer 2 Freeing CAL GPU memory, device 1 context 3 buffer 3 Uninitializing buffers for device 1 context 4 Freeing CAL GPU memory, device 1 context 4 buffer 2 Freeing CAL GPU memory, device 1 context 4 buffer 3 Uninitializing buffers for device 1 context 5 Freeing CAL GPU memory, device 1 context 5 buffer 2 Freeing CAL GPU memory, device 1 context 5 buffer 3 Uninitializing buffers for device 1 context 6 Freeing CAL GPU memory, device 1 context 6 buffer 2 Freeing CAL GPU memory, device 1 context 6 buffer 3 Uninitializing buffers for device 1 context 7 Freeing CAL GPU memory, device 1 context 7 buffer 2 Freeing CAL GPU memory, device 1 context 7 buffer 3 Uninitializing buffers for device 1 context 8 Freeing CAL GPU memory, device 1 context 8 buffer 2 Freeing CAL GPU memory, device 1 context 8 buffer 3 Uninitializing 
buffers for device 1 context 9 Freeing CAL GPU memory, device 1 context 9 buffer 2 Freeing CAL GPU memory, device 1 context 9 buffer 3 Uninitializing buffers for device 1 context 10 Freeing CAL GPU memory, device 1 context 10 buffer 2 Freeing CAL GPU memory, device 1 context 10 buffer 3 Uninitializing buffers for device 1 context 11 Freeing CAL GPU memory, device 1 context 11 buffer 2 Freeing CAL GPU memory, device 1 context 11 buffer 3 Uninitializing buffers for device 1 context 12 Freeing CAL GPU memory, device 1 context 12 buffer 2 Freeing CAL GPU memory, device 1 context 12 buffer 3 Uninitializing buffers for device 1 context 13 Freeing CAL GPU memory, device 1 context 13 buffer 2 Freeing CAL GPU memory, device 1 context 13 buffer 3 Uninitializing buffers for device 1 context 14 Freeing CAL GPU memory, device 1 context 14 buffer 2 Freeing CAL GPU memory, device 1 context 14 buffer 3 Uninitializing buffers for device 1 context 15 Freeing CAL GPU memory, device 1 context 15 buffer 2 Freeing CAL GPU memory, device 1 context 15 buffer 3 Uninitializing buffers for device 1 context 16 Freeing CAL GPU memory, device 1 context 16 buffer 2 Freeing CAL GPU memory, device 1 context 16 buffer 3 Uninitializing buffers for device 1 context 17 Freeing CAL GPU memory, device 1 context 17 buffer 2 Freeing CAL GPU memory, device 1 context 17 buffer 3 Uninitializing buffers for device 1 context 18 Freeing CAL GPU memory, device 1 context 18 buffer 2 Freeing CAL GPU memory, device 1 context 18 buffer 3 Uninitializing buffers for device 1 context 19 Freeing CAL GPU memory, device 1 context 19 buffer 2 Freeing CAL GPU memory, device 1 context 19 buffer 3 Uninitializing buffers for device 1 context 20 Freeing CAL GPU memory, device 1 context 20 buffer 2 Freeing CAL GPU memory, device 1 context 20 buffer 3 Uninitializing context for device 0 Uninitializing context for device 1 Uninitializing CAL runtime Trying to terminate linpack slave Waiting for linpack slave to terminate Waiting 
for merge threads to terminate linpack slave terminating rolly@rolly-X8DTG-QF:~/caldgemm$

                                                  • caldgemm with HD6990s
                                                    Marix

                                                    I think caldgemm currently requires 2 to 3 CPU cores per GPU (I would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.

                                                    This is to some extent due to the fact that we currently use Magny-Cours CPUs, which have plenty of cores.
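The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the 2 to 3 cores per GPU figure is Marix's unverified guess, the 8 cores come from the "Using 8 CPU cores" line in the log, and `max_gpus` is a hypothetical helper, not part of caldgemm. It also ignores any cores caldgemm may want for its other threads.

```python
def max_gpus(cpu_cores: int, cores_per_gpu: int) -> int:
    """How many GPUs can each get `cores_per_gpu` dedicated CPU cores."""
    if cores_per_gpu <= 0:
        raise ValueError("cores_per_gpu must be positive")
    return cpu_cores // cores_per_gpu


CPU_CORES = 8        # two quad-core E5620s, as reported in the log
GPUS_PER_HD6990 = 2  # 4 cards show up as 8 GPUs, so 2 GPUs per card

# Try both ends of the estimated range (an assumption, not a measured value).
for cores_per_gpu in (2, 3):
    gpus = max_gpus(CPU_CORES, cores_per_gpu)
    cards = gpus // GPUS_PER_HD6990
    print(f"{cores_per_gpu} cores per GPU: {gpus} GPUs, i.e. {cards} HD 6990 card(s)")
```

Under these assumptions even the optimistic case tops out at four GPUs, which is two HD 6990 cards.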

                                                      • caldgemm with HD6990s
                                                        laobrasuca

                                                        Is there any place where I can see a performance comparison between caldgemm and clAmdBlasDgemm?

                                                          • caldgemm with HD6990s
                                                            rollyng

                                                             

                                                            Originally posted by: laobrasuca is there any place where i can see a performance comparison between caldgemm and clAmdBlasDgemm ?

                                                             

                                                            Hi, I ran some tests with acmlgpu1.1.2. When I run Info.exe, it shows:

                                                             

                                                            rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./Info.exe
                                                            CPUID:
                                                            function (0) Vendor: GenuineIntel
                                                            function (1) Family-Model-Stepping: 6-44-2
                                                            Feature flags (EDX): BFEBFBFFh
                                                            Feature flags (ECX): 009EE3FDh
                                                            MMX (EDX bit 13): yes
                                                            SSE1 (EDX bit 25): yes
                                                            SSE2 (EDX bit 26): yes
                                                            SSE3 (ECX bit 0): yes
                                                            SSSE3 (ECX bit 9): yes
                                                            SSE4.1 (ECX bit 19): yes
                                                            SSE4.2 (ECX bit 20): yes
                                                            AVX (ECX bit 28): no
                                                            function (8000_0004) Processor Brand: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
                                                            > uname -a
                                                            Linux rolly-X8DTG-QF 2.6.35-28-generic #50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 x86_64 GNU/Linux
                                                            > powersave -c
                                                            sh: powersave: not found
                                                            CAL RT version: 1.4.1385
                                                            CAL CL version: 1.4.1385
                                                            gpu0:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu1:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu2:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu3:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu4:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu5:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu6:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            gpu7:
                                                            Type: CALtarget(15) (unknown type)
                                                            Revision: 1
                                                            Maximum resource 1D width: 16384
                                                            Maximum resource 2D width: 16384
                                                            Maximum resource 2D height: 16384
                                                            Local GPU RAM: 2048 megabytes
                                                            Uncached remote GPU memory: 1787 megabytes
                                                            Cached remote GPU memory: 508 megabytes
                                                            GPU device clock rate: 830 megahertz
                                                            GPU memory clock rate: 1250 megahertz
                                                            Wavefront size: 64
                                                            Number of SIMDs: 24
                                                            Number of shader engines: 2
                                                            double precision: Supported
                                                            local data share: Supported
                                                            global data share: Supported
                                                            global GPR: Supported
                                                            compute shader: Supported
                                                            memexport: Supported
                                                            calResCreate pitch alignment: 256 data elements
                                                            calResCreate address alignment: 256 bytes
                                                            Unaligned Access Views (UAVs): 12
                                                            3D program grid: Supported
                                                            GPUs found: 8

                                                              • caldgemm with HD6990s
                                                                rollyng

                                                                However, when I run time_dgemm.exe, it looks like I am hitting the same wall: it can only make use of 3 out of the 8 GPUs. How can that be when I have 32 GB of host memory?

                                                                rolly@rolly-X8DTG-QF:/opt/acmlgpu1.1.2/GPGPUexamples$ ./time_dgemm.exe
                                                                Matrix   Time in        Performance
                                                                Size     Seconds        in Megaflops
                                                                ------   ------------   ------------
                                                                ERROR: gpu3 - unable to allocate minimum cached system (GART) memory
                                                                gpu3           Total     Available   Last Request
                                                                Local:         2048 MB   196 MB      1845493760 (1760 MB) ok
                                                                Remote (NC):   1787 MB   1720 MB     0 ( 0 MB) FAILED
                                                                Remote (C):    508 MB    463 MB      5242880 ( 5 MB) FAILED
                                                                ERROR: gpu4 - unable to allocate minimum cached system (GART) memory
                                                                gpu4           Total     Available   Last Request
                                                                Local:         2048 MB   196 MB      1845493760 (1760 MB) ok
                                                                Remote (NC):   1787 MB   1720 MB     0 ( 0 MB) FAILED
                                                                Remote (C):    508 MB    463 MB      5242880 ( 5 MB) FAILED
                                                                ERROR: gpu5 - unable to allocate minimum cached system (GART) memory
                                                                gpu5           Total     Available   Last Request
                                                                Local:         2048 MB   196 MB      1845493760 (1760 MB) ok
                                                                Remote (NC):   1787 MB   1720 MB     0 ( 0 MB) FAILED
                                                                Remote (C):    508 MB    463 MB      5242880 ( 5 MB) FAILED
                                                                ERROR: gpu6 - unable to allocate minimum cached system (GART) memory
                                                                gpu6           Total     Available   Last Request
                                                                Local:         2048 MB   196 MB      1845493760 (1760 MB) ok
                                                                Remote (NC):   1787 MB   1720 MB     0 ( 0 MB) FAILED
                                                                Remote (C):    508 MB    463 MB      5242880 ( 5 MB) FAILED
                                                                ERROR: gpu7 - unable to allocate minimum cached system (GART) memory
                                                                gpu7           Total     Available   Last Request
                                                                Local:         2048 MB   196 MB      1845493760 (1760 MB) ok
                                                                Remote (NC):   1787 MB   1728 MB     0 ( 0 MB) FAILED
                                                                Remote (C):    508 MB    472 MB      5242880 ( 5 MB) FAILED
                                                                WARNING: 5 out of 8 GPUs failed to initialize; proceeding with other(s).
                                                                400      2.250818       56
                                                                600      0.045632       9467
                                                                800      0.049524       20676
                                                                1000     0.068471       29209
                                                                1200     0.086970       39737
                                                                1400     0.109446       50143
                                                                1600     0.141187       58022
                                                                1800     0.177105       65859
                                                                2000     0.206845       77352
                                                                2200     0.234911       90655
                                                                2400     0.259695       106463
                                                                2600     0.290227       121118
                                                                2800     0.331030       132628
                                                                3000     0.377459       143061
                                                                3200     0.361680       181198
                                                                3400     0.395542       198735
                                                                3600     0.431999       216000
                                                                3800     0.467440       234776
                                                                4000     0.520821       245765
                                                                4200     0.566723       261460
                                                                4400     0.618775       275331
                                                                4600     0.671366       289963
                                                                4800     0.736608       300273
                                                                5000     0.801185       312037
                                                                5200     0.888577       316479
                                                                5400     0.937255       336011
                                                                5600     1.007444       348636
                                                                5800     1.065766       366144
                                                                6000     1.155561       373844
                                                                6200     1.212021       393273
                                                                6400     1.287458       407227
                                                                6600     1.331005       431998
                                                                6800     1.375383       457228
                                                                7000     1.467186       467561
                                                                7200     1.515511       492570
                                                                7400     1.626788       498189
                                                                7600     1.744087       503387
                                                                7800     1.843866       514735
                                                                8000     1.918230       533825
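For anyone who wants to sanity-check the Megaflops column, the figures in the table above are consistent with the usual DGEMM operation count of 2·m·n·k floating-point operations. The sketch below assumes that count and square N×N matrices; `dgemm_mflops` is a hypothetical helper, not a function from ACML-GPU, and only two rows of the table are copied in.

```python
def dgemm_mflops(m: int, n: int, k: int, seconds: float) -> float:
    """MFLOPS for C = alpha*A*B + beta*C, assuming 2*m*n*k operations."""
    return 2.0 * m * n * k / seconds / 1e6


# (matrix size, seconds, reported megaflops), copied from the time_dgemm.exe output
rows = [
    (3600, 0.431999, 216000),
    (8000, 1.918230, 533825),
]

for size, seconds, reported in rows:
    computed = dgemm_mflops(size, size, size, seconds)
    print(f"N={size}: computed {computed:.1f} MFLOPS, reported {reported}")
```

The computed values agree with the reported ones to within one MFLOPS, so the benchmark appears to time the full multiplication and divide 2·N³ by the elapsed time.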

                                                                  • caldgemm with HD6990s
                                                                    laobrasuca

                                                                     

                                                                    Hi, I did some tests with acmlgpu1.1.2. When I run Info.exe, it shows

                                                                     

                                                                    Hi there, what's the difference between ACML-GPU and clAmdBlas? Is it that one uses CAL and the other OpenCL? And what about performance (at least for a single-GPU setup)?

                                                                      • caldgemm with HD6990s
                                                                        rollyng

                                                                        Hi, I think you are right. clAmdBlas needs OpenCL, but I find that there is only an sgemm example for clAmdBlas, so I may not be able to compare the dgemm performance of the two libraries.

                                                                          • caldgemm with HD6990s
                                                                            laobrasuca

                                                                            Yes, there's only the sgemm example (but be aware that this example has a typo: matrix A is written to bufB; see this post http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=150952&enterthread=y). Still, sgemm and dgemm perform exactly the same mathematical operations except for the data type (double instead of float), so I believe you can use the exact same example, changing only the types, as long as your card supports double-precision computation (like your 6970).

                                                                            If you compare performance results with caldgemm, please let us know.

                                                                              • caldgemm with HD6990s
                                                                                rollyng

                                                                                OK, let's start with acmlgpu-1.1.2, for both dgemm and sgemm, on a single HD6970.

                                                                                rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_dgemm.exe
                                                                                Matrix  Time in       Performance
                                                                                Size    Seconds       in Megaflops
                                                                                ------  ------------  ------------
                                                                                   400      0.758880          168
                                                                                   600      0.030139        14333
                                                                                   800      0.035727        28662
                                                                                  1000      0.049771        40184
                                                                                  1200      0.067966        50849
                                                                                  1400      0.068995        79541
                                                                                  1600      0.086494        94711
                                                                                  1800      0.111565       104549
                                                                                  2000      0.134369       119075
                                                                                  2200      0.161584       131795
                                                                                  2400      0.184988       149458
                                                                                  2600      0.214080       164200
                                                                                  2800      0.241470       181819
                                                                                  3000      0.288899       186916
                                                                                  3200      0.333995       196218
                                                                                  3400      0.415724       189087
                                                                                  3600      0.494598       188662
                                                                                  3800      0.565289       194137
                                                                                  4000      0.662003       193352
                                                                                  4200      0.762732       194270
                                                                                  4400      0.829545       205375
                                                                                  4600      0.866307       224714
                                                                                  4800      0.953990       231851
                                                                                  5000      1.100728       227122
                                                                                  5200      1.219832       230536
                                                                                  5400      1.368855       230066
                                                                                  5600      1.569290       223815
                                                                                  5800      1.789924       218011
                                                                                  6000      2.022892       213555
                                                                                  6200      1.990262       239494
                                                                                  6400      2.092950       250501
                                                                                  6600      2.339395       245786
                                                                                  6800      2.555408       246091
                                                                                  7000      2.725500       251696
                                                                                  7200      2.983605       250199
                                                                                  7400      3.627176       223437
                                                                                  7600      3.460915       253676
                                                                                  7800      3.715710       255430
                                                                                  8000      4.046330       253068


                                                                                rolly@rolly-p5q-pro:~/GPGPUexamples$ ./time_sgemm.exe
                                                                                Matrix  Time in       Performance
                                                                                Size    Seconds       in Megaflops
                                                                                ------  ------------  ------------
                                                                                   400      0.711834          179
                                                                                   600      0.021887        19738
                                                                                   800      0.029878        34273
                                                                                  1000      0.030939        64643
                                                                                  1200      0.035729        96728
                                                                                  1400      0.042776       128295
                                                                                  1600      0.050936       160829
                                                                                  1800      0.061199       190590
                                                                                  2000      0.071055       225176
                                                                                  2200      0.083770       254220
                                                                                  2400      0.092456       299038
                                                                                  2600      0.108912       322755
                                                                                  2800      0.122735       357714
                                                                                  3000      0.132517       407495
                                                                                  3200      0.158700       412954
                                                                                  3400      0.190118       413468
                                                                                  3600      0.211197       441824
                                                                                  3800      0.244340       449143
                                                                                  4000      0.273162       468585
                                                                                  4200      0.352204       420711
                                                                                  4400      0.388507       438519
                                                                                  4600      0.404213       481607
                                                                                  4800      0.450925       490511
                                                                                  5000      0.492675       507434
                                                                                  5200      0.560158       502030
                                                                                  5400      0.652638       482546
                                                                                  5600      0.721320       486929
                                                                                  5800      0.721891       540558
                                                                                  6000      0.828848       521205
                                                                                  6200      0.966798       493025
                                                                                  6400      0.994621       527123
                                                                                  6600      1.134357       506888
                                                                                  6800      1.240944       506762
                                                                                  7000      1.281981       535109
                                                                                  7200      1.296301       575866
                                                                                  7400      1.446126       560427
                                                                                  7600      1.502031       584510
                                                                                  7800      1.653382       574038
                                                                                  8000      1.928930       530864
                                                                                rolly@rolly-p5q-pro:~/GPGPUexamples$

                                                                                  • caldgemm with HD6990s
                                                                                    rollyng

                                                                                    Now caldgemm:

                                                                                    rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 4096 -n 4096
                                                                                    Use -? for help
                                                                                    Cannot use multiple devices without multithreading
                                                                                    Was able to allocate 21 bbuffers
                                                                                    Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                                                                                    Doing initial run... Done
                                                                                    Initializing Matrix C
                                                                                    Running Benchmark
                                                                                    Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2b58989a8010, B=0x2b589a9e9010, C=0x2b589c9fa010, (C-A=8430592, (C-B)/w=4104))
                                                                                    Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: rolly-p5q-pro) System Time 0.208 System Gflops 165.602

                                                                                    rolly@rolly-p5q-pro:~/caldgemm$ ./dgemm_bench -m 8192 -n 8192
                                                                                    Use -? for help
                                                                                    Cannot use multiple devices without multithreading
                                                                                    Was able to allocate 21 bbuffers
                                                                                    Initializing Data... ...alloc A...alloc B...alloc C...init A...init B...Done
                                                                                    Doing initial run... Done
                                                                                    Initializing Matrix C
                                                                                    Running Benchmark
                                                                                    Starting DGEMM Run m=8192 k=1024 n=8192 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x2008 LDC=0x2008 At=0 Bt=0 ColMajor=0 (A=0x2b1edc693010, B=0x2b1ee0714010, C=0x2b1ee4725010, (C-A=16851968, (C-B)/w=8200))
                                                                                    Program: caldgemm Sizes - A: 8192x1024 B: 1024x8192 C:8192x8192 (Host: rolly-p5q-pro) System Time 0.581 System Gflops 236.899
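The "System Gflops" figures above can be cross-checked by hand. The sketch below assumes the usual convention of counting 2*m*n*k floating-point operations for a DGEMM; the thread does not say which convention caldgemm or time_dgemm.exe use internally, and `gemm_gflops` is only an illustrative helper name. The sizes and times are copied from the outputs in this thread.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOP/s of C = alpha*A*B + beta*C, counting 2*m*n*k operations (assumed convention)."""
    return 2.0 * m * n * k / seconds / 1e9


# (label, m, n, k, reported time in seconds, reported GFLOP/s), copied from the thread
runs = [
    ("caldgemm 4096x1024x4096", 4096, 4096, 1024, 0.208, 165.602),
    ("caldgemm 8192x1024x8192", 8192, 8192, 1024, 0.581, 236.899),
    ("acmlgpu dgemm 8000^3", 8000, 8000, 8000, 4.046330, 253.068),
]

for label, m, n, k, seconds, reported in runs:
    computed = gemm_gflops(m, n, k, seconds)
    print(f"{label}: computed {computed:.1f} GFLOP/s, reported {reported:.1f} GFLOP/s")
```

The small differences for the caldgemm rows are what one would expect from the time being printed with only three decimals.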

                                                                                      • caldgemm with HD6990s
                                                                                        rollyng

                                                                                        What I can conclude so far:

                                                                                        (1) Only one HD6990 can run with these libraries, no matter how many more are installed?

                                                                                        (2) acml-gpu seems to have better performance?

                                                                                        Thanks for reading!

                                                                                          • caldgemm with HD6990s
                                                                                            laobrasuca

                                                                                            (1) You mean 1 HD6970, right? Well, that seems to be the case, since the performance is roughly 1/5 of the nominal teraflop figure for the better of the two libraries. Can anyone confirm this?

                                                                                            (2) There's something I didn't understand. Is the matrix size for acmlgpu-1.1.2 8000x8000? caldgemm computes the product for 8192x1024, which is a huge difference. Since the matrix size has a visible influence on the Gflops, we can hardly compare these results. Could you re-run them with comparable matrix sizes? However, I don't know whether the algorithm is optimized for square matrices or not.

                                                                                            And, if you have a little more time, could you test clAmdBlasDgemm (and maybe clAmdBlasSgemm, to compare with time_sgemm.exe)?
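To put a number on the size mismatch, here is a rough sketch that compares the amount of work in the two runs. It assumes the usual 2*m*n*k operation count for DGEMM; `gemm_flops` and `equivalent_square_size` are illustrative helpers, not part of either library.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in one DGEMM, assuming the 2*m*n*k convention."""
    return 2 * m * n * k


def equivalent_square_size(m, n, k):
    """Side length s of a square DGEMM (s x s x s) with about the same operation count."""
    return round((m * n * k) ** (1.0 / 3.0))


acml_work = gemm_flops(8000, 8000, 8000)   # last row of the time_dgemm.exe table
cal_work = gemm_flops(8192, 8192, 1024)    # caldgemm run with -m 8192 -n 8192, k=1024

print(f"acmlgpu 8000^3:          {acml_work:.3e} flops")
print(f"caldgemm 8192x1024x8192: {cal_work:.3e} flops")
print(f"ratio:                   {acml_work / cal_work:.2f}x")
print(f"square size with the same work as the caldgemm run: {equivalent_square_size(8192, 8192, 1024)}")
```

By operation count alone, the caldgemm run is closer to the 4000 and 4200 rows of the acmlgpu table than to the 8000 row. Equal work does not mean equal difficulty, though, since a k=1024 product has a different memory access pattern than a square one.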

                                                                          • caldgemm with HD6990s
                                                                            rollyng

                                                                             

                                                                            Originally posted by: Marix I think caldgemm currently requires 2 to 3 CPU-Cores per GPU (would have to check the source), so yes, on your CPUs it probably won't be able to support more than two 6990s.

                                                                             

                                                                            This is to some extent due to the fact that we currently use Magny-Cours CPUs -> plenty of cores.

                                                                             

                                                                            Hi Marix, thanks for your clarification. I have 2 E5620s in my system with Hyper-Threading enabled, so the system monitor shows 16 CPUs, which gives me 2 CPUs per Cayman GPU. Is this still insufficient for caldgemm? I believe the maximum number of CPU cores per node is 24 with Intel socket 1366 processors, which would make 3 CPUs per Cayman...
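The per-GPU arithmetic above can be written out as a small sketch. The counts are assumptions taken from this thread and the parts' published specs: two sockets, four cores per E5620, two hardware threads per core, and two Cayman GPUs on each of the four HD6990 cards (matching the "8 GPUs" in the acmlgpu warning earlier). `cpus_per_gpu` is an illustrative helper only.

```python
def cpus_per_gpu(sockets, cores_per_socket, threads_per_core, cards, gpus_per_card):
    """CPUs available per GPU; pass threads_per_core=1 to count physical cores only."""
    cpus = sockets * cores_per_socket * threads_per_core
    gpus = cards * gpus_per_card
    return cpus / gpus


# 2x E5620 with Hyper-Threading, 4x HD6990 (2 GPUs each)
print(cpus_per_gpu(2, 4, 2, 4, 2))  # 2.0 logical CPUs per GPU
# Same machine, physical cores only
print(cpus_per_gpu(2, 4, 1, 4, 2))  # 1.0 physical core per GPU
# Hypothetical 2x six-core socket 1366 CPUs with Hyper-Threading
print(cpus_per_gpu(2, 6, 2, 4, 2))  # 3.0 logical CPUs per GPU
```

Whether caldgemm's "2 to 3 CPU-Cores per GPU" refers to physical cores or hardware threads is not stated in the thread, so both counts are shown.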

                                                                    • caldgemm with HD6990s
                                                                      Gametimehero

                                                                      I am having similar memory issues.  I am running an HD5870.  I tried following the instructions given on the Wiki.

                                                                      Here is where I deviated from the instructions:

                                                                      Instead of using git, I just downloaded/unzipped the latest version from the Files Page

                                                                      -march=native didn't work, so I deleted it from the makefile so that it compiles without errors.

                                                                      I did not use the binary patch for the Catalyst driver.

                                                                       

                                                                      Here is my output from running ./dgemm_bench -g -z -v -d, as you instructed:

                                                                      ./dgemm_bench -g -z -v -d
                                                                      Use -? for help
                                                                      Init Caldgemm, setting CPU mask 1
                                                                      CAL Runtime Version:1.4.1016
                                                                      Initializing CAL
                                                                      Initializing CALDGEMM for 1 devices
                                                                      Allocating Host buffer for device 0 obuffer 0 buffer 0
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 0
                                                                      Allocating Host buffer for device 0 obuffer 0 buffer 1
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 1
                                                                      Allocating Host buffer for device 0 obuffer 0 buffer 2
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 2
                                                                      Allocating Host buffer for device 0 obuffer 0 buffer 3
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 3
                                                                      Allocating Host memory for device 0 obuffer 0 buffer 4
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 5
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 6
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 7
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 8
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 9
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 10
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 11
                                                                      Allocating device buffer for device 0 obuffer 0 buffer 12
                                                                      Allocating Host Constant buffer device 0 context 0 buffer 4
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 0 name i0
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 1 name i1
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 2 name i2
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 3 name i3
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 4 name cb0
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 5 name o0
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 6 name o1
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 7 name o2
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 8 name o3
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 9 name o4
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 10 name o5
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 11 name o6
                                                                      Getting module buffer name for device 0 context 0 kernel 0 buffer 12 name o7
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 0 name i0
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 1 name i1
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 2 name i2
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 3 name i3
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 4 name cb0
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 5 name o0
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 6 name o1
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 7 name o2
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 8 name o3
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 9 name o4
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 10 name o5
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 11 name o6
                                                                      Getting module buffer name for device 0 context 0 kernel 1 buffer 12 name o7
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 0 name i0
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 1 name i1
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 2 name i2
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 3 name i3
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 4 name cb0
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 5 name o0
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 6 name o1
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 7 name o2
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 8 name o3
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 9 name o4
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 10 name o5
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 11 name o6
                                                                      Getting module buffer name for device 0 context 0 kernel 2 buffer 12 name o7
                                                                      Merger Thread 0 started
                                                                      Merge Thread 0, setting CPU mask 2
                                                                      Allocating Host buffer for device 0 obuffer 1 buffer 0
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 0
                                                                      Allocating Host buffer for device 0 obuffer 1 buffer 1
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 1
                                                                      Allocating Host buffer for device 0 obuffer 1 buffer 2
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 2
                                                                      Allocating Host buffer for device 0 obuffer 1 buffer 3
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 3
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 5
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 6
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 7
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 8
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 9
                                                                      Allocating device buffer for device 0 obuffer 1 buffer 10
                                                                      There was an error in allocating resources and binding them to memory
                                                                      Error initializing CALDGEMM

                                                                       

                                                                      Thanks