8 Replies Latest reply on Sep 1, 2014 7:01 AM by w00

    Execution of OpenCL Kernel on more than one GPU

    w00

      Hello,

       

I want to execute different instances of the same kernel on two AMD Radeon HD 6870 cards at the same time.

Every kernel instance is launched by its own instance of the host program.

I have an AMD Sempron(tm) 140 processor. It has only 1 core.

       

1 CPU core --> host program 1 --> kernel instance (GPU 1)
           --> host program 2 --> kernel instance (GPU 2)

       

The problem is that while one kernel is executing on one GPU, I can't start the other instance of the kernel on the other GPU.

The second GPU has to wait until the first kernel execution has finished.

       

Can I make this work?

Or could I execute 2 kernels on 2 GPUs with a 2-core CPU?

       

I'm running Debian Linux:

      Linux node4 2.6.32-5-amd64 #1 SMP Tue May 13 16:34:35 UTC 2014 x86_64 GNU/Linux

       

AMD graphics driver version:

      [fglrx] module loaded - fglrx 14.10.2 [May  5 2014] with 2 minors

       

NOTE: I'm connected to a shell via SSH.

       

      greets

        • Re: Execution of OpenCL Kernel on more than one GPU
          nou

          Do you run two instances of your program? Are you sure that you run them on separate devices?

            • Re: Execution of OpenCL Kernel on more than one GPU
              w00

              Hello,

               

Thanks for your answer, nou.

 

I did some tests.

               

              Idle temperatures of the cards:

              user@node4:~$ time amdconfig --adapter=all --odgt

               

              Adapter 0 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 41.00 C

               

              Adapter 1 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 36.00 C

               

              real    0m0.010s

              user    0m0.008s

              sys     0m0.000s

               

If I run the host program on GPU no. 1, I get the following temperatures:

               

              user@node4:~$ time amdconfig --adapter=all --odgt

               

              Adapter 0 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 53.00 C

               

              Adapter 1 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 37.00 C

               

              real    0m4.523s

              user    0m0.000s

              sys     0m0.008s

               

As you can see, amdconfig does not return immediately.

               

              ps aux on pts/1 prints:

              user      2287  119  3.9 215152 156824 pts/0   R+   20:05   0:02 ./main pci-e:02:00.0 1 10000000000

               

The same procedure for GPU no. 2 gives me the following:

               

              user@node4:~$ time amdconfig --adapter=all --odgt

               

              Adapter 0 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 42.00 C

               

              Adapter 1 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 50.00 C

               

              real    0m0.010s

              user    0m0.008s

              sys     0m0.000s

               

              As you can see, amdconfig returns immediately.

               

              user      2290 13.3  0.8  93004 34756 pts/2    Sl+  20:06   0:02 ./main pci-e:03:00.0 1 10000000000

               

Now, two instances of the host program running at the same time. I started the second host program a few seconds later:

               

              user      2329 24.1  0.8  92980 34752 pts/0    Sl+  20:14   0:02 ./main pci-e:02:00.0 1 10000000000

              user      2331 95.7  0.3  80876 13892 pts/2    R+   20:14   0:03 ./main pci-e:03:00.0 1 10000000000

               

              On return of the first host program:

              user      2331 48.7  0.8  93020 34752 pts/2    Sl+  20:14   0:39 ./main pci-e:03:00.0 1 10000000000

               

              The temperatures:

              user@node4:~$ time amdconfig --adapter=all --odgt

               

              Adapter 0 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 53.00 C

               

              Adapter 1 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 38.00 C

               

              real    0m35.927s

              user    0m0.000s

              sys     0m0.008s

               

It looks like the second card is not running, and the return delay of the amdconfig call is very high.

35 seconds looks like the time until the first kernel returns and the amdconfig call can be completed by the driver.

               

Now, starting both host programs at approximately the same time:

              user      2351 30.3  0.8  93028 34756 pts/0    Sl+  20:21   0:02 ./main pci-e:02:00.0 1 10000000000

              user      2352 34.5  0.8  93028 34752 pts/2    Sl+  20:21   0:02 ./main pci-e:03:00.0 1 10000000000

               

The temperatures:

              user@node4:~$ time amdconfig --adapter=all --odgt

               

              Adapter 0 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 54.00 C

               

              Adapter 1 - AMD Radeon HD 6800 Series

                          Sensor 0: Temperature - 49.00 C

               

              real    0m4.524s

              user    0m0.004s

              sys     0m0.008s

               

It looks like both cards can be used at the same time; the host program runtimes are approximately the same.

In the previous case they were not.

               

              Sorry for the long log and thanks for your help.

               

              greets

            • Re: Execution of OpenCL Kernel on more than one GPU
              maxdz8

You can enqueue (start) K kernels on D devices, even on a single core, as long as you don't call clFlush, clFinish or another synchronizing call (Map, Read, ...).

               

              If you want to get serious help, describe your program!

                • Re: Execution of OpenCL Kernel on more than one GPU
                  w00

                  Hello maxdz8,

                   

in my host program, after enqueuing the kernel, I use the following code:

                   

    err = clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0,
                              sizeof(output_data), output_data, 0, NULL, NULL);

    if (err < 0) {
        perror("Couldn't read the buffer");
        client_cleanup(EXIT_FAILURE);
    }

    clFlush(queue);
    clFinish(queue);

                   

It's a blocking read (CL_TRUE). Should I avoid this?

The queue is for one card only.

I call the host program with a variable parameter list.

At the end of the host program I release all allocated CL objects.

                   

                  greets

                    • Re: Execution of OpenCL Kernel on more than one GPU
                      maxdz8

You should really, really, really read the manual, w00. You don't get far in programming by just stacking up function calls without knowing what you're doing. Do you know what you're doing?

                       

                      That snippet contains two blocking barriers (+1 interesting call):

                      1. The read call will indeed block your thread until the buffer is copied back;
                      2. clFlush - which is usually not needed - will correctly dispatch one queue to the associated device, unfortunately...
                      3. clFinish will block your thread until GPU completes execution as per specification

Doing a Flush immediately followed by a Finish is just redundant. In the end it does not really matter, as the blocking read call is basically a Finish by itself.

                       

                      CL1.2, p121

                      If an application needs to wait for completion of a routine from the above list in a callback, please use the non-blocking form of the function, and assign a completion callback to it to do the remainder of your work. Note that when a callback (or other code) enqueues commands to a command-queue, the commands are not required to begin execution until the queue is flushed. In standard usage, blocking enqueue calls serve this role by implicitly flushing the queue. Since blocking calls are not permitted in callbacks, those callbacks that enqueue commands on a command queue should either call clFlush on the queue before returning or arrange for clFlush to be called later on another thread.

                      CL1.2 p195

                      clFlush (cl_command_queue command_queue) issues all previously queued OpenCL commands in command_queue to the device associated with command_queue. clFlush only guarantees that all queued commands to command_queue will eventually be submitted to the appropriate device. There is no guarantee that they will be complete after clFlush returns.

                      ...

                      Any blocking commands queued in a command-queue and clReleaseCommandQueue perform an implicit flush of the command-queue. These blocking commands are clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueReadImage, with blocking_read set to CL_TRUE; clEnqueueWriteBuffer, clEnqueueWriteBufferRect, clEnqueueWriteImage with blocking_write set to CL_TRUE; clEnqueueMapBuffer, clEnqueueMapImage with blocking_map set to CL_TRUE; or clWaitForEvents.

                      ...

clFinish (cl_command_queue command_queue) blocks until all previously queued OpenCL commands in command_queue are issued to the associated device and have completed. clFinish does not return until all previously queued commands in command_queue have been processed and completed. clFinish is also a synchronization point.

                      You're creating a sequential constraint across different queues, thereby causing them to work serially.

                        • Re: Execution of OpenCL Kernel on more than one GPU
                          w00

Hello maxdz8,

 

Thanks for your reply.

I read your message and I know about the blocking nature of these calls. What I don't understand is why one of these calls should block another host program from enqueueing commands on the queue of the second GPU.

Host program no. 1 only handles GPU no. 1, and host program no. 2 only handles GPU no. 2.

                           

If I reduce my host program to:

                           

                          ...

                           

    /* Create a command queue */
    queue = clCreateCommandQueue(context, device, 0, &err);

    if (err < 0) {
        perror("Couldn't create a command queue");
        client_cleanup(EXIT_FAILURE);
    }

    /* Create a kernel */
    kernel = clCreateKernel(program, KERNEL_FUNC, &err);

    if (err < 0) {
        perror("Couldn't create a kernel");
        client_cleanup(EXIT_FAILURE);
    }

    /* Create kernel arguments */
    err = clSetKernelArg(kernel, 0, sizeof(input_buffer), &input_buffer);
    err |= clSetKernelArg(kernel, 1, sizeof(output_buffer), &output_buffer);

    if (err < 0) {
        perror("Couldn't create a kernel argument");
        client_cleanup(EXIT_FAILURE);
    }

    debug("enqueueing kernel\n");

    /* Enqueue kernel */
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                 &local_size, 0, NULL, NULL);

    if (err < 0) {
        perror("Couldn't enqueue the kernel");
        client_cleanup(EXIT_FAILURE);
    }

    sleep(60);

    /* Deallocate resources */
    clReleaseKernel(kernel);
    clReleaseMemObject(output_buffer);
    clReleaseMemObject(input_buffer);
    clReleaseCommandQueue(queue);
    clReleaseProgram(program);
    clReleaseContext(context);

    client_cleanup(EXIT_SUCCESS);

                          -----------------------------------------------------------------

Now I should be able to call a clXXX function (for example clGetPlatformIDs()) from host program no. 2.

But every clXXX call from host program no. 2 blocks until the kernel from host program no. 1 has finished.

                           

At the moment I don't really know where the problem could be.

                           

                          greets

                            • Re: Execution of OpenCL Kernel on more than one GPU
                              nou

Try running on the second GPU first. When a GPU is busy with kernel execution, it can't update the display.

                               

The proper way to handle multiple GPUs from a single thread is to enqueue kernels on all GPUs, then call clFlush() on each queue, and lastly call a blocking sync function like clFinish() or clWaitForEvents(). But you are running two instances of the program, so they should run in parallel. There is a multi-GPU example in the SDK; try running that one and see how it behaves.
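That single-thread pattern can be sketched roughly as follows. This is a minimal sketch, not w00's program: the trivial "busy" kernel and all names are made up for illustration, error handling is reduced to the essentials, and the program exits quietly when no OpenCL platform or GPU is present. The key point is the call order: all enqueues first, then all flushes, and only then the blocking clFinish calls.

    /* Sketch: enqueue on all GPUs, flush all queues, block only at the end.
     * Compile with -lOpenCL. */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void busy(__global uint *out) {"
        "  uint v = 0;"
        "  for (uint i = 0; i < 1000000u; ++i) v += i;"
        "  out[get_global_id(0)] = v;"
        "}";

    int main(void) {
        cl_platform_id plat;
        cl_uint nplat = 0;
        if (clGetPlatformIDs(1, &plat, &nplat) != CL_SUCCESS || nplat == 0) {
            puts("no OpenCL platform installed");
            return 0;
        }

        cl_device_id dev[2];
        cl_uint ndev = 0;
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 2, dev, &ndev);
        if (ndev == 0) { puts("no GPU device"); return 0; }

        /* One context over all devices; one queue and one buffer per device. */
        cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, ndev, dev, NULL, NULL, NULL);
        cl_kernel krn = clCreateKernel(prog, "busy", NULL);

        cl_command_queue q[2];
        cl_mem buf[2];
        size_t global = 8192;
        for (cl_uint i = 0; i < ndev; ++i) {
            q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);
            buf[i] = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    global * sizeof(cl_uint), NULL, NULL);
            clSetKernelArg(krn, 0, sizeof(buf[i]), &buf[i]);
            clEnqueueNDRangeKernel(q[i], krn, 1, NULL, &global,
                                   NULL, 0, NULL, NULL);
        }
        for (cl_uint i = 0; i < ndev; ++i) clFlush(q[i]);   /* dispatch all first */
        for (cl_uint i = 0; i < ndev; ++i) clFinish(q[i]);  /* block only now */

        printf("ran kernel on %u GPU(s)\n", ndev);
        for (cl_uint i = 0; i < ndev; ++i) {
            clReleaseMemObject(buf[i]);
            clReleaseCommandQueue(q[i]);
        }
        clReleaseKernel(krn);
        clReleaseProgram(prog);
        clReleaseContext(ctx);
        return 0;
    }

If the clFinish calls were moved inside the first loop, the thread would block on GPU 0 before GPU 1 ever received work, which is exactly the serialization maxdz8 warned about.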

                      • Re: Execution of OpenCL Kernel on more than one GPU
                        w00

                        Hello,

                         

I found a solution to my problem.

 

The host program now processes chunks of data in a loop.

After every iteration it waits a short time to sync with the other host programs running in parallel on the single-core processor.

This way I can use both cards at the same time. Furthermore, I can now interrupt the host program and the kernel.
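The chunked loop can be sketched like this. It is a minimal sketch only: the OpenCL kernel launch and blocking read are replaced by a hypothetical process_chunk() stub, and the 100 ms pause matches the WAIT_SEC/WAIT_NSEC constants listed below.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define WAIT_SEC  0
    #define WAIT_NSEC 100000000  /* 100 ms between chunks */

    /* Hypothetical stub standing in for one kernel launch plus
     * blocking read on the inclusive range [first, last]. */
    static uint64_t process_chunk(uint64_t first, uint64_t last) {
        return last - first + 1;   /* pretend "work": count the numbers */
    }

    int main(void) {
        const uint64_t begin = 1, end = 819201337UL;   /* range from the log */
        const uint64_t chunk_size = 819200000UL;        /* PACKET_COUNT * NUM_PER_THREAD */
        struct timespec pause = { WAIT_SEC, WAIT_NSEC };
        uint64_t processed = 0;

        for (uint64_t first = begin; first <= end; ) {
            uint64_t last = (end - first >= chunk_size - 1)
                          ? first + chunk_size - 1 : end;
            processed += process_chunk(first, last);
            first = last + 1;
            /* Yield the single CPU core briefly so the second host
             * program gets a chance to enqueue on the other GPU. */
            nanosleep(&pause, NULL);
        }
        printf("processed %lu numbers\n", (unsigned long)processed);
        return 0;
    }

The sleep between chunks is also what makes the program interruptible: a signal or shutdown flag can be checked once per iteration instead of once per multi-minute kernel run.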

                         

                        Here is a link to the host program.

                         

I set the constants to get the best performance on my hardware:

                        #define PACKET_COUNT    8192

                        #define GROUP_SIZE    32

                        #define NUM_PER_THREAD    100000UL /* 64 bit for me */

                        #define CHUNK_SIZE    (PACKET_COUNT * NUM_PER_THREAD)

                        #define WAIT_SEC    0

                        #define WAIT_NSEC    100000000
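These constants determine the chunking in the execution log below. A minimal standalone check of the arithmetic (the helper function names are mine, not from the program):

    #include <stdint.h>
    #include <stdio.h>

    #define PACKET_COUNT   8192
    #define NUM_PER_THREAD 100000UL
    #define CHUNK_SIZE     (PACKET_COUNT * NUM_PER_THREAD)  /* 819200000 */

    /* Number of chunks needed to cover the inclusive range [begin, end]. */
    static uint64_t chunk_count(uint64_t begin, uint64_t end) {
        uint64_t range = end - begin + 1;
        return (range + CHUNK_SIZE - 1) / CHUNK_SIZE;   /* ceiling division */
    }

    /* Size of the final, possibly partial, chunk. */
    static uint64_t last_chunk_size(uint64_t begin, uint64_t end) {
        uint64_t rem = (end - begin + 1) % CHUNK_SIZE;
        return rem ? rem : CHUNK_SIZE;
    }

    int main(void) {
        /* Values from the execution example: range 1 .. 819201337 */
        printf("CHUNK_SIZE:  %lu\n", (unsigned long)CHUNK_SIZE);
        printf("chunk_count: %lu\n", (unsigned long)chunk_count(1, 819201337UL));
        printf("last chunk:  %lu numbers\n",
               (unsigned long)last_chunk_size(1, 819201337UL));
        return 0;
    }

For the range 1..819201337 this yields 2 chunks with 1337 numbers in the last one, matching the "last_packet: 1337" and "filling up 8191 packets with dummy data" lines in the log.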

                         

Execution example:

                        ./main pci-e:02:00.0 '1;819201337'

                        begin: 1

                        end: 819201337

                        getocldev(): opencl capable devices found: 2

                        ocldevid(): preparing for search

                        ocldevid(): cleaning up device list

                        local_size: 32

                        global_size: 8192

                        chunk_count: 2

                        NUM_PER_THREAD: 100000

                        CHUNK_SIZE: 819200000

                        current_chunk: 0

                        current_chunk.first: 1

                        current_chunk.last: 819200000

                        packet_size: 100000

                        last_packet: 100000

                        last_packet_begin: 819100001

                        last_packet_end: 819200000

                        enqueueing kernel

                        [1,100000]: success

                        current_chunk: 1

                        current_chunk.first: 819200001

                        current_chunk.last: 819201337

                        packet_size: 0

                        last_packet: 1337

                        packet_size = 0

                        filling up 8191 packets with dummy data

                        last_packet_begin: 819200001

                        last_packet_end: 819201337

                        enqueueing kernel

                         

                        greets
