3 Replies Latest reply on Jul 18, 2011 9:15 PM by LeeHowes

    Multiple GPU OpenCL kernel execution being serialized

    asy1502

      I have been working with multiple GPUs (2x Firepro 3D 7800) on Ubuntu 10.04 x86_64. I have created two simple examples: one that does a vector addition and one that creates the negative of an image.

      Both execute successfully, and the time returned from the queue event shows perfect speed-up over a single GPU. The problem is that I also have host-side timers around the kernel execute statement. From those timers I see no speed-up, and usually a slowdown.

      I began printing out the queue submit and start times. I found that the kernel on the second GPU won't begin execution until the kernel on the first GPU ends.

      I have the latest driver, 8.86.5.

      Below is my kernel launch code:
                  // Host-side timer around the launch and the wait for completion.
                  timers[timer_name[timer_num+1]]->start();
                  // One loop iteration per GPU; each launch covers y/num_gpus rows.
                  #pragma omp parallel for private(i) //, schedule(static,1)
                  for(i = 0; i<num_gpus; i++)
                  {
                      try
                      {
                          cli->err = cli->queue.enqueueNDRangeKernel(kernels, cl::NullRange, cl::NDRange(x,y/num_gpus), cl::NullRange, NULL, &event_execute);
                      }
                      catch (const cl::Error& er)
                      {
                          printf("j = %d, num_gpus = %d, i = %d\n",j,num_gpus,i);
                          printf("ERROR: %s(%s)\n", er.what(), oclErrorString(er.err()));
                      }
                  }
                  // Block until the enqueued work has finished.
                  for(i = 0; i<num_gpus; i++)
                  {
                      cli->queue.finish();
                  }
                  timers[timer_name[timer_num+1]]->stop();

      Here are my printouts from the negative image kernel:

      Negative calculation on GPU # 1 of 1:
          Submit Time:        248512122.967647999525070
          Queue Time:        248512122.961299985647202
          Start Time:        248512123.145188987255096
          End Time:        248512126.960956990718842
          Minimum Time:        2.731202000000000
          Maximum Time:        3.815768000000000
          Average Time:        2.865470800000000
          Total Time:        28.654707999999996
          Count:        10

      Negative calculation on GPU # 1 of 2:
          Submit Time:        248512260.716033995151520
          Queue Time:        248512260.707136988639832
          Start Time:        248512260.910378992557526
          End Time:        248512262.280068993568420
          Minimum Time:        1.368542000000000
          Maximum Time:        2.042697000000000
          Average Time:        1.528395200000000
          Total Time:        15.283951999999999
          Count:        10

      Negative calculation on GPU # 2 of 2:
          Submit Time:        248512262.304941982030869
          Queue Time:        248512260.690681993961334
          Start Time:        248512262.438205987215042
          End Time:        248512263.875981003046036
          Minimum Time:        1.368398000000000
          Maximum Time:        2.052437000000000
          Average Time:        1.572541900000000
          Total Time:        15.725419000000000
          Count:        10
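
      To make the serialization explicit, here is a minimal C++ sketch that checks the two-GPU timestamps above for overlap. The struct and function names (`KernelTimes`, `overlap`, `submit_delay`) are made up for this illustration and are not part of my application, and the values are in whatever units the printout uses.

```cpp
#include <algorithm>

// Profiling timestamps for one kernel launch, copied from the printout.
// Names are illustrative only; units are those of the printout.
struct KernelTimes {
    double queued;
    double submitted;
    double started;
    double ended;
};

// Timestamps printed for each GPU in the 2-GPU case.
const KernelTimes gpu1{248512260.707136988639832, 248512260.716033995151520,
                       248512260.910378992557526, 248512262.280068993568420};
const KernelTimes gpu2{248512260.690681993961334, 248512262.304941982030869,
                       248512262.438205987215042, 248512263.875981003046036};

// Length of the window in which both kernels were executing (0 if none).
double overlap(const KernelTimes& a, const KernelTimes& b) {
    const double begin = std::max(a.started, b.started);
    const double end = std::min(a.ended, b.ended);
    return std::max(0.0, end - begin);
}

// Delay between the host enqueueing the command and its submission to the device.
double submit_delay(const KernelTimes& k) {
    return k.submitted - k.queued;
}
```

      With these numbers, `overlap(gpu1, gpu2)` is zero: the second kernel starts after the first one ends. Both commands were queued at almost the same time, but `submit_delay(gpu2)` is roughly 1.6 while `submit_delay(gpu1)` is under 0.01, and the second submit timestamp is later than the first kernel's end timestamp.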

       

      manipulating img on 1 GPUs                          |  avg:     4.4911  |  tot:    44.9110  |  count=    10
      manipulating img on 2 GPUs                          |  avg:     3.8605  |  tot:    38.6050  |  count=    10
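
      As a rough check on the averages above, this C++ sketch computes the speed-up three ways: from the event times assuming the two GPUs run concurrently, from the event times assuming they run one after the other, and from the host timers. The function and variable names are hypothetical, and treating the concurrent two-GPU time as the slower device's average is my own assumption.

```cpp
#include <algorithm>

// Speed-up of a multi-GPU run relative to the single-GPU run.
double speedup(double single_gpu_time, double multi_gpu_time) {
    return single_gpu_time / multi_gpu_time;
}

// Average event times from the printouts above.
const double avg_single = 2.8654708;
const double avg_gpu1_of_2 = 1.5283952;
const double avg_gpu2_of_2 = 1.5725419;

// If the two kernels ran concurrently, the slower GPU would set the total time.
const double event_speedup =
    speedup(avg_single, std::max(avg_gpu1_of_2, avg_gpu2_of_2));

// If the two kernels ran back to back, their times would add up.
const double serialized_speedup =
    speedup(avg_single, avg_gpu1_of_2 + avg_gpu2_of_2);

// Host-side timer averages for 1 GPU and 2 GPUs.
const double host_speedup = speedup(4.4911, 3.8605);
```

      The event times alone give about 1.82x if the kernels overlap and about 0.92x if they run back to back. The host timers for this particular run give about 1.16x, which is far from the 1.82x the event times suggest.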