Multiple GPU OpenCL kernel execution being serialized

Discussion created by asy1502 on Jul 18, 2011
Latest reply on Jul 18, 2011 by LeeHowes

I have been working with multiple GPUs (2x Firepro 3D 7800) on Ubuntu 10.04 x86_64. I have written two simple examples: one performs a vector addition and the other creates the negative of an image.

Both run successfully, and the time reported by the queue event shows perfect speed-up over a single GPU. The problem is that I also have wall-clock timers around the kernel launch code, and those timers show no speed-up and usually a slowdown.

I began printing the queue, submit and start times of each kernel, and found that the kernel on the second GPU does not begin executing until the kernel on the first GPU has finished.

I am running the latest driver, 8.86.5.

Below is my kernel launch code:
            // Start the wall-clock timer around the whole launch-and-wait sequence.
            timers[timer_name[timer_num+1]]->start();
            #pragma omp parallel for private(i)//, schedule(static,1)
            for(i = 0; i<num_gpus; i++)
            {
                try
                {
                    // Enqueue the kernel over x by (y / num_gpus) work-items.
                    cli->err = cli->queue.enqueueNDRangeKernel(kernels, cl::NullRange, cl::NDRange(x,y/num_gpus), cl::NullRange, NULL, &event_execute);
                }
                catch (const cl::Error& er)
                {
                    printf("j = %d, num_gpus = %d, i = %d\n",j,num_gpus,i);
                    printf("ERROR: %s(%s)\n", er.what(), oclErrorString(er.err()));
                }
            }
            // Block until the enqueued work has completed.
            for(i = 0; i<num_gpus; i++)
            {
                cli->queue.finish();
            }
            timers[timer_name[timer_num+1]]->stop();

Here are my printouts from the negative image kernel:

Negative calculation on GPU # 1 of 1:
    Submit Time:        248512122.967647999525070
    Queue Time:        248512122.961299985647202
    Start Time:        248512123.145188987255096
    End Time:        248512126.960956990718842
    Minimum Time:        2.731202000000000
    Maximum Time:        3.815768000000000
    Average Time:        2.865470800000000
    Total Time:        28.654707999999996
    Count:        10

Negative calculation on GPU # 1 of 2:
    Submit Time:        248512260.716033995151520
    Queue Time:        248512260.707136988639832
    Start Time:        248512260.910378992557526
    End Time:        248512262.280068993568420
    Minimum Time:        1.368542000000000
    Maximum Time:        2.042697000000000
    Average Time:        1.528395200000000
    Total Time:        15.283951999999999
    Count:        10

Negative calculation on GPU # 2 of 2:
    Submit Time:        248512262.304941982030869
    Queue Time:        248512260.690681993961334
    Start Time:        248512262.438205987215042
    End Time:        248512263.875981003046036
    Minimum Time:        1.368398000000000
    Maximum Time:        2.052437000000000
    Average Time:        1.572541900000000
    Total Time:        15.725419000000000
    Count:        10
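
The serialization can be checked directly from the Start and End times in the two-GPU printouts above. The following is a minimal sketch, assuming the timestamps are in seconds (the difference End minus Start for the single-GPU run equals the reported 3.815768). The `Interval` type and the helper names are hypothetical and not part of the original program.

```cpp
#include <algorithm>

// Execution window of one kernel, taken from its Start and End timestamps (seconds).
struct Interval {
    double start;
    double end;
};

// Time during which both kernels were executing at once; 0 if they never overlap.
double overlap_seconds(const Interval& a, const Interval& b) {
    return std::max(0.0, std::min(a.end, b.end) - std::max(a.start, b.start));
}

// Idle gap between the end of the first kernel and the start of the second;
// 0 if the second starts before the first ends.
double gap_seconds(const Interval& first, const Interval& second) {
    return std::max(0.0, second.start - first.end);
}
```

With the values printed for the two-GPU run, `overlap_seconds({248512260.910379, 248512262.280069}, {248512262.438206, 248512263.875981})` is 0 and `gap_seconds` on the same pair is about 0.158 s, so the two kernels never ran concurrently. The Submit Time of the second kernel (248512262.304942) also falls just after the End Time of the first (248512262.280069), even though its Queue Time is earlier.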

manipulating img on 1 GPUs                          |  avg:     4.4911  |  tot:    44.9110  |  count=    10
manipulating img on 2 GPUs                          |  avg:     3.8605  |  tot:    38.6050  |  count=    10
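
The difference between the two measurements can be put in numbers by computing the speed-up from each set of averages. This is a small sketch of that arithmetic, assuming the event averages and the wall-clock averages are in the same unit; the function names are hypothetical.

```cpp
// Speed-up of a multi-GPU run relative to the single-GPU baseline.
double speedup(double baseline_time, double parallel_time) {
    return baseline_time / parallel_time;
}

// Parallel efficiency: the fraction of ideal linear scaling achieved.
double efficiency(double baseline_time, double parallel_time, int num_devices) {
    return speedup(baseline_time, parallel_time) / num_devices;
}
```

From the event averages, `speedup(2.8654708, 1.5283952)` is roughly 1.87, while the wall-clock averages give `speedup(4.4911, 3.8605)`, roughly 1.16. That corresponds to an efficiency of about 0.94 per the events but only about 0.58 per the host timers on two GPUs.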
