
    Serial execution on multiple devices

    eklund.n
      Doing half the work on 2 GPUs takes as long as all work on 1 GPU

      Hi.

      I know this has come up before (http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=138807), but I haven't seen whether a solution was ever posted.

      I have split the input array over both devices, so each card does only half the work compared to when I used just 1 GPU (roughly as in the sketch below). The event profiling info also confirms that each card works for only half the time.
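
      Roughly, the split looks like this. This is only a sketch: hostInput, input, output, N, context and the float element type are stand-ins for my actual names, and it assumes the usual declarations from the full program (#include <CL/cl.h> etc.):

          size_t half = N / 2;              /* assuming N is even */
          globalSize[0] = half;
          globalSize[1] = N - half;

          for (int i = 0; i < 2; ++i) {
              /* Each device gets its own buffers covering one half of the input. */
              input[i]  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                         globalSize[i] * sizeof(float), NULL, &status);
              output[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                         globalSize[i] * sizeof(float), NULL, &status);
              status = clEnqueueWriteBuffer(command_queue[i], input[i], CL_FALSE, 0,
                                            globalSize[i] * sizeof(float),
                                            hostInput + i * half, 0, NULL, NULL);
              clSetKernelArg(kernel[i], 0, sizeof(cl_mem), &input[i]);
              clSetKernelArg(kernel[i], 1, sizeof(cl_mem), &output[i]);
          }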

      But the total host-side time for running the kernels on the 2 devices isn't lowered; in fact, it's marginally longer. See the attached code. For a certain input size the time is reported as 330 ms; if I remove all lines regarding command_queue[1]/device[1]/kernel[1], it drops to around 160 ms.

      I have tried both "1 context - 2 command_queues" and "2 contexts - 2 command_queues" (sketched below) with practically identical results. Am I doing something wrong, or is there a fix on the way?
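
      The two setups look roughly like this (again a sketch, not my exact code; devices, contexts and status are stand-in names, and error checking is omitted):

          /* Variant A: one context spanning both devices, one queue per device. */
          context = clCreateContext(NULL, 2, devices, NULL, NULL, &status);
          command_queue[0] = clCreateCommandQueue(context, devices[0],
                                                  CL_QUEUE_PROFILING_ENABLE, &status);
          command_queue[1] = clCreateCommandQueue(context, devices[1],
                                                  CL_QUEUE_PROFILING_ENABLE, &status);

          /* Variant B: one context per device, one queue per context. */
          for (int i = 0; i < 2; ++i) {
              contexts[i] = clCreateContext(NULL, 1, &devices[i], NULL, NULL, &status);
              command_queue[i] = clCreateCommandQueue(contexts[i], devices[i],
                                                      CL_QUEUE_PROFILING_ENABLE, &status);
          }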

      Kind regards, Eklund

      SYSTEM:
      Ubuntu 10.04.1 64 bit, SDK 2.2, Catalyst 10.9
      i7 950, 2x HD5870 


      ...
      /* Make sure both queues are idle before timing. */
      clFlush(command_queue[0]); clFlush(command_queue[1]);
      clFinish(command_queue[0]); clFinish(command_queue[1]);

      clock_gettime(CLOCK_REALTIME, &start);
      status = clEnqueueNDRangeKernel(command_queue[0], kernel[0], 1, NULL,
                                      &globalSize[0], &localSize, 0, NULL, &event[0]);
      status = clEnqueueNDRangeKernel(command_queue[1], kernel[1], 1, NULL,
                                      &globalSize[1], &localSize, 0, NULL, &event[1]);
      clFlush(command_queue[0]); clFlush(command_queue[1]);
      clFinish(command_queue[0]); clFinish(command_queue[1]);
      clock_gettime(CLOCK_REALTIME, &end);

      double time = (end.tv_sec - start.tv_sec) * 1000.0
                  + (end.tv_nsec - start.tv_nsec) / 1000000.0;
      printf("%f\n", time);
      ...
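
      In case it matters, this is roughly how the per-device times mentioned above can be read from the events; it's the standard clGetEventProfilingInfo route, and it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE (t_start/t_end are just local names):

          cl_ulong t_start, t_end;
          for (int i = 0; i < 2; ++i) {
              clGetEventProfilingInfo(event[i], CL_PROFILING_COMMAND_START,
                                      sizeof(cl_ulong), &t_start, NULL);
              clGetEventProfilingInfo(event[i], CL_PROFILING_COMMAND_END,
                                      sizeof(cl_ulong), &t_end, NULL);
              /* Profiling timestamps are in nanoseconds. */
              printf("device %d kernel time: %f ms\n", i, (t_end - t_start) / 1000000.0);
          }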