10 Replies Latest reply on Jul 25, 2012 10:09 PM by Wenju

    Multiple GPUs and concurrent kernel execution




      I'm trying to distribute work efficiently across two different GPUs using dynamic scheduling.

      My first attempt works, but one of the GPUs sits idle at one point for no apparent reason.


      The basic scheduling algorithm is as follows:

      Main func:

          Loop through GPUs (loop control i)

              lock work queue

              pop item off work queue

              unlock work queue

              set work item's target device to i

              call enqueuework function (work item)

          Wait for queue to become empty

          WaitForEvents(wait on all reads to complete)
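The main loop above can be sketched in C++ (the item type, GPU count, and dispatch log are placeholders of mine, not the real code; the real version would end with clWaitForEvents on the outstanding reads):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Placeholder work item: bound to a device just before dispatch.
struct Item { int id = 0; int target_device = -1; };

std::queue<Item> work;              // shared work queue
std::mutex mtx;                     // protects 'work'
std::condition_variable drained;    // signaled when the queue empties
std::vector<int> dispatch_log;     // demo stand-in for real enqueues

// Stand-in for the EnqueueWork function (buffers + kernels in real code).
void enqueue_work(const Item& w) { dispatch_log.push_back(w.id); }

void main_dispatch(int num_gpus) {
    for (int i = 0; i < num_gpus; ++i) {
        Item w;
        {
            std::lock_guard<std::mutex> lock(mtx);  // lock work queue
            if (work.empty()) break;
            w = work.front();                       // pop item off queue
            work.pop();
        }                                           // unlock work queue
        w.target_device = i;                        // set target device to i
        enqueue_work(w);
    }
    // Wait for the queue to become empty; in the full scheme the
    // completion callbacks notify 'drained' after popping the last item.
    std::unique_lock<std::mutex> lock(mtx);
    drained.wait(lock, [] { return work.empty(); });
    // real code: clWaitForEvents on all outstanding read events here
}
```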


      EnqueueWork function (work item):

          create required buffers

          enqueueWrite on buffers using writeQueue

          enqueueMarker on writeQueue

          flush write queue

          create and enqueue kernels (depends on above writeQueue marker) on execQueue

          set callback on kernel completion to RunComplete function

          flush exec queue

          enqueueRead on readQueue (depends on kernel completion)

          set callback on read complete to ReadComplete function

          flush read queue
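The write -> marker -> kernel -> read ordering relies on OpenCL events; the same chaining can be sketched with std::shared_future as a stand-in for cl_event (the uppercase "kernel" and the function names are illustrative only, not the real workload):

```cpp
#include <cctype>
#include <future>
#include <string>

// Each stage waits on the previous stage's future before running,
// the way an enqueued OpenCL command waits on its event wait list:
//   enqueueWrite -> marker -> kernel -> enqueueRead
std::string run_pipeline(const std::string& input) {
    std::string device_buf;

    std::shared_future<void> write_done =
        std::async(std::launch::async, [&] {
            device_buf = input;                  // "enqueueWrite"
        }).share();

    std::shared_future<void> kernel_done =
        std::async(std::launch::async, [&, write_done] {
            write_done.wait();                   // depends on write marker
            for (char& c : device_buf)           // "kernel"
                c = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(c)));
        }).share();

    std::string result;
    auto read_done = std::async(std::launch::async, [&, kernel_done] {
        kernel_done.wait();                      // depends on kernel event
        result = device_buf;                     // "enqueueRead"
    });
    read_done.wait();
    return result;
}
```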


      RunComplete function:

          gets an item from the work queue and calls the EnqueueWork function
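A self-contained sketch of that self-refilling callback, with a plain int standing in for the work item and a dispatch record standing in for the real enqueue:

```cpp
#include <deque>
#include <mutex>
#include <vector>

// When a kernel finishes on a device, the completion callback pops the
// next item and re-dispatches to that same device. Types here are
// illustrative stand-ins, not the real work-item structures.
struct Scheduler {
    std::deque<int> work;         // pending item ids
    std::mutex mtx;               // protects 'work'
    std::vector<int> dispatched;  // demo record of dispatches

    // analogue of the RunComplete callback registered per kernel
    void run_complete(int device) {
        int next;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (work.empty()) return;  // queue drained: device goes idle
            next = work.front();
            work.pop_front();
        }
        enqueue_work(next, device);
    }

    void enqueue_work(int item, int device) {
        dispatched.push_back(item);  // real code: buffers + kernel enqueue
        (void)device;
    }
};
```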


      ReadComplete function:

          Creates a thread to write the results to file
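A minimal sketch of that step; returning the thread (rather than detaching) so the caller can join it at shutdown is my design tweak, and the names and file format are illustrative:

```cpp
#include <fstream>
#include <string>
#include <thread>

// Sketch of the ReadComplete callback: hand the result buffer to a
// separate writer thread so the callback itself returns quickly
// (OpenCL callbacks should not block). Caller joins the thread later.
std::thread read_complete(std::string results, std::string path) {
    return std::thread([results = std::move(results),
                        path = std::move(path)] {
        std::ofstream out(path);
        out << results;  // real code: serialize the read-back buffer
    });
}
```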


      Note: All OpenCL calls are asynchronous and each device has its own set of queues.


      The picture attached is the execution profile. As you can see from the image, the Cayman device isn't doing anything for ~3s, yet the buffers were written and the kernel was enqueued at the expected time. It only started when the Tahiti device finished its kernel, yet the inverse doesn't apply (Tahiti starts new kernels while Cayman is still running). Any ideas as to why this is?