I'm trying to distribute work to two different GPUs efficiently by using dynamic scheduling.
My first attempt works, but one of the GPUs sits idle at one point for no apparent reason.
The basic scheduling algorithm is as follows:
Loop through GPUs (loop control i)
lock work queue
pop item off work queue
unlock work queue
set work item's target device to i
call enqueuework function (work item)
Wait for queue to become empty
WaitForEvents(wait on all reads to complete)
EnqueueWork function (work item):
create required buffers
enqueueWrite on buffers using writeQueue
enqueueMarker on writeQueue
flush write queue
create and enqueue kernels (depends on above writeQueue marker) on execQueue
set callback on kernel completion to RunComplete function
flush exec queue
enqueueRead on readQueue (depends on kernel completion)
set callback on read complete to ReadComplete function
flush read queue
RunComplete function (kernel-completion callback):
gets an item from the work queue and calls the EnqueueWork function
ReadComplete function (read-completion callback):
creates a thread to write the results to file
Note: All OpenCL calls are asynchronous and each device has its own set of queues.
The attached picture is the execution profile. As you can see from the image, the Cayman device does nothing for ~3 s, even though its buffers were written and its kernel was enqueued at the expected time. It only starts once the Tahiti device finishes its kernel, yet the inverse doesn't apply (Tahiti starts new kernels while Cayman is still running). Any ideas as to why this is?