I'm trying to distribute work to two different GPUs efficiently using dynamic scheduling.
My first attempt works, but one of the GPUs idles at one point for no apparent reason.
The basic scheduling algorithm is as follows:
Loop through GPUs (loop control i):
    lock work queue
    pop item off work queue
    unlock work queue
    set work item's target device to i
    call EnqueueWork function (work item)
Wait for queue to become empty
WaitForEvents (wait on all reads to complete)
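The scheduling loop above can be sketched in host code roughly as follows. This is a minimal model, not the real OpenCL dispatch path: `enqueue_work` is a hypothetical stub standing in for the buffer/kernel/read enqueue sequence, and work items are plain integers.

```python
import threading
from collections import deque

work_queue = deque(range(6))          # six work items
queue_lock = threading.Lock()
dispatched = []                       # (item, device) pairs, for inspection

NUM_DEVICES = 2                       # e.g. Tahiti = 0, Cayman = 1

def enqueue_work(item, device):
    # Stand-in for the real EnqueueWork: create buffers, enqueue the
    # writes/kernel/read on the device's own command queues (all async).
    dispatched.append((item, device))

def schedule():
    while True:
        for device in range(NUM_DEVICES):   # loop through GPUs (loop control i)
            with queue_lock:                # lock work queue
                if not work_queue:
                    return                  # queue empty: stop scheduling
                item = work_queue.popleft() # pop item off work queue
            # work item's target device is i; dispatch asynchronously
            enqueue_work(item, device)

schedule()
print(dispatched)  # items alternate between the two devices round-robin
```

Because the pop happens under the lock but the (asynchronous) dispatch happens outside it, neither device's dispatch can block the other's access to the queue.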
EnqueueWork function (work item):
    create required buffers
    enqueueWrite on buffers using writeQueue
    enqueueMarker on writeQueue
    flush write queue
    create and enqueue kernels (depend on the above writeQueue marker) on execQueue
    set callback on kernel completion to RunComplete function
    flush exec queue
    enqueueRead on readQueue (depends on kernel completion)
    set callback on read completion to ReadComplete function
    flush read queue
RunComplete function:
    gets an item from the work queue and calls the EnqueueWork function
ReadComplete function:
    creates a thread to write the results to file
Note: All OpenCL calls are asynchronous and each device has its own set of queues.
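The write → kernel → read dependency chain in EnqueueWork can be modeled in the abstract like this. `threading.Event` stands in for `cl_event`, the three threads stand in for the three flushed command queues, and the names are illustrative only, not the real OpenCL API:

```python
import threading

completion_order = []                 # records which stage finished when
order_lock = threading.Lock()

def record(stage):
    with order_lock:
        completion_order.append(stage)

def enqueue_work():
    write_done = threading.Event()    # marker on the write queue
    kernel_done = threading.Event()   # kernel-completion event
    read_done = threading.Event()     # read-completion event

    def write_stage():                # enqueueWrite + marker on writeQueue
        record("write")
        write_done.set()

    def kernel_stage():               # kernel depends on the write marker
        write_done.wait()
        record("kernel")
        kernel_done.set()             # would fire the RunComplete callback

    def read_stage():                 # read depends on kernel completion
        kernel_done.wait()
        record("read")
        read_done.set()               # would fire the ReadComplete callback

    # Each "queue" runs independently (as after a flush); ordering is
    # enforced only by the event dependencies, not by submission order.
    for stage in (read_stage, kernel_stage, write_stage):
        threading.Thread(target=stage).start()
    read_done.wait()

enqueue_work()
```

Even though the stages are started in reverse order, the events force write → kernel → read, which is exactly the guarantee the marker and event dependencies are meant to provide per device.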
The picture attached is the execution profile. As you can see from the image, the Cayman device isn't doing anything for ~3s, yet the buffers were written and the kernel was enqueued at the expected time. It only started when the Tahiti device finished its kernel, yet the inverse doesn't apply (Tahiti starts new kernels while Cayman is still running). Any ideas as to why this is?
From the image, queue2 and queue5 handle data transfer, while queue0 and queue3 handle kernel execution. So I suspect there may be a mistake in your program.
The host threads are concurrent. In your image, thread 1772 is waiting for thread 3088, which is why Cayman doesn't do anything for ~3s. First, thread 1772 holds the CPU and Cayman starts to work; second, thread 3088 holds the CPU and Tahiti starts to work. Because the GPUs run in parallel, there is nothing wrong.
Thanks for your input. I am however unconvinced that the problem is being caused by host thread blocking.
The host threads don't do all that much; they just schedule the next batch once a batch completes, which shouldn't take anywhere near 3 seconds. Furthermore, I'm using a quad-core CPU, so the threads should execute in parallel (even time-slicing on a single-core CPU should be adequate).
Yep! In theory, the kernel should execute immediately after the data transfer. Maybe it's the events in your code that caused this result; just speculating. You could change the execution order so the Cayman kernel executes first, then Tahiti.
I'm not having much luck.
I've tried all sorts of things, such as:
- Dedicating a thread to each device for scheduling
- Using blocking calls instead of event dependencies (each device on its own thread, so this doesn't slow it down)
- Giving each device its own context. I thought this would definitely resolve it, but alas not.
The kernels are scheduled at the right time, but sometimes they just sit in the queue until the other device has finished the kernels it is busy with. I've noticed that it's always only one of the devices that's affected: one device just goes for it, while the other only occasionally runs in parallel with the first.
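For reference, the "dedicated thread per device" variant from the list above can be sketched like this. Again a toy model: the blocking dispatch is a stand-in for a blocking write/kernel/read sequence on the device's own queues, and `queue.Queue` provides the locking that the explicit mutex did before.

```python
import threading
from queue import Queue, Empty

work = Queue()                        # shared work queue (internally locked)
for item in range(8):
    work.put(item)

results = {0: [], 1: []}              # per-device record, for inspection
results_lock = threading.Lock()

def device_scheduler(device):
    # One host thread per device: each pulls items independently, so a
    # slow device cannot stall the other device's scheduling.
    while True:
        try:
            item = work.get_nowait()
        except Empty:
            return                    # queue empty: this device is done
        # Blocking dispatch stand-in: on a real device this would be a
        # blocking write/kernel/read sequence on that device's queues.
        with results_lock:
            results[device].append(item)

threads = [threading.Thread(target=device_scheduler, args=(d,)) for d in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every item is processed exactly once across the two devices; how many each device gets depends on relative speed rather than a fixed round-robin split.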