
Archives Discussions

uvedale
Journeyman III

Multiple GPUs and concurrent kernel execution

Hi,

I'm trying to distribute work to two different GPUs efficiently by using dynamic scheduling.

My first attempt is working, but with one of the GPUs idling at one point for no apparent reason.

The basic scheduling algorithm is as follows:

Main func:

    Loop through GPUs (loop control i)

        lock work queue

        pop item off work queue

        unlock work queue

        set work item's target device to i

        call enqueuework function (work item)

    Wait for queue to become empty

    WaitForEvents(wait on all reads to complete)
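The lock/pop/unlock steps above can be sketched as a small thread-safe queue. This is only an illustration, not the poster's actual code; the `WorkItem` type and its field names are hypothetical:

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Hypothetical stand-in for the poster's work item; field names are invented.
struct WorkItem {
    int id = 0;
    int target_device = -1;  // set by the scheduler before EnqueueWork is called
};

// Minimal thread-safe work queue matching the lock / pop / unlock steps above.
class WorkQueue {
public:
    void push(WorkItem item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(item);
    }

    // Returns std::nullopt once the queue is empty, which is how the
    // completion callbacks can tell that all work has been handed out.
    std::optional<WorkItem> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return std::nullopt;
        WorkItem item = items_.front();
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::queue<WorkItem> items_;
};
```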

EnqueueWork function (work item):

    create required buffers

    enqueueWrite on buffers using writeQueue

    enqueueMarker on writeQueue

    flush write queue

    create and enqueue kernels (depends on above writeQueue marker) on execQueue

    set callback on kernel completion to RunComplete function

    flush exec queue

    enqueueRead on readQueue (depends on kernel completion)

    set callback on read complete to ReadComplete function

    flush read queue
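For reference, those steps correspond roughly to the following OpenCL 1.1 host calls (a sketch only, not the poster's actual code; most arguments and all error handling omitted, event variable names invented):

    clEnqueueWriteBuffer(writeQueue, buf, CL_FALSE, ...)                       // async write
    clEnqueueMarker(writeQueue, &writeDone)
    clFlush(writeQueue)
    clEnqueueNDRangeKernel(execQueue, kernel, ..., 1, &writeDone, &kernelDone)
    clSetEventCallback(kernelDone, CL_COMPLETE, RunComplete, userData)
    clFlush(execQueue)
    clEnqueueReadBuffer(readQueue, buf, CL_FALSE, ..., 1, &kernelDone, &readDone)
    clSetEventCallback(readDone, CL_COMPLETE, ReadComplete, userData)
    clFlush(readQueue)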

RunComplete function:

    gets an item from the work queue and calls the EnqueueWork function

ReadComplete function:

    Creates a thread to write the results to file

Note: All OpenCL calls are asynchronous and each device has its own set of queues.
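To sanity-check the intended behaviour, here is a small host-side simulation (plain C++, no OpenCL) of two independent per-device pipelines: each "device" runs its write → kernel → read chain on its own thread with no cross-device dependencies, which is the situation in which the two GPUs should be free to overlap. In the real code the intra-device ordering comes from events rather than host sequencing; the device names below are just labels.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Host-side simulation only: each "device" runs its write -> kernel -> read
// chain on its own thread, appending to a shared log guarded by a mutex.
// There are no cross-device dependencies, so the two chains may interleave
// freely while each device's own stages stay in order.
struct PipelineSim {
    std::mutex log_mutex;
    std::vector<std::string> log;

    void run_device(const std::string& device) {
        for (const char* stage : {"write", "kernel", "read"}) {
            std::lock_guard<std::mutex> lock(log_mutex);
            log.push_back(device + ":" + stage);
        }
    }
};
```

Running `run_device("tahiti")` and `run_device("cayman")` on two threads always preserves each device's write → kernel → read order, while the two devices' entries may interleave arbitrarily.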

The attached picture is the execution profile. As you can see from the image, the Cayman device does nothing for ~3 s, even though its buffers were written and its kernel was enqueued at the expected time. It only starts once the Tahiti device finishes its kernel, yet the reverse doesn't apply (Tahiti starts new kernels while Cayman is still running). Any ideas why this is?

0 Likes
10 Replies
Wenju
Elite

Hi uvedale,

From the image, queue2 and queue5 handle data transfer, while queue0 and queue3 handle kernel execution. So I think you may have made a mistake in your program.


I don't see a problem with that?

I have 6 queues in total, 3 for each device (execution queue, read queue, and write queue).


Hi uvedale,

The host threads are concurrent. In your image, thread 1772 is waiting for thread 3088, so Cayman does nothing for 3 s. First, thread 1772 holds the CPU and Cayman starts to work; second, thread 3088 holds the CPU and Tahiti starts to work. Because the GPUs run in parallel, there is nothing wrong.


Hi Wenju,

Thanks for your input. However, I'm unconvinced that the problem is caused by host-thread blocking.

The host threads don't do all that much; they just schedule the next batch once a batch completes. That shouldn't take anywhere near 3 seconds. Furthermore, I'm using a quad-core CPU, so the threads should execute in parallel (even time-slicing on a single-core CPU should be adequate).


Yep! In theory, the kernel should execute immediately after the data transfer. Maybe the events in your code are causing this result; I'm just speculating. You could also change the execution order: run the Cayman kernel first, then Tahiti.


Try removing some of the synchronization between the queues and reprofile. Maybe you'll be able to determine whether there's a blocking event.


I'm not having much luck.

I've tried all sorts of things, such as:

- Dedicating a thread to each device for scheduling

- Using blocking calls instead of event dependencies (each device is on its own thread, so this doesn't slow things down)

- Giving each device its own context. I thought this would definitely resolve it, but alas it didn't.

The kernels are scheduled at the right time, but sometimes they just sit in the queue until the other device has finished the kernels it is busy with. I've noticed that it's always only one of the devices that is affected: one device just goes for it, while the other only runs in parallel with the first on occasion.


Could you attach the session files? They might be helpful.


Sure.

I've attached 3 sessions.

The last session is the one where I'm using separate contexts and a separate thread for each device, and writing all the data for all the runs before launching the kernels.


It's really hard to say, but I do think it's your program that is causing this result. You invoke clRetainMemObject() many times; I'm not sure whether that has an effect. Also, be careful with the events. You could try using just one device: if the kernel executions are not contiguous, then it's your program; but even if they are contiguous, we still can't rule out that the program is causing this result.
