richeek_arya
Journeyman III

Running multiple independent kernels on a single GPU

Hi all,

This question has been discussed on multiple forums. However, I did not get a clear picture, as people have conflicting opinions, so I am asking it again.

Suppose my current kernel is not using the GPU's resources completely, say only 10 threads are running at a time. To harness the full computational power, can I run another independent kernel alongside the current kernel on the same GPU?

My guess is that it is not possible. Somewhere in AMD's GPU literature I read that if I try to run two kernels simultaneously they will automatically be serialized. However, OpenCL lets you create multiple independent command queues that don't require any synchronization. So if I put the first kernel in the first queue and the second kernel in another queue, would both of them run simultaneously?

I have a Radeon 5450 graphics card, in case anyone is curious.

Thanks!

0 Likes
8 Replies
nou
Exemplar

From a theoretical point of view, when you enable an out-of-order queue, enqueueing two kernels without dependencies should let them run in parallel. Likewise, when you create two queues on the same device, the kernels can again execute in parallel.

But AFAIK there is currently no support for concurrent kernels on AMD GPUs.

0 Likes
HarryH
Journeyman III

Here is the pseudocode for how I did it using OpenMP. In order to run multiple instances of the same kernel you need to create multiple kernel objects in different threads, because clSetKernelArg is not thread-safe. This worked on my system (see below).

#include <omp.h>

find / select platform
create context on GPU device
find / select devices associated with context
read kernel source / create program
build program
create any memobjs to be shared RO among kernels in different threads

#pragma omp parallel
{
    create the command queue
    create the kernel objects
    create private memobjs for this thread
    set kernel arguments
    execute kernels
    release kernels
    release private memobjs
    release commandqueue
    free any per thread malloc'ed buffers etc.
}

release shared memobjs
release program
release context
free globally allocated buffers
exit

0 Likes

If you create kernels and issue them on different queues, the second kernel will not execute until the first one is done (on GPUs). You could use multiple command queues to overlap I/O, but enqueueing kernels on separate queues wouldn't help.

Linking the previous thread just in case. It's pretty recent, about 3 months old.

http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=142485&messid=1187010&parentid=...
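To put rough numbers on the difference between serialized kernels and overlapped I/O, here is a toy timing model. It is plain C, not OpenCL host code, and everything in it is an assumption made for illustration: the function names are made up, every batch is assumed to take the same upload and kernel time, kernels never overlap each other, and uploads are assumed to go into a second buffer so they never touch data a running kernel is reading.

```c
#include <stddef.h>

/* Toy timing model, NOT OpenCL code. Durations are in arbitrary units.
 * Hypothetical names; uniform per-batch costs are an assumption. */

/* Single in-order queue: each upload and each kernel runs back to back. */
double serial_time(size_t batches, double t_upload, double t_kernel)
{
    return (double)batches * (t_upload + t_kernel);
}

/* Two queues, one for uploads and one for kernels. Kernels still run one
 * at a time, but the upload for batch i+1 proceeds while the kernel for
 * batch i executes (assumes double buffering). This is a two-stage
 * pipeline, so the slower stage sets the steady-state rate. */
double overlapped_time(size_t batches, double t_upload, double t_kernel)
{
    if (batches == 0)
        return 0.0;
    double slower = t_upload > t_kernel ? t_upload : t_kernel;
    /* first upload + (n-1) steps at the slower stage's pace + last kernel */
    return t_upload + (double)(batches - 1) * slower + t_kernel;
}
```

With 4 batches, an upload cost of 2 and a kernel cost of 3, the serial schedule takes 20 units and the overlapped one takes 14. The saving comes entirely from hiding transfers; the kernels themselves are still executed one after another.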

 

0 Likes

Does the driver maintain an ordering on PCI transfers with respect to the kernels that read and write the same buffers, or is it a crapshoot? For example, suppose I have kernel A running that uses buffer B. Then I enqueue a task in the second queue to write to buffer B. Will the DMA happen alongside execution (bad), will it block until A completes, or will the copy block until ALL kernels issued before it that reference the buffer, regardless of queue, have completed? The first and second ways can give you race conditions, while the third way can give you deadlocks.

0 Likes

This doesn't happen right now with OpenCL. There are technical reasons that I can't go into. It does work fine in DX. AMD GPUs currently run multiple kernels in an uncontrollable way: they do not allow you to split the device. The device does run complicated task graphs happily in the background, because that's how graphics code works. Unfortunately, right now it only works within the restrictions of the DX API rather than the more general OpenCL API.

0 Likes

rick.weber: if you don't have an out-of-order queue then you don't get any overlap of execution and transfers. Each enqueued item completes before the next one starts; in-order queues have implicit synchronization.

In out-of-order and/or multiple queues you must ensure proper synchronization with event objects.

LeeHowes: do you mean that with DX, and maybe OpenGL, it can run concurrent shaders, such as vertex and pixel shaders concurrently? Or does this also apply to DirectCompute?

0 Likes

Originally posted by: nou

rick.weber: if you don't have an out-of-order queue then you don't get any overlap of execution and transfers. Each enqueued item completes before the next one starts; in-order queues have implicit synchronization.

In out-of-order and/or multiple queues you must ensure proper synchronization with event objects.


So out-of-order execution and multiple queues are equivalent, then, except that in the multi-queue in-order case you can impose some ordering without events. I'm contemplating adding a second queue to clUtil when you enable OOO execution mode. In this mode, one queue will be dedicated to data transfers and one to kernels, allowing you to overlap communication and execution. You would then use callbacks from the asynchronous versions of clUtil calls to manage dependencies (e.g. don't enqueue a task if it can't run immediately). If I recall correctly, AMD OpenCL doesn't currently support out-of-order queues.
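The callback idea (only enqueue a task once everything it depends on has finished) can be sketched with a simple pending-dependency counter per task. This is a host-side model in plain C under stated assumptions: none of the names below come from clUtil or OpenCL, the task ids and the fixed-size arrays are hypothetical, and the point where a real transfer or kernel would be submitted is only marked by a comment.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TASKS 8  /* sketch only: no bounds checking anywhere below */

/* Hypothetical bookkeeping; not part of clUtil or the OpenCL API. */
typedef struct {
    int pending;                /* dependencies that have not completed yet */
    int dependents[MAX_TASKS];  /* ids of tasks waiting on this one */
    size_t num_dependents;
    int enqueued;               /* 1 once the task was handed to a queue */
} Task;

typedef struct {
    Task tasks[MAX_TASKS];
    int order[MAX_TASKS];       /* ids in the order they were enqueued */
    size_t num_enqueued;
} Graph;

void graph_init(Graph *g)
{
    memset(g, 0, sizeof *g);
}

/* Record that task `after` must wait for task `before`. */
void add_dependency(Graph *g, int before, int after)
{
    Task *b = &g->tasks[before];
    b->dependents[b->num_dependents++] = after;
    g->tasks[after].pending++;
}

/* Enqueue a task only if it can run immediately. Returns 1 if enqueued. */
int try_enqueue(Graph *g, int id)
{
    Task *t = &g->tasks[id];
    if (t->pending > 0 || t->enqueued)
        return 0;
    t->enqueued = 1;
    g->order[g->num_enqueued++] = id;  /* real code would submit work here */
    return 1;
}

/* Completion callback: release every task that was waiting on `id`. */
void on_complete(Graph *g, int id)
{
    Task *t = &g->tasks[id];
    for (size_t i = 0; i < t->num_dependents; i++) {
        int d = t->dependents[i];
        if (--g->tasks[d].pending == 0)
            try_enqueue(g, d);
    }
}
```

For an upload (task 0) followed by a kernel (task 1) followed by a download (task 2), only task 0 can be enqueued up front; calling on_complete for task 0 releases the kernel, and completing the kernel releases the download. Because nothing reaches a queue before its inputs are ready, the transfer queue and the kernel queue never need to block on each other.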

0 Likes

nou,
AMD hardware has been able to run concurrent shaders since R6XX in graphics mode.
0 Likes