
rj_marques
Journeyman III

Command Queue performance

Greetings,

Please consider the following scenario:

A thread pool comprised of N host threads, each responsible for an independent OpenCL execution, i.e., managing its own memory objects, kernel, etc.

Regarding command queues, which of the following is more efficient in this scenario?

- Having a single shared in-order queue, on which every thread issues the commands for its execution.

- Having a shared out-of-order queue, on which every thread issues the commands for its execution and synchronizes internally via events.

- Having N command queues, one for each thread.
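For concreteness, the third option might be sketched like this (a hedged sketch assuming a context and device already exist; `NUM_THREADS` and the function name are illustrative, and error handling is omitted):

```c
#include <CL/cl.h>

/* Sketch of option 3: one in-order command queue per host thread, all
   sharing one context. NUM_THREADS is illustrative; error checks omitted. */
#define NUM_THREADS 4

void create_per_thread_queues(cl_context ctx, cl_device_id dev,
                              cl_command_queue queues[NUM_THREADS])
{
    cl_int err;
    for (int i = 0; i < NUM_THREADS; ++i) {
        /* 0 = default properties: in-order execution, no profiling */
        queues[i] = clCreateCommandQueue(ctx, dev, 0, &err);
    }
}
```

Each host thread would then enqueue only on its own queue, so no host-side locking is needed around the enqueue calls.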

In my experience, having multiple command queues may hamper performance as the number of queues grows. On the other hand, from what I've read, AMD's OpenCL implementation does not support concurrent kernel execution.

Thanks in advance for your opinions. =D


To gain better utilization of the GPU, it is advised to run several queues concurrently. Using multiple queues concurrently may reduce GPU idle time, sometimes referred to as GPU bubbles.

Ok, I thought so too.

Another question: having multiple queues offers the possibility of queuing multiple concurrent reads from, and writes to, OpenCL device memory. Does queuing multiple data transfers increase GPU idle time? Or are the transfers done in parallel with kernel execution?


Async copies allow memory transfer operations to execute in parallel with kernel execution. On SDK 2.6 the feature is in preview mode (set the environment variable GPU_ASYNC_MEM_COPY=2 to enable it).

Queuing multiple data transfer commands does not necessarily cause GPU bubbles. Most of the time they are due to issuing blocking commands such as clFinish or a blocking clEnqueueMapBuffer.
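To illustrate the point about blocking commands, here is a hedged sketch of waiting on one specific readback with an event callback (clSetEventCallback, available since OpenCL 1.1) instead of stalling the whole queue; `read_done` and the `tag` pointer are made-up names:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: a non-blocking readback tracked with an event callback
   instead of a queue-wide clFinish. Names are illustrative. */
static void CL_CALLBACK read_done(cl_event ev, cl_int status, void *user_data)
{
    if (status == CL_COMPLETE)
        printf("readback %d finished\n", *(int *)user_data);
}

void enqueue_nonblocking_read(cl_command_queue q, cl_mem buf,
                              size_t bytes, void *host_ptr, int *tag)
{
    cl_event ev;
    /* CL_FALSE: the call returns immediately and the GPU keeps working */
    clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, &ev);
    clSetEventCallback(ev, CL_COMPLETE, read_done, tag);
    clReleaseEvent(ev);  /* runtime retains the event until it completes */
    clFlush(q);          /* submits the work without blocking, unlike clFinish */
}
```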


I've created an application where the host performs the following steps hundreds of times:

1 - Copy data to device memory (Blocking)

2 - Enqueue NDRange (Non blocking)

3 - Read from device memory (Non blocking)

4 - Notify the host, using callbacks

5 - Clean up

There are two distinct queues, one used for kernel execution and another for reads/writes.

If the host enqueues everything up front and then sits back waiting for all the results, the execution time is about 10-12 seconds.

If the host enqueues one iteration's worth of commands and waits for them to finish before starting the next iteration, the execution time is about 1.3 seconds.

Why such a difference? Are the multiple enqueues slowing down the kernel executions?

Setting the GPU_ASYNC_MEM_COPY variable did not change the execution times.

FTR, the GPU on which the execution took place is part of an HD 4850.


It is difficult to say what is wrong. If you are using Visual Studio I suggest you use AMD APP Profiler. It will draw an execution timeline and illustrate which command is executed at any given moment.

If you are not using Visual Studio you can use sprofile.exe. It is the command line back-end of AMD APP Profiler.


I've used AMD APP Profiler, which produced the following execution trace.

It seems that the creation and release of buffers is slowing down the application. In turn, this causes the clEnqueueNDRangeKernel calls to slow down as well.

The GPU is idle most of the time. This behaviour seems to show that AMD APP is not capable of dealing with this much concurrency at the API level.

I wonder if this behaviour is specific to AMD's OpenCL implementation, or whether it is also present in other vendors' implementations.

[Attached profiler trace: profile.jpg]


rj.marques,

Why are you not caching mem objects on the host side or using double buffering, and why are you releasing mem objects so often? Releasing mem objects is not supposed to be a fast operation: when you release a mem object, we have to go and destroy its associated memory on the device. Since we do lazy initialization of memory, and you deleted all the old memory with the release, we must re-allocate on every kernel launch.
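A minimal sketch of that reuse pattern, under stated assumptions (names, the float element type, and `NBUF` are illustrative; error handling omitted): buffers are created once up front, rotated across iterations, and released only at the very end.

```c
#include <CL/cl.h>

/* Sketch: create buffers once, rotate them, release at the end,
   instead of create/release on every iteration. */
enum { NBUF = 3 };

void run_iterations(cl_context ctx, cl_command_queue q, cl_kernel kern,
                    float *host_chunks[], size_t nitems, int iterations)
{
    size_t bytes = nitems * sizeof(float);
    cl_mem bufs[NBUF];
    for (int i = 0; i < NBUF; ++i)   /* allocated once, up front */
        bufs[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    for (int it = 0; it < iterations; ++it) {
        cl_mem b = bufs[it % NBUF];  /* rotate instead of re-creating */
        clEnqueueWriteBuffer(q, b, CL_TRUE, 0, bytes, host_chunks[it],
                             0, NULL, NULL);
        clSetKernelArg(kern, 0, sizeof(cl_mem), &b);
        clEnqueueNDRangeKernel(q, kern, 1, NULL, &nitems, NULL, 0, NULL, NULL);
    }
    clFinish(q);

    for (int i = 0; i < NBUF; ++i)   /* released once, at the end */
        clReleaseMemObject(bufs[i]);
}
```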

I see. From the trace I kind of guessed that releasing memory objects was very heavy, since it causes the GPU to become idle.

If I create an N-buffer scheme, where the GPU is executing on buffer j while I am loading data into the other buffers, should I only enqueue one kernel at a time, or can I enqueue N kernels? I know that kernel concurrency is not available in AMD's OpenCL implementation; I am just wondering if enqueuing multiple kernels will have a negative effect on performance.

Thanks for everyone's replies =D


You can pipeline kernel execution, buffer transfer and buffer creation. We have apps that do this and they are very efficient.

So basically what you would do is something like this:

setup buffer N & N + 1

enqueue buffer N

enqueue kernel N

setup buffer N + 2

enqueue buffer N + 1

enqueue kernel N + 1

readback buffer N

setup buffer N + 3

enqueue buffer N + 2

enqueue kernel N + 2

readback buffer N + 1

etc...
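The steps above could be sketched in OpenCL host code roughly as follows, using two in-order queues (`ioq` for transfers, `execq` for kernels) chained per chunk with events. This is an illustrative sketch, not a reference implementation; setup of the queues, kernel and buffers is assumed, the float element type and `NBUF` are arbitrary, and error handling is omitted.

```c
#include <CL/cl.h>

/* Sketch of the pipeline: transfers on ioq, kernels on execq,
   ordering enforced per chunk with events. */
enum { NBUF = 3 };

void pipeline(cl_command_queue ioq, cl_command_queue execq, cl_kernel kern,
              cl_mem bufs[NBUF], float *host_in[], float *host_out[],
              size_t nitems, int nchunks)
{
    size_t bytes = nitems * sizeof(float);

    for (int i = 0; i < nchunks; ++i) {
        int s = i % NBUF;            /* which staging buffer to use */
        cl_event up, done;

        /* upload chunk i (non-blocking) */
        clEnqueueWriteBuffer(ioq, bufs[s], CL_FALSE, 0, bytes,
                             host_in[i], 0, NULL, &up);
        /* run the kernel on chunk i once its upload has completed */
        clSetKernelArg(kern, 0, sizeof(cl_mem), &bufs[s]);
        clEnqueueNDRangeKernel(execq, kern, 1, NULL, &nitems, NULL,
                               1, &up, &done);
        /* read chunk i back once the kernel has completed; the in-order
           transfer queue also keeps later writes to bufs[s] behind this read */
        clEnqueueReadBuffer(ioq, bufs[s], CL_FALSE, 0, bytes,
                            host_out[i], 1, &done, NULL);
        clReleaseEvent(up);
        clReleaseEvent(done);
        clFlush(ioq);                /* submit without blocking the host */
        clFlush(execq);
    }
    clFinish(execq);
    clFinish(ioq);
}
```

With NBUF staging buffers in flight, the upload of chunk i+1 and the readback of chunk i-1 can overlap the kernel for chunk i, which is exactly the interleaving shown in the step list above.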


In this scenario, should I use two command queues (one for execution, one for reads/writes), or just one out-of-order queue?


Use two queues; we don't support out-of-order queues.


Hello Micah,

I have the same question as Tim, but I'm using a 7970. BTW, can the Read/Write Rect functions also be overlapped? It would help if you could provide some working code examples.

Thanks,

Shawn


Hi Micah,

I'm trying to implement exactly the same scenario. It works quite well in CUDA, but I can't make it work with an AMD GPU (5850). Probably I'm doing something wrong (e.g., not calling clFlush at the right moments). I have also heard that overlapping is automatically disabled in the profiler, is that correct? Will it also be disabled in queues created with CL_QUEUE_PROFILING_ENABLE?

Do you have any working code samples of such a pipeline? Also, with SDK 2.7, is it still required to set GPU_ASYNC_MEM_COPY to 2?

Thank you
