
OpenCL

rj_marques
Journeyman III

Command Queue performance


Greetings,

Please consider the following scenario:

A thread pool comprised of N host-threads, each responsible for an independent OpenCL execution, i.e., managing its own memory objects, kernel, etc.

Regarding command queues, which of the following is more efficient in this scenario?

- Having a shared in-order queue, on which every thread issues the commands for its execution.

- Having a shared out-of-order command queue, on which every thread issues the commands for its execution and synchronizes internally via events.

- Having N command queues, one for each thread.

In my experience, having multiple command queues may hamper performance as the number of queues scales. On the other hand, from what I've read, AMD's OpenCL implementation does not support concurrent kernel execution.

Thanks in advance for your opinions. =D
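For concreteness, the second option would look something like this (just a sketch: assumes OpenCL 1.1, with `context`, `device`, `kernel`, `buf`, `size`, `gws`, `host_in`, and `host_out` already set up, and all error checking omitted):

```
/* One shared out-of-order queue; each thread chains only its own
 * commands with events, so the runtime may interleave the chains. */
cl_int err;
cl_command_queue q = clCreateCommandQueue(
    context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

cl_event wrote, ran;
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_in, 0, NULL, &wrote);
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 1, &wrote, &ran);
clEnqueueReadBuffer(q, buf, CL_FALSE, 0, size, host_out, 1, &ran, NULL);
```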

13 Replies
tzachi_cohen
Staff

Re: Command Queue performance


To gain better utilization of the GPU, it is advised to run several queues concurrently. Using multiple queues may reduce GPU idle time, sometimes referred to as "GPU bubbles".

rj_marques
Journeyman III

Re: Command Queue performance


OK, I thought so too.

Another question: having multiple queues offers the possibility of queuing multiple concurrent reads from, and writes to, OpenCL device memory. Does queuing multiple data transfers increase GPU idle time, or are the transfers done in parallel with kernel execution?

tzachi_cohen
Staff

Re: Command Queue performance


Async copies allow memory transfer operations to execute in parallel with kernel execution. In SDK 2.6 the feature is in preview mode (set the environment variable GPU_ASYNC_MEM_COPY=2 to enable it).

Queuing multiple data-transfer commands does not necessarily cause GPU bubbles. Most of the time they are caused by issuing blocking commands such as 'clFinish' or a blocking 'clEnqueueMapBuffer'.
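A sketch of the overlap, assuming the flag above is set (the names `copyq`, `execq`, `in_buf`, `next_buf`, and the host pointers are placeholders; error checking omitted):

```
/* Two in-order queues on the same device: one for transfers, one for
 * kernels, linked by an event instead of a blocking call. */
cl_event uploaded;
clEnqueueWriteBuffer(copyq, in_buf, CL_FALSE, 0, size, host_in,
                     0, NULL, &uploaded);      /* non-blocking copy */
clEnqueueNDRangeKernel(execq, kernel, 1, NULL, &gws, NULL,
                       1, &uploaded, NULL);    /* waits only on the copy */
/* While the kernel runs on execq, the next chunk can already be
 * uploading on copyq; no clFinish between iterations. */
clEnqueueWriteBuffer(copyq, next_buf, CL_FALSE, 0, size, host_next,
                     0, NULL, NULL);
clFlush(copyq);
clFlush(execq);
```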

rj_marques
Journeyman III

Re: Command Queue performance


I've created an application where the host performs the following steps hundreds of times:

1 - Copy data to device memory (Blocking)

2 - Enqueue NDRange (Non blocking)

3 - Read from device memory (Non blocking)

4 - Notify the host, using callbacks

5 - Clean up

There are two distinct queues, one used for kernel execution and another for reads/writes.

If the host enqueues everything and then sits back waiting for all results, the execution time is about 10-12 seconds.

If the host enqueues one iteration's worth of commands and waits until they finish before starting the next iteration, the execution time is about 1.3 seconds.

Why such a difference? Are the multiple enqueues slowing down the kernel executions?

Setting the GPU_ASYNC_MEM_COPY variable did not change the execution times.

FTR, the GPU on which the execution took place is part of an HD4850.

tzachi_cohen
Staff

Re: Command Queue performance


It is difficult to say what is wrong. If you are using Visual Studio I suggest you use AMD APP Profiler. It will draw an execution timeline and illustrate which command is executed at any given moment.

If you are not using Visual Studio you can use sprofile.exe. It is the command line back-end of AMD APP Profiler.

rj_marques
Journeyman III

Re: Command Queue performance


I've used AMD APP Profiler, which produced the following execution trace.

It seems that the creation and release of buffers is slowing down the application. In turn, this causes the enqueueNDRangeKernel calls to slow down as well.

The GPU is idle most of the time. This behaviour seems to show that AMD APP is not capable of dealing with too much concurrency at the API level.

I wonder if this behaviour is bound only to AMD's OpenCL implementation, or is also present in other vendors' implementations.

(attachment: profile.jpg, the APP Profiler execution trace)

MicahVillmow
Staff

Re: Command Queue performance


rj.marques,

Why are you not caching mem objects on the host side or using double buffering, and why are you releasing mem objects so often? Releasing a mem object is not supposed to be a fast operation: when you release one, we have to go and destroy its associated memory on the device. Since we initialize memory lazily, and the release destroyed all the old memory, we must re-allocate on every kernel launch.
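A sketch of the caching described above: allocate the buffers once and reuse them every iteration instead of creating and releasing them (placeholder names; assumes an in-order `queue` and error checking omitted):

```
/* Allocate once, up front. */
cl_mem bufs[2];
bufs[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
bufs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);

for (int i = 0; i < iterations; ++i) {
    cl_mem b = bufs[i % 2];                       /* double buffering */
    clEnqueueWriteBuffer(queue, b, CL_FALSE, 0, size, host_in[i],
                         0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &b);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
    /* read back the previous iteration's result here */
}

/* Release only once, at shutdown. */
clReleaseMemObject(bufs[0]);
clReleaseMemObject(bufs[1]);
```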

rj_marques
Journeyman III

Re: Command Queue performance


I see. From the trace I kinda guessed that the release of memory objects was very heavy, since it causes the GPU to become idle.

If I create an N-buffer-like scheme, where the GPU executes on buffer j while I load data into the other buffers, should I enqueue only one kernel at a time, or can I enqueue N kernels? I know that kernel concurrency is not available in AMD's OpenCL implementation; I am just wondering whether enqueuing multiple kernels will have a negative effect on performance.

Thanks for everyone's replies =D

MicahVillmow
Staff

Re: Command Queue performance


You can pipeline kernel execution, buffer transfer, and buffer setup. We have apps that do this and they are very efficient.

So basically what you would do is something like this:

setup buffer N & N + 1
enqueue buffer N
enqueue kernel N
setup buffer N + 2
enqueue buffer N + 1
enqueue kernel N + 1
readback buffer N
setup buffer N + 3
enqueue buffer N + 2
enqueue kernel N + 2
readback buffer N + 1
etc...
