To improve utilization of the GPU, it is advised to run several queues concurrently. This may reduce GPU idle time, sometimes referred to as GPU bubbles.
OK, I thought so too.
Another question: having multiple queues makes it possible to queue multiple concurrent reads from, and writes to, OpenCL device memory. Does queuing multiple data transfers increase GPU idle time, or are the transfers done in parallel with kernel execution?
Async copies allow memory transfer operations to execute in parallel with kernel execution. In SDK 2.6 the feature is in preview mode (set the environment variable GPU_ASYNC_MEM_COPY=2 to enable it).
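For example, on Linux the preview feature mentioned above could be enabled from the shell before launching the application:

```shell
# Enable the async-copy preview in APP SDK 2.6 (per the post above);
# must be set in the environment the OpenCL app is launched from.
export GPU_ASYNC_MEM_COPY=2
```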
Queuing multiple data transfer commands does not necessarily cause GPU bubbles. Most of the time bubbles are due to issuing blocking commands such as clFinish or a blocking clEnqueueMap.
I've created an application where the host performs the following steps hundreds of times:
1 - Copy data to device memory (blocking)
2 - Enqueue NDRange (non-blocking)
3 - Read from device memory (non-blocking)
4 - Notify the host, using callbacks
5 - Clean up
There are two distinct queues, one used for kernel execution and another for reads/writes.
If the host enqueues everything and then sits back waiting for all results, the execution time is about 10-12 seconds.
If the host enqueues one iteration's worth of commands and waits until they finish before starting the next iteration, the execution time is about 1.3 seconds.
Why such a difference? Are the multiple enqueues slowing down the kernel executions?
Setting the GPU_ASYNC_MEM_COPY variable did not change the execution times.
FTR, the GPU on which the execution took place is part of an HD 4850.
It is difficult to say what is wrong. If you are using Visual Studio, I suggest you use the AMD APP Profiler. It will draw an execution timeline and show which command is executing at any given moment.
If you are not using Visual Studio, you can use sprofile.exe, the command-line back-end of the AMD APP Profiler.
I've used the AMD APP Profiler, which produced the following execution trace.
It seems that the creation and release of buffers is slowing down the application. In turn, this causes the clEnqueueNDRangeKernel calls to slow down as well.
The GPU is idle most of the time. This behaviour seems to show that AMD APP is not capable of dealing with too much concurrency at the API level.
I wonder whether this behaviour is specific to AMD's OpenCL implementation, or also present in other vendors' implementations.
Why are you not caching mem objects on the host side or using double buffering, and why are you releasing mem objects so often? Releasing a mem object is not supposed to be a fast operation: when you release a mem object, we have to go and destroy its associated memory on the device. Since we do lazy initialization of memory, and you deleted all the old memory with the release, we must re-allocate on every kernel launch.
I see. From the trace I kinda guessed that the release of memory objects was very heavy, since it causes the GPU to become idle.
If I create an N-buffer-like scheme, where the GPU executes on buffer j while I load data into the other buffers, should I enqueue only one kernel at a time, or can I enqueue N kernels? I know kernel concurrency is not available in AMD's OpenCL implementation; I am just wondering whether enqueuing multiple kernels will have a negative effect on performance.
Thanks for everyone's replies =D
You can pipeline kernel execution, data transfer, and kernel creation. We have apps that do this and are very efficient.
So basically what you would do is something like this:
setup buffer N & N + 1
enqueue buffer N
enqueue kernel N
setup buffer N + 2
enqueue buffer N + 1
enqueue kernel N + 1
readback buffer N
setup buffer N + 3
enqueue buffer N + 2
enqueue kernel N + 2
readback buffer N + 1
In this scenario, should I use two command queues (an execution queue and a read/write queue), or just one out-of-order queue?
Use two queues; we don't support out-of-order queues.
I have the same question as Tim, but I'm using a 7970. BTW, can the Read/Write Rect functions also be overlapped? It would help if you could provide some working code examples.
I'm trying to implement exactly the same scenario. It works quite well in CUDA, but I can't make it work with an AMD GPU (5850). Probably I'm doing something wrong (e.g. not calling clFlush at the right moments). I have also heard that overlapping is automatically disabled in the profiler; is that correct? Will it also be disabled in queues created with CL_QUEUE_PROFILING_ENABLE?
Do you have any working code samples of such a pipeline? Also, with SDK 2.7, is it still required to set GPU_ASYNC_MEM_COPY to 2?