13 Replies Latest reply on Nov 26, 2012 2:10 PM by shawnccx

    Command Queue performance

    rj.marques

      Greetings

       

      Please consider the following scenario:

       

      A thread pool comprised of N host threads, each responsible for an independent OpenCL execution, i.e., managing its own memory objects, kernel, etc.

      Regarding command queues, which of the following is most efficient in this scenario?

       

      - Having a shared in-order queue, on which every thread issues the commands for its execution.

       

      - Having an out-of-order shared command queue, on which every thread issues the commands for its execution, synchronizing internally via events.

       

      - Having N command queues, one for each thread.

       

      In my experience, having multiple command queues may hamper performance as the number of queues grows. On the other hand, from what I've read, AMD's OpenCL does not support concurrent kernel execution.

       

      Thanks in advance for your opinions. =D

        • Re: Command Queue performance
          tzachi.cohen

          To improve GPU utilization it is advisable to run several queues concurrently. This can reduce GPU idle time, sometimes referred to as GPU bubbles.

          1 of 1 people found this helpful
            • Re: Command Queue performance
              rj.marques

              Ok, I thought so too.

               

              Another question: having multiple queues makes it possible to queue multiple concurrent reads from, and writes to, OpenCL device memory. Does queuing multiple data transfers increase GPU idle time, or are the transfers done in parallel with kernel execution?

                • Re: Command Queue performance
                  tzachi.cohen

                  Async copies allow memory-transfer operations to execute in parallel with kernel execution. On SDK 2.6 the feature is in preview mode (set the environment variable GPU_ASYNC_MEM_COPY=2 to enable it).

                   

                  Queuing multiple data-transfer commands does not necessarily cause GPU bubbles. Most of the time bubbles come from issuing blocking commands such as 'clFinish' or a blocking 'clEnqueueMap'.

                    • Re: Command Queue performance
                      rj.marques

                      I've created an application where the host performs the following steps hundreds of times:

                       

                      1 - Copy data to device memory (blocking)

                      2 - Enqueue NDRange (non-blocking)

                      3 - Read from device memory (non-blocking)

                      4 - Notify the host, using callbacks

                      5 - Clean up

                       

                      There are two distinct queues, one used for kernel execution and another for reads/writes.

                       

                      If the host enqueues everything and then sits back waiting for all results, the execution time is about 10-12 seconds.

                      If the host enqueues one iteration's worth of commands and waits until they finish before starting the next iteration, the execution time is about 1.3 seconds.

                       

                      Why such a difference? Are the multiple enqueues slowing down the kernel executions?

                       

                      Setting the GPU_ASYNC_MEM_COPY variable did not change the execution times.

                       

                      FTR, the GPU on which the execution took place is part of an HD4850.

                • Re: Command Queue performance
                  tzachi.cohen

                  It is difficult to say what is wrong. If you are using Visual Studio I suggest you use AMD APP Profiler. It will draw an execution timeline and show which command is executing at any given moment.

                  If you are not using Visual Studio you can use sprofile.exe. It is the command line back-end of AMD APP Profiler.

                    • Re: Command Queue performance
                      rj.marques

                      I've used AMD APP Profiler, which produced the execution trace below.

                       

                      It seems that the creation and release of buffers is slowing down the application. In turn, this causes the clEnqueueNDRangeKernel calls to slow down as well.

                       

                      The GPU is idle most of the time. This behaviour seems to show that AMD APP is not capable of dealing with this much concurrency at the API level.

                       

                      I wonder if this behaviour is specific to AMD's OpenCL implementation, or whether it is also present in other vendors' implementations.

                      profile.jpg

                        • Re: Command Queue performance
                          MicahVillmow

                          rj.marques,

                          Why are you not caching the mem objects on the host side, or using double buffering? And why are you releasing mem objects so often? Releasing mem objects is not meant to be a fast operation: when you release a mem object, we have to go and destroy its associated memory on the device. Since we initialize memory lazily, and releasing the mem objects destroyed all the old memory, we must re-allocate on every kernel launch.

                          1 of 1 people found this helpful
                            • Re: Command Queue performance
                              rj.marques

                              I see. From the trace I had guessed that releasing memory objects was very heavy, since it causes the GPU to become idle.

                               

                              If I create an N-buffer scheme, where the GPU executes on buffer j while I load data into the other buffers, should I enqueue only one kernel at a time, or can I enqueue N kernels? I know that concurrent kernel execution is not available in AMD's OpenCL implementation; I am just wondering whether enqueuing multiple kernels will hurt performance.

                               

                              Thanks for everyone's replies =D