47 Replies Latest reply on Feb 9, 2012 3:43 PM by jeff_golds

    clEnqueueNDRangeKernel(), different function call time on CPU and GPU

    eklund.n
      Not the actual execution time, just the function call itself

      Hello!

      I get very high times for the clEnqueueNDRangeKernel() call itself when using the GPU. Look at the code below. I use CL_QUEUE_PROFILING_ENABLE on the command queue to profile 'event', and (end - start) to measure the duration of the function call.

      On CPU:
      CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
      end - start = 0.003000 ms

      On GPU:
      CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
      end - start = 4.194000 ms

      If I use big global sizes this extra time is irrelevant, and the GPU is faster than the CPU. But I want my implementation to be fast on the GPU even at small global sizes, and this large function call time prevents that. Am I doing something wrong?

      clFinish(hd->command_queue);
      gettimeofday(&start, NULL);
      status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL,
                                      &globalSize, &localSize, 0, NULL, event);
      gettimeofday(&end, NULL);
      clFinish(hd->command_queue);

        • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
          nou

          There is some overhead to launching a kernel; according to the OpenCL programming guide it is 25 vs. 225 microseconds.

          I recommend reading section 4.9.1 of http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf

          • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
            dravisher

            That does sound pretty excessive. I went ahead and did a similar measurement, and I'm clocking ~10 microseconds for both the CPU and GPU (Cypress). This is on a Windows 7 system using QueryPerformanceCounter to measure the time, which should be very accurate.

            • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
              MicahVillmow
              We have improved the kernel launch time and the improvements will show up in our next release. However, this is not a stopping point; future releases will bring further improvements.
              • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                MicahVillmow
                eklund.n,
                This is something we have been working on, but also a large portion of the valgrind errors are false positives. The reason is that valgrind doesn't handle the case where the allocation is done in user space and the deallocation is done in kernel space. Since valgrind doesn't see the deallocation, it assumes that it is lost memory.
                • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                  MicahVillmow
                  Yes, it will be improved in SDK 2.3.
                  • clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                    MicahVillmow
                    Ravi,
                    We don't specify the exact release date it will come out, but it will be in December.
                    • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                      joker-eph

                      Any update on this topic? I'm coming from the CUDA world and am trying to play around with my Radeon 6970. I'm a bit surprised by this clEnqueueNDRangeKernel issue!

                       

                      I installed the 2.6 SDK over Catalyst 11.1, and while a small saxpy kernel runs a little faster than on my Tesla C2050 (less than 1 ms) according to OpenCL event profiling, the overhead of clEnqueueNDRangeKernel is 23 ms! That's huge for an asynchronous call; it kills any chance of getting performance.

                       

                      I installed the 12.1 release of the Catalyst driver over the 2.6 SDK today, and it doesn't change anything. When I select my CPU device, the call to clEnqueueNDRangeKernel is really asynchronous and takes almost no time. On my NVIDIA platform it's around 0.1 ms.

                       

                      My configuration is:

                      - Ubuntu 11.10 64-bit
                      - AMD FX-6100
                      - 8 GB RAM
                      - AMD Radeon HD 6970 (Cayman)

                      Version information from clinfo:

                      Device OpenCL C version: OpenCL C 1.1
                      Driver version: CAL 1.4.1664
                      Version: OpenCL 1.1 AMD-APP (851.4)

                        • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                          jeff_golds

                          The issue you are likely seeing is that your commands are sitting in a queue and not being flushed.  Try the following:

                          - Enqueue your commands, saving final event in queue

                          - Call clFlush() for your command queue

                          - Then call clGetEventInfo(), checking CL_EVENT_COMMAND_EXECUTION_STATUS until the event's status changes to CL_SUBMITTED
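                          In code, that pattern might look like this (a sketch only, with error checking omitted; `queue`, `kernel`, `globalSize`, and `event` are assumed to come from the surrounding setup):

```c
/* Sketch of the flush-and-poll pattern described above.
   Error checking omitted; queue/kernel/globalSize/event come from setup. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                       0, NULL, &event);      /* save the final event    */
clFlush(queue);                               /* push work to the device */

cl_int status;
do {    /* statuses decrease: CL_QUEUED > CL_SUBMITTED > CL_RUNNING > CL_COMPLETE */
    clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
} while (status > CL_SUBMITTED);
```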

                            • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                              joker-eph

                              Hi Jeff,

                               

                              Thanks for your answer, but I don't understand how I should call clFlush() *while* my call to clEnqueueNDRangeKernel() is blocking! Do you mean I should call clFlush() *before* calling clEnqueueNDRangeKernel()?

                               

                              My pattern is: I enqueue a sequence of 1) write buffer, 2) enqueue kernel, 3) read buffer; each of these steps is followed by a clWaitForEvents(), which implies the queue is empty when I submit the next command.

                               

                              To be clear: the more-than-20 ms figure is the time of the clEnqueueNDRangeKernel() call alone, not anything related to the time between when I submit the command to the queue and when it is issued from the queue for execution.

                               

                              Thanks,

                               

                              Mehdi

                                • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                  jeff_golds

                                  clEnqueueNDRangeKernel is not a blocking call, and can't be.  Blocking calls are clWaitForEvents, clFinish, and clEnqueueReadBuffer/clEnqueueWriteBuffer/clEnqueueMapBuffer/etc. with the blocking flag set.  Calling clEnqueueNDRangeKernel just adds (enqueues) the command to the command queue.

                                    • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                      joker-eph

                                      Jeff,

                                       

                                      This is exactly what I'm complaining about! :-)

                                       

                                      clEnqueueNDRangeKernel() should be asynchronous (and, OK, *is* asynchronous), but when called on a queue mapped to a GPU device it has a heavy latency, which I loosely called "blocking".

                                       

                                      My code is :

                                       

                                      timer_start();          // gettimeofday, initialize internal counter
                                      clEnqueueNDRangeKernel(...);
                                      timer_stop_display();   // gettimeofday, compare to internal counter and display

                                       

                                      I get more than 20 ms for this single call, but only if the queue is associated with the GPU. If I choose the CPU device it takes essentially no time.

                                       

                                      If this is an unknown issue to you, then I may have a problem with my setup. I'll write a small test case to reproduce it.

                                        • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                          jeff_golds

                                          If you never flush the command queue, how can the device start processing the commands?  Without clFlush(), your commands will sit in the command queue.  AFAIK, NVIDIA doesn't support threading for the command queue, so you won't see this problem there.

                                            • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                              joker-eph

                                              I don't see your point about threading; I don't have any host threads...

                                               

                                              I don't believe I *have to* call clFlush() for execution (it works fine without it...), and I can't find any reference in the standard that says so.

                                               

                                              I'm writing a small test case to reproduce the issue outside of my framework.

                                                • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                  jeff_golds

                                                  The point with threading is that you might have multiple command queues.  If enqueuing a kernel to one command queue was blocking, then you wouldn't be able to overlap execution with work in the other command queue.

                                                   

                                                  If you want to start execution *now*, then you had better call clFlush(), else the commands will get processed at the driver's leisure.

                                                    • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                      joker-eph

                                                      I wrote a small test case to reproduce it: http://pastebin.com/fije3CKf ; save it to a C file and compile it with:

                                                       

                                                      gcc my_source_file.c -L /opt/AMDAPP/lib/x86_64/ -l OpenCL -I /opt/AMDAPP/include/ -o test_latency

                                                       

                                                      After allocating the OpenCL objects in main(), the function openclSimpleCopy() is called and always executes this sequence:

                                                       

                                                      1. Initialize 2 host buffers on the stack, "a" and "b", of size "n"
                                                      2. Allocate 2 OpenCL buffers on the device, "a_dev" and "b_dev"
                                                      3. Write host buffer "a" to the OpenCL mem object "a_dev" (synchronous write, followed by a clWaitForEvents(), so the queue is then empty)
                                                      4. Run a kernel that does a simple copy b[i] = a[i]; print the time taken just to enqueue the kernel asynchronously
                                                      5. Read from the second OpenCL mem object "b_dev" back to host "b"
                                                      6. Check that a[i] == b[i] for 0 < i < n
                                                      7. Release the OpenCL mem objects

                                                       

                                                      The interesting thing is: when called with n > 10^6, the kernel enqueue takes significant time; otherwise it's almost instantaneous.

                                                       

                                                      At the beginning of main() you can change the device id to choose (or change CL_DEVICE_TYPE_ALL to CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU).

                                                       

                                                      If I select the CPU with AMD APP 2.6:

                                                      $ ./test_latency
                                                      Run with n = 10000000
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Run with n = 100000
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0

                                                       

                                                      If I select my AMD GPU:

                                                      ~$ ./test_latency
                                                      Run with n = 10000000
                                                      Time for Enqueue : 24.7
                                                      Time for Enqueue : 15.7
                                                      Time for Enqueue : 19.2
                                                      Time for Enqueue : 18.4
                                                      Time for Enqueue : 18.3
                                                      Time for Enqueue : 19.3
                                                      Time for Enqueue : 18.3
                                                      Time for Enqueue : 18.3
                                                      Time for Enqueue : 19.2
                                                      Time for Enqueue : 18.3
                                                      Run with n = 100000
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.1
                                                      Time for Enqueue : 0.1
                                                      Time for Enqueue : 0.1
                                                      Time for Enqueue : 0.0

                                                       

                                                      On an NVIDIA 4.1 runtime:

                                                      $ ./test_latency
                                                      ./test_latency: /usr/lib/libOpenCL.so.1: no version information available (required by ./test_latency)
                                                      Run with n = 10000000
                                                      Time for Enqueue : 0.4
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Time for Enqueue : 0.2
                                                      Run with n = 100000
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0
                                                      Time for Enqueue : 0.0

                                                        • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                          nou

                                                          I added the same clEnqueueNDRangeKernel() call before your time measurement; then I got 0.0 in both cases. IMHO the OpenCL runtime transfers your buffers beforehand, so I don't see an issue here.
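                                                          Folded into the test case, that warm-up trick might look like this (a sketch; `queue`, `kernel`, `n`, `event2`, and the timer helpers are assumed to be the ones from the pastebin test case, with kernel arguments already set):

```c
/* Warm-up launch: pays any deferred-allocation/migration cost up front. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
clFinish(queue);                  /* wait for the warm-up to complete     */

timer_start();                    /* now the timed call is just a queue op */
cl_error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                                  0, NULL, &event2);
timer_stop_display("Time for Enqueue");
```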

                                                            • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                              jeff_golds

                                                              Thanks for checking that, nou, I had issues accessing the source code due to firewall restrictions.

                                                               

                                                              One thing to note is that our OpenCL implementation uses deferred allocation so we don't allocate resources until first use.  On top of that, VidMM in Windows also doesn't migrate allocations to a device until the device accesses it.  In your case, joker-eph, it seems likely you're measuring the overhead of VidMM migrating resources to the GPU.  The larger the resource, the longer it will take VidMM to migrate the resource as it first zeroes the memory due to content protection.

                                                                • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                                  joker-eph

                                                                  I don't know what VidMM stands for; Video Memory Manager? By the way, I'm on Linux.

                                                                   

                                                                  So in your implementation, when I call enqueueWriteBuffer() on a queue associated with a GPU, the runtime only keeps a copy of my data on the host, and it will actually send the data to the GPU when a kernel needs it?

                                                                   

                                                                  Then, if I understand correctly, I cannot measure the transfer of data from host to device in any way, since even if I wait on the event associated with enqueueWriteBuffer() I will only be timing the runtime's caching of the data?

                                                                   

                                                                  It also means that I cannot transfer data "in advance" for later use on the GPU?

                                                                   

                                                                  This is very interesting to know! In any case I think it should be handled by the runtime and shouldn't "block" clEnqueueNDRangeKernel() in any way. I mean, even if there is some memory to manage before launching the kernel, this is only an asynchronous call enqueuing a kernel for a future launch.

                                                                   

                                                                  Thanks !

                                                                   


                                                                    • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                                      jeff_golds

                                                                      If you use clEnqueueWriteBuffer, then it will depend on some factors.  First, if the buffer is in host memory, then it will be copied immediately (if not currently in use).  If the buffer is on the GPU, then we will schedule a copy.  If the command is blocking, then it will happen immediately, but you may have to wait for previous commands to complete.

                                                                       

                                                                      Again, I can't access your code right now, so I don't know exactly what you are doing.  But it sounds to me like you are measuring some other cost, not the cost of the clEnqueueNDRangeKernel().

                                                                        • Re: clEnqueueNDRangeKernel(), different function call time on CPU and GPU
                                                                          joker-eph

                                                                          I don't know what you mean exactly by "not the cost of the clEnqueueNDRangeKernel()".

                                                                          Anyway, here is the relevant part of the code, showing that I create the OpenCL buffers, call clEnqueueWriteBuffer() on a queue associated with the GPU (I tested both synchronous and asynchronous), then call clFlush() and clWaitForEvents(). Finally I time the call to clEnqueueNDRangeKernel().

                                                                           

                                                                           

                                                                            // Buffers on the device
                                                                            cl_mem a_dev = clCreateBuffer(context,
                                                                                                          CL_MEM_READ_WRITE,
                                                                                                          n * sizeof(int),
                                                                                                          NULL,
                                                                                                          &cl_error);
                                                                            OpenCL_test_execution("Create Buffer",cl_error);
                                                                          
                                                                            cl_mem b_dev = clCreateBuffer(context,
                                                                                                          CL_MEM_READ_WRITE,
                                                                                                          n * sizeof(int),
                                                                                                          NULL,
                                                                                                          &cl_error);
                                                                            OpenCL_test_execution("Create Buffer",cl_error);
                                                                          
                                                                            // 3 events is enough here
                                                                            cl_event event1;
                                                                            cl_event event2;
                                                                            cl_event event3;
                                                                          
                                                                            // Initialize buffer on the device
                                                                            cl_error =  clEnqueueWriteBuffer(queue,
                                                                                                 a_dev,
                                                                                                 CL_TRUE,
                                                                                                 0,
                                                                                                 n * sizeof(int),
                                                                                                 a,
                                                                                                 0,
                                                                                                 NULL,
                                                                                                 &event1);
                                                                            OpenCL_test_execution("Write to Buffer",cl_error);
                                                                          
                                                                            // Shouldn't be useful, I used a blocking write !
                                                                            clFlush(queue);
                                                                            clWaitForEvents(1,&event1); 
                                                                          
                                                                            // Arguments for the kernel
                                                                            cl_error = clSetKernelArg(kernel,0,sizeof(a_dev), &a_dev);
                                                                            OpenCL_test_execution("Set argument 0 ",cl_error);
                                                                          
                                                                            cl_error = clSetKernelArg(kernel,1,sizeof(b_dev), &b_dev);
                                                                            OpenCL_test_execution("Set argument 1",cl_error);
                                                                          
                                                                            timer_start();
                                                                            cl_error = clEnqueueNDRangeKernel(queue,
                                                                                                              kernel,
                                                                                                              1,
                                                                                                              NULL,
                                                                                                              &n,
                                                                                                              NULL,
                                                                                                              1,
                                                                                                              &event1,
                                                                                                              &event2); 
                                                                            timer_stop_display("Time for Enqueue");