
eklund_n
Journeyman III

clEnqueueNDRangeKernel(), different function call time on CPU and GPU

not the actual execution time, just the function call

Hello!

I get very high times on the clEnqueueNDRangeKernel() call when using the GPU. Look at the code below. I use CL_QUEUE_PROFILING_ENABLE on the command queue to do profiling on 'event', and (end - start) to get the time of the function call.

On CPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
end - start = 0.003000 ms

On GPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
end - start = 4.194000 ms

If I use big global sizes this extra time is irrelevant, and the GPU is faster than the CPU. But I want my implementation to be fast on the GPU even at small global sizes, and this big function-call time prevents that. Am I doing something wrong?

clFinish(hd->command_queue);
gettimeofday(&start, NULL);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL, &globalSize, &localSize, 0, NULL, event);
gettimeofday(&end, NULL);
clFinish(hd->command_queue);
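For reference, the profiling deltas quoted above would be read from the kernel's event with clGetEventProfilingInfo, roughly like this (a sketch only; it assumes 'event' is the cl_event* passed to the enqueue, and that the timestamps are read after the kernel has finished, e.g. after the trailing clFinish()):

```c
cl_ulong t_queued = 0, t_end = 0;  /* profiling timestamps, in nanoseconds */
clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_QUEUED,
                        sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
printf("QUEUED -> END: %f ms\n", (t_end - t_queued) / 1e6);
```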

0 Likes
47 Replies
nou
Exemplar

There is some overhead to launching a kernel. According to the OpenCL programming guide it is 25 vs 225 microseconds.

I recommend reading section 4.9.1: http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf


Sure, there is overhead to launching a kernel. But over 4 ms just to return from clEnqueueNDRangeKernel()?! That's longer than some execution times.

I've read the entire Programming Guide previously.


I can list the system:

i7 950
Gigabyte GA-X58A-UD3R
6 GB RAM
HD5870 (x2) 

dravisher
Journeyman III

That does sound pretty excessive. I went ahead and did a similar measurement, and I'm clocking ~10 microseconds for both the CPU and GPU (Cypress). This is on a Windows 7 system using QueryPerformanceCounter to measure the time, which should be very accurate.


Exactly. I'm running Ubuntu 10.04.1 x86_64 with Stream SDK 2.2 and Catalyst 10.9 hotfix.

I can't understand what causes this. Anyone?


eklund.n,

I think you are getting reasonable time values from the clGetEventProfilingInfo API.

It might be that the gettimeofday() call itself is taking considerable time. The time values returned by the OpenCL API seem more trustworthy, as they are produced specifically for measuring these intervals.


OK, it's a fair thought that the gettimeofday() function takes time, but it doesn't anywhere else, so I think gettimeofday() reports the right time. And if I enqueue larger work that takes up to 500 ms, clEnqueueNDRangeKernel() still returns after some 4 ms, so the time from the profiling info isn't really related. I will nonetheless test with other external timing functions.

I will also try a system with an HD5850, running both Windows 7 and Linux, to see if it's a setup problem and not my code. Thanks for minding my troubles.


I have the opposite experience: gettimeofday() returns about 1-5% lower times than OpenCL profiling.


Now I have tested a counter with higher resolution, but I still get > 4 ms call time on the GPU. The call on the CPU takes 0.003858 ms.

Will test on another system soon.

Edit: I have run this on ubuntu10.04x86_64/nVidia-9400/e8500/nvidia-cuda-sdk, where the function call returned after 0.0044 ms for the GPU. Something is wrong with my original system setup, but what?

Edit 2: I have tested on ubuntu10.04x86_64/hd5850/i5-750/ati-stream-sdk, where the function call returns after 2.712602 ms for the GPU and 0.009748 ms for the CPU.

...
struct timespec start, end;
double elapsedTime;
...
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &event);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
...
elapsedTime = (end.tv_sec - start.tv_sec) * 1000.0 + (end.tv_nsec - start.tv_nsec) / 1000000.0;
printf("The EnqueueNDRangeKernel call took %f ms\n", elapsedTime);
...


I can't draw any conclusions other than that it is a slow SDK implementation for GPU on Linux from ATI, since that is where the time differs. (ATI GPU Windows ~10 µs according to dravisher, ATI CPU Linux ~5 µs, nVidia GPU Linux ~5 µs, but ATI GPU Linux ~5 ms.)

Can someone else confirm? Developers?


I have this problem too.

My system is an X58 ASUS SuperComputer, i7, 12 GB DDR3 1333 MHz, 2x ATI HD5870, NVIDIA GTX470, OpenSUSE 11.2 x64 2.6.31.5, ATI driver 10.10, Stream SDK 2.2. The ATI GPU on Linux takes a really long time just to execute clEnqueueNDRangeKernel.


We have improved the kernel launch time and the improvements will show up in our next release. However, this is not a stopping point and future releases will have more improvements.

Thanks, I hope we will see the next release soon.


Thanks for clearing this up. Good job!

Btw, the nVidia SDK gave no Valgrind errors, whereas the ATI SDK showed 1,000,000+ errors. I guess this will also improve as the ATI SDK matures.


I'm working on an implementation of Genetic Programming in OpenCL [1]. I've observed that the kernel launch time on the GPU is far too slow (~2500x slower!) under the current AMD SDK. Below are the differences I'm getting compared to Nvidia's OpenCL implementation:

- Machine 1: AMD SDK v2.2, fglrx 10.11 driver, ATI Radeon HD5750, GNU/Linux 2.6.36 64-bit, Intel Xeon W3680:

  Average kernel launch time (GPU): 0.0426978s (~42ms)
  Average kernel launch time (CPU): 0.00106843s (~1ms)

- Machine 2: Nvidia GTX 285, GNU/Linux 2.6.31 64-bit, Intel Core i7 920

  Average kernel launch time (GPU): 0.0000179311s (~0.018ms)

* These measures were obtained via "getProfilingInfo" as described in "ATI Programming Guide - OpenCL".

Will the upcoming AMD SDK (v2.3) fix, or at least greatly improve, the kernel launch time?


1. http://sourceforge.net/projects/gpocl/


I can enqueue ~10000 kernel launches per second (though with a very simple kernel on a small dataset) on my box. Try enqueuing multiple actions and then calling clFinish().
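nou's batching suggestion, as a hedged sketch (it assumes 'queue', 'kernel', 'globalSize', and 'localSize' are set up as in the earlier snippets): the point is to pay the synchronization cost once per batch instead of once per launch.

```c
/* Sketch: enqueue many launches back to back, then synchronize once,
 * so per-call overhead is amortized over the whole batch. */
cl_int status = CL_SUCCESS;
for (int i = 0; i < 1000 && status == CL_SUCCESS; ++i) {
    status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &globalSize, &localSize,
                                    0, NULL, NULL);
}
clFlush(queue);   /* start submitting the batch to the device */
clFinish(queue);  /* block once, after the whole batch */
```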


@Micah, could you please tell us when we can expect SDK 2.3? The delay in command launch time and unsupported DMA is making things really difficult for us.


The Stream SDK is released at three-month intervals, so IMHO you can expect the next release in December. This is outside speculation; nobody from AMD will say when.


Same here! I have a kernel which I launch 8 times. With the GPU as device (HD5770), clEnqueueNDRangeKernel takes ~10 ms the first time, but only ~0.01 ms for each of the next 7. With the CPU (Athlon II X2 250), all 8 launches take only ~0.01 ms each. The kernel execution time is ~1 ms on the GPU and ~80 ms on the CPU. (As a side note, the clEnqueueNDRangeKernel elapsed time is taken with QueryPerformanceCounter(), while the kernel time uses clGetEventProfilingInfo(). The OS is Win7 Pro 64-bit.)


eklund.n,
This is something we have been working on, but a large portion of the Valgrind errors are also false positives. The reason is that Valgrind doesn't handle the case where the allocation is done in user space and the deallocation is done in kernel space. Since Valgrind doesn't see the deallocation, it assumes the memory is lost.

Aha.

But this still makes it hard to test an OpenCL application for memory leaks. Is there a better tool than Valgrind?


Yes, it will be improved in SDK 2.3.

Ravi,
We don't specify exact release dates, but it will be in December.

Thanks Micah, looking forward to it.

joker-eph
Journeyman III

What's up with this topic? I'm coming from the CUDA world and trying to play around with my Radeon 6970. I'm a bit surprised by this clEnqueueNDRangeKernel issue!

I installed the 2.6 SDK over Catalyst 11.1, and while a small saxpy kernel runs a little bit faster than on my Tesla C2050 (less than 1 ms) according to the OpenCL event profiling, the overhead of clEnqueueNDRangeKernel is 23 ms! That's huge for an asynchronous call; it kills any chance of getting performance.

Today I installed the 12.1 release of the Catalyst driver over the 2.6 SDK, and it doesn't change anything. When I select my CPU device, the call to clEnqueueNDRangeKernel is really asynchronous and takes almost no time. On my Nvidia platform it's around 0.1 ms.

My configuration is:

- Ubuntu 11.10 64bits

- AMD FX-6100

- 8GB RAM

- AMD Radeon HD 6970 Cayman

Version information from clinfo:

Device OpenCL C version : OpenCL C 1.1

Driver version: CAL 1.4.1664

Version: OpenCL 1.1 AMD-APP (851.4)


The issue you are likely seeing is that your commands are sitting in a queue and not being flushed.  Try the following:

- Enqueue your commands, saving final event in queue

- Call clFlush() for your command queue

- Then call clGetEventInfo(), polling CL_EVENT_COMMAND_EXECUTION_STATUS until the event's status changes to CL_SUBMITTED
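In host code, the three steps above might look roughly like this (a sketch; 'queue' and the final event from your last enqueue are assumed to exist):

```c
cl_event last_event;  /* assumed: returned by the final clEnqueue* call */
clFlush(queue);       /* push the queued commands toward the device */

cl_int exec_status = CL_QUEUED;
do {
    clGetEventInfo(last_event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(exec_status), &exec_status, NULL);
} while (exec_status > CL_SUBMITTED);  /* status values count down:
                                          CL_QUEUED > CL_SUBMITTED >
                                          CL_RUNNING > CL_COMPLETE */
```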


Hi Jeff,

Thanks for your answer, but I don't understand how I should call clFlush *while* my call to clEnqueueNDRangeKernel is blocking! Do you mean I should call clFlush *before* calling clEnqueueNDRangeKernel?

My pattern is: I enqueue a sequence of 1) write buffer, 2) enqueue kernel, 3) read buffer; each of these steps is followed by a clWaitForEvents(), which implies the queue is empty when I submit the next command.

To be clear: the more-than-20 ms is the time of the clEnqueueNDRangeKernel call alone, not anything related to the time between when I submit to the queue and when the command is issued from the queue for execution.

Thanks,

Mehdi


clEnqueueNDRangeKernel is not a blocking call, and can't be. Blocking calls are clWaitForEvents, clFinish, and clEnqueueReadBuffer/clEnqueueWriteBuffer/clEnqueueMapBuffer/etc. with the blocking flag set. Calling clEnqueueNDRangeKernel just adds (enqueues) the command to the command queue.


Jeff,

This is exactly what I'm complaining about! 🙂

clEnqueueNDRangeKernel() should be asynchronous (and, OK, *is* asynchronous), but when called on a queue mapped to a GPU device it has heavy latency, which I abusively called "blocking".

My code is:

timer_start(); // gettimeofday, initialize internal counter
clEnqueueNDRangeKernel(...);
timer_stop_display(); // gettimeofday, compare to internal counter and display

I get more than 20 ms for this single call, but only if the queue is associated with the GPU. If I choose the CPU device it takes "no" time.

If this is an unknown issue for you, then I may have a problem with my setup. I'll start writing a small test case to reproduce it.


If you never flush the command queue, how can the device start processing the commands?  Without clFlush() your commands will sit in the command queue.  AFAIK, nvidia doesn't support threading for the command queue, so you won't see this problem there.


I can't see your point about threading? I don't have any host threads...

I do not believe I *have to* call clFlush for execution (it works well without it...), and I can't find any reference to that in the standard.

I'm writing a small test case to reproduce the issue outside of my framework.


The point with threading is that you might have multiple command queues.  If enqueuing a kernel to one command queue was blocking, then you wouldn't be able to overlap execution with work in the other command queue.

If you want to start execution *now*, then you had better call clFlush(), else the commands will get processed at the driver's leisure.


I wrote a small test case to reproduce it: http://pastebin.com/fije3CKf ; save it to a C file and compile it with:

gcc my_source_file.c -L /opt/AMDAPP/lib/x86_64/ -l OpenCL -I /opt/AMDAPP/include/ -o test_latency

After allocating OpenCL objects in main(), the function openclSimpleCopy() is called and always executes this sequence:

  1. Initialize 2 host buffers on the stack, "a" and "b", of size "n"
  2. Allocate 2 OpenCL buffers on the device, "a_dev" and "b_dev"
  3. Write host buffer "a" to the OpenCL mem object "a_dev" (a synchronous write, followed by a clWaitForEvents(), so the queue is then empty)
  4. Run a kernel that does a simple copy b = a; print the time used just to enqueue the kernel asynchronously
  5. Read from the second OpenCL mem object "b_dev" to host buffer "b"
  6. Check that a == b for 0 < i < n
  7. Release the OpenCL mem objects

The interesting thing: if it is called with n > 10^6, the kernel enqueue takes time; otherwise it's almost instantaneous.

At the beginning of main() you can change the device id to choose (or change CL_DEVICE_TYPE_ALL to CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU).

If I select the CPU with AMD APP 2.6:

$ ./test_latency
Run with n = 10000000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0

If I select my AMD GPU:

~$ ./test_latency
Run with n = 10000000
Time for Enqueue : 24.7
Time for Enqueue : 15.7
Time for Enqueue : 19.2
Time for Enqueue : 18.4
Time for Enqueue : 18.3
Time for Enqueue : 19.3
Time for Enqueue : 18.3
Time for Enqueue : 18.3
Time for Enqueue : 19.2
Time for Enqueue : 18.3
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.1
Time for Enqueue : 0.1
Time for Enqueue : 0.1
Time for Enqueue : 0.0

On an Nvidia 4.1 runtime:

$ ./test_latency
./test_latency: /usr/lib/libOpenCL.so.1: no version information available (required by ./test_latency)
Run with n = 10000000
Time for Enqueue : 0.4
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0


I added the same clEnqueueNDRangeKernel() before your time measurement; then I got 0.0 in both cases. IMHO the OpenCL runtime transfers your buffers beforehand, so I don't see an issue here.


Thanks for checking that, nou, I had issues accessing the source code due to firewall restrictions.

One thing to note is that our OpenCL implementation uses deferred allocation, so we don't allocate resources until first use. On top of that, VidMM in Windows also doesn't migrate allocations to a device until the device accesses them. In your case, joker-eph, it seems likely you're measuring the overhead of VidMM migrating resources to the GPU. The larger the resource, the longer it takes VidMM to migrate it, as it first zeroes the memory due to content protection.


I don't know what VidMM stands for? Video Memory Management? By the way, I'm on Linux.

So in your implementation, when I call clEnqueueWriteBuffer() on a queue associated with a GPU, the runtime only takes a copy of my data on the host, and will really send the data to the GPU when a kernel needs it?

Then, if I understand correctly, I cannot measure the transfer of data from host to device in any way, since even if I wait on the event associated with clEnqueueWriteBuffer() I will time only the runtime's caching of the data?

It also means that I cannot transfer data "in advance" for later use on the GPU?

This is very interesting to know! In any case, I think it should be handled by the runtime and shouldn't "block" clEnqueueNDRangeKernel() in any way. I mean, even if there is some memory to manage before launching the kernel, this is only an asynchronous call to enqueue a kernel for a future launch.

Thanks!



If you use clEnqueueWriteBuffer, it will depend on some factors. First, if the buffer is in host memory, it will be copied immediately (if not currently in use). If the buffer is on the GPU, we will schedule a copy. If the command is blocking, it will happen immediately, but you may have to wait for previous commands to complete.

Again, I can't access your code right now, so I don't know exactly what you are doing. But it sounds to me like you are measuring some other cost, not the cost of clEnqueueNDRangeKernel().


I don't know what you mean exactly by "not the cost of the clEnqueueNDRangeKernel()"?

Anyway, here is the relevant part of the code, showing that I create the OpenCL buffers, call clEnqueueWriteBuffer() on a queue associated with the GPU (I tested both synchronous and asynchronous), then call clFlush() and clWaitForEvents(). Finally I time the call to clEnqueueNDRangeKernel().

  // Buffers on the device
  cl_mem a_dev = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                n * sizeof(int), NULL, &cl_error);
  OpenCL_test_execution("Create Buffer", cl_error);

  cl_mem b_dev = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                n * sizeof(int), NULL, &cl_error);
  OpenCL_test_execution("Create Buffer", cl_error);

  // 3 events is enough here
  cl_event event1;
  cl_event event2;
  cl_event event3;

  // Initialize buffer on the device
  cl_error = clEnqueueWriteBuffer(queue, a_dev, CL_TRUE, 0,
                                  n * sizeof(int), a, 0, NULL, &event1);
  OpenCL_test_execution("Write to Buffer", cl_error);

  // Shouldn't be useful, I used a blocking write!
  clFlush(queue);
  clWaitForEvents(1, &event1);

  // Arguments for the kernel
  cl_error = clSetKernelArg(kernel, 0, sizeof(a_dev), &a_dev);
  OpenCL_test_execution("Set argument 0", cl_error);
  cl_error = clSetKernelArg(kernel, 1, sizeof(b_dev), &b_dev);
  OpenCL_test_execution("Set argument 1", cl_error);

  timer_start();
  cl_error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n,
                                    NULL, 1, &event1, &event2);
  timer_stop_display("Time for Enqueue");


If you make clEnqueueWriteBuffer() blocking, as you've done, then there's no need to wait for events.  Similarly, there's no need to pass in that event as a dependency for clEnqueueNDRangeKernel.

In your case, b_dev won't be allocated until the kernel dispatch.  You can avoid this by touching the resource with the device before you start your timer.

Warming up buffers/kernels is a common technique as then you remove one-time costs from measurements.
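A hedged sketch of that warm-up, reusing the names from the code above: dispatch once untimed so the deferred allocation/migration happens outside the measured region.

```c
/* Warm-up: touch a_dev and b_dev with the device once, untimed. */
cl_error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n,
                                  NULL, 0, NULL, NULL);
clFinish(queue);  /* buffers are now allocated/resident on the device */

/* Now the timed enqueue measures only the enqueue itself. */
timer_start();
cl_error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n,
                                  NULL, 1, &event1, &event2);
timer_stop_display("Time for Enqueue");
```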
