Hello!
I get very high times on the clEnqueueNDRangeKernel() call when using GPU. Look at the code below. I use CL_QUEUE_PROFILING_ENABLE on the command queue to do profiling on 'event', and (end-start) to get time of function call.
On CPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
end - start = 0.003000 ms
On GPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
end - start = 4.194000 ms
If I use big global sizes this extra time is irrelevant, and GPU is faster than CPU. But I want my implementation to be fast on GPU even at small global sizes, and this big function call time stops that. Am I doing something wrong?
clFinish(hd->command_queue);
gettimeofday(&start, NULL);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL,
                                &globalSize, &localSize, 0, NULL, event);
gettimeofday(&end, NULL);
clFinish(hd->command_queue);
There is some overhead to launching a kernel; according to the OpenCL programming guide it is 25 vs. 225 microseconds.
I recommend reading section 4.9.1: http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf
Sure, there is overhead to launching a kernel. But over 4 ms just to return from clEnqueueNDRangeKernel()?! That's more than some of my execution times.
I've read the entire Programming Guide previously.
I can list the system:
i7 950
Gigabyte GA-X58A-UD3R
6 GB RAM
HD5870 (x2)
That does sound pretty excessive. I went ahead and did a similar measurement, and I'm clocking ~10 microseconds for both the CPU and GPU (Cypress). This is on a Windows 7 system using QueryPerformanceCounter to measure the time, which should be very accurate.
Exactly. I'm running Ubuntu 10.04.1 x86_64 with Stream SDK 2.2 and Catalyst 10.9 hotfix.
I can't understand what causes this. Anyone?
eklund.n,
I think you are getting reasonable time values from the clGetEventProfilingInfo API.
It might be the case that the function gettimeofday() itself takes considerable time. The time values returned by the OpenCL APIs seem more trustworthy, as they were created specifically for measuring time intervals.
OK, it's a fair thought that the gettimeofday() function takes time, but it doesn't do that anywhere else, so I think gettimeofday() reports the right time. Besides, if I enqueue larger work that takes up to 500 ms, clEnqueueNDRangeKernel() still returns after some 4 ms, so the time from the profiling info isn't really related. I will nonetheless test with other external timing functions.
I will also try a system with an HD5850, running both Windows 7 and Linux, to see if it's a setup problem and not my code. Thanks for minding my troubles.
I have the opposite experience: gettimeofday() returns about 1-5% lower times than OpenCL profiling.
Now I have tested a counter with higher resolution, but I still get a > 4 ms call time on the GPU. The call on the CPU takes 0.003858 ms.
I will test on other systems soon.
edit -> I have run this on ubuntu10.04x86_64/nVidia-9400/e8500/nvidia-cuda-sdk, where the function call returned after 0.0044 ms for the GPU. Something is wrong with my original system setup, but what?
edit 2 -> I have tested on ubuntu10.04x86_64/hd5850/i5-750/ati-stream-sdk, where the function call returns after 2.712602 ms for the GPU and 0.009748 ms for the CPU.
...
struct timespec start, end;
double elapsedTime;
...
/* CLOCK_PROCESS_CPUTIME_ID counts CPU time used by this process;
   CLOCK_MONOTONIC would measure wall-clock time instead */
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL,
                                &globalSize, &localSize, 0, NULL, &event);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
...
elapsedTime = (end.tv_sec - start.tv_sec) * 1000.0
            + (end.tv_nsec - start.tv_nsec) / 1000000.0;
printf("The EnqueueNDRangeKernel call took %f ms\n", elapsedTime);
...
I can't draw any conclusions other than that it is a slow SDK implementation for the GPU on Linux from ATI, since that is where the times differ. (ATI GPU Windows ~10 µs according to dravisher, ATI CPU Linux ~5 µs, nVidia GPU Linux ~5 µs, but ATI GPU Linux ~5 ms.)
Can someone else confirm? Developers?
I have this problem too,
My system is an x58 ASUS SuperComputer, i7, 12 GB DDR3 1333 MHz, 2x ATI HD5870, NVIDIA GTX470, OpenSUSE 11.2 x64 2.6.31.5, ATI driver 10.10, Stream SDK 2.2. The ATI GPU on Linux takes a really long time just to execute clEnqueueNDRangeKernel.
thanks, I hope we will see next release soon.
Thanks for clearing this out. Good job!
Btw, the nVidia SDK gave no Valgrind errors, whereas the ATI SDK showed 1,000,000+ errors. I guess this will also improve as the ATI SDK matures.
I'm working on an implementation of Genetic Programming in OpenCL[1]. I've observed that the kernel launch time on the GPU is far too high (~2500x slower!) under the current AMD SDK. Below are the differences I'm getting compared with Nvidia's OpenCL implementation:
- Machine 1: AMD SDK v2.2, fglrx 10.11 driver, ATI Radeon HD5750, GNU/Linux 2.6.36 64-bit, Intel Xeon W3680:
Average kernel launch time (GPU): 0.0426978s (~42ms)
Average kernel launch time (CPU): 0.00106843s (~1ms)
- Machine 2: Nvidia GTX 285, GNU/Linux 2.6.31 64-bit, Intel Core i7 920
Average kernel launch time (GPU): 0.0000179311s (~0.018ms)
* These measurements were obtained via "getProfilingInfo" as described in "ATI Programming Guide - OpenCL".
Will the upcoming AMD SDK (v2.3) fix--or at least greatly improve--the kernel launch time?
1. http://sourceforge.net/projects/gpocl/
I can enqueue ~10000 kernel launches per second (though with a very simple kernel on a small dataset) on my box. Try enqueuing multiple actions and then calling clFinish.
@Micah, could you please tell us when we can expect SDK 2.3? The delay in command launch time and the unsupported DMA are making things really difficult for us.
Stream SDK releases come at three-month intervals, so IMHO you can expect the next release in December. This is outside speculation; nobody from AMD will tell when.
Same here! I have a kernel which I launch 8 times. With the GPU as device (HD5770), clEnqueueNDRangeKernel takes ~10 ms the first time, but only ~0.01 ms for each of the next 7. With the CPU (Athlon II X2 250), all 8 launches take only ~0.01 ms each. The kernel execution time is ~1 ms on the GPU and ~80 ms on the CPU. (As a side note, the clEnqueueNDRangeKernel elapsed time is taken with QueryPerformanceCounter(), while the kernel time uses clGetEventProfilingInfo(). The OS is Win7 Pro 64-bit.)
aha.
But this still makes it hard to test an OpenCL application for memory leaks. Is there a better tool than Valgrind?
Thanks Micah, Looking forward to it
What's up on this topic? I'm coming from the CUDA world and trying to play around with my Radeon 6970, and I'm a bit surprised by this clEnqueueNDRangeKernel issue!
I installed the 2.6 SDK over Catalyst 11.1, and while a small saxpy kernel runs a little faster than on my Tesla C2050 (less than 1 ms) according to the OpenCL event profiling, the overhead of clEnqueueNDRangeKernel is 23 ms! That's huge for an asynchronous call; it kills any chance of getting performance.
Today I installed the 12.1 release of the Catalyst driver over the 2.6 SDK, and it doesn't change anything. When I select my CPU device, the call to clEnqueueNDRangeKernel is really asynchronous and takes almost nothing. On my Nvidia platform it's around 0.1 ms.
My configuration is:
- Ubuntu 11.10 64bits
- AMD FX-6100
- 8GB RAM
- AMD Radeon HD 6970 Cayman
Version information from clinfo:
Device OpenCL C version : OpenCL C 1.1
Driver version: CAL 1.4.1664
Version: OpenCL 1.1 AMD-APP (851.4)
The issue you are likely seeing is that your commands are sitting in a queue and not being flushed. Try the following:
- Enqueue your commands, saving final event in queue
- Call clFlush() for your command queue
- Then call clGetEventInfo(), checking CL_EVENT_COMMAND_EXECUTION_STATUS until the event changes to submitted
Hi Jeff,
Thanks for your answer, but I don't understand how I should call clFlush *while* my call to clEnqueueNDRangeKernel is blocking!! Do you mean I should call clFlush *before* calling clEnqueueNDRangeKernel?
My pattern is: I enqueue a sequence of 1) write buffer, 2) enqueue kernel, 3) read buffer; each of these steps is followed by a waitForEvent(), which implies the queue is empty when I submit the next command.
To be clear: the more than 20 ms is the time of the clEnqueueNDRangeKernel call alone, not anything related to the time between when I submit to the queue and when the command is issued from the queue for execution.
Thanks,
Mehdi
clEnqueueNDRangeKernel is not a blocking call, and can't be. Blocking calls are clWaitForEvents, clFinish, and clEnqueueReadBuffer/clEnqueueWriteBuffer/clEnqueueMapBuffer/etc. with the blocking flag set. Calling clEnqueueNDRangeKernel just adds (enqueues) the command to the command queue.
Jeff,
This is exactly what I'm complaining about! 🙂
clEnqueueNDRangeKernel() should be asynchronous (and, OK, *is* asynchronous), but when called on a queue mapped to a GPU device it has a heavy latency, which I abusively called "blocking".
My code is:
timer_start(); // gettimeofday, initialize internal counter
clEnqueueNDRangeKernel(...);
timer_stop_display(); // gettimeofday, compare to internal counter and display
I get more than 20 ms for this single call, but only if the queue is associated with the GPU. If I choose the CPU device it takes "no" time.
If this is an unknown issue to you, then I may have a problem with my setup. I will start writing you a small test case to reproduce it.
If you never flush the command queue, how can the device start processing the commands? Without clFlush() your commands will sit in the command queue. AFAIK, nvidia doesn't support threading for the command queue, so you won't see this problem there.
I can't see your point about threading? I don't have any host threads...
I do not believe I *have to* call clFlush for execution (it works fine without it...), and I can't find any reference in the standard about that.
I'm writing a small test case to reproduce the issue outside of my framework.
The point with threading is that you might have multiple command queues. If enqueuing a kernel to one command queue was blocking, then you wouldn't be able to overlap execution with work in the other command queue.
If you want to start execution *now*, then you had better call clFlush(), or else the commands will get processed at the driver's leisure.
I wrote a small test case to reproduce it: http://pastebin.com/fije3CKf ; save it to a C file and compile it with:
gcc my_source_file.c -L /opt/AMDAPP/lib/x86_64/ -l OpenCL -I /opt/AMDAPP/include/ -o test_latency
After allocating the OpenCL objects in main(), the function openclSimpleCopy() is called and always executes this sequence:
The interesting thing is: if it is called with n > 10^6, the kernel enqueue takes time; otherwise it's almost instantaneous.
At the beginning of main() you can change the device id to choose (or change CL_DEVICE_TYPE_ALL to CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU).
If I select the CPU with AMD APP 2.6:
$ ./test_latency
Run with n = 10000000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
If I select my AMD GPU:
~$ ./test_latency
Run with n = 10000000
Time for Enqueue : 24.7
Time for Enqueue : 15.7
Time for Enqueue : 19.2
Time for Enqueue : 18.4
Time for Enqueue : 18.3
Time for Enqueue : 19.3
Time for Enqueue : 18.3
Time for Enqueue : 18.3
Time for Enqueue : 19.2
Time for Enqueue : 18.3
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.1
Time for Enqueue : 0.1
Time for Enqueue : 0.1
Time for Enqueue : 0.0
On the Nvidia 4.1 runtime:
$ ./test_latency
./test_latency: /usr/lib/libOpenCL.so.1: no version information available (required by ./test_latency)
Run with n = 10000000
Time for Enqueue : 0.4
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Time for Enqueue : 0.2
Run with n = 100000
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
Time for Enqueue : 0.0
I added the same clEnqueueNDRangeKernel() before your time measurement; then I got 0.0 in both cases. IMHO the OpenCL runtime transfers your buffers beforehand, so I don't see an issue here.
Thanks for checking that, nou, I had issues accessing the source code due to firewall restrictions.
One thing to note is that our OpenCL implementation uses deferred allocation so we don't allocate resources until first use. On top of that, VidMM in Windows also doesn't migrate allocations to a device until the device accesses it. In your case, joker-eph, it seems likely you're measuring the overhead of VidMM migrating resources to the GPU. The larger the resource, the longer it will take VidMM to migrate the resource as it first zeroes the memory due to content protection.
I don't know what VidMM stands for? Video Memory Management? By the way, I'm on Linux.
So in your implementation, when I call enqueueWriteBuffer() on a queue associated with a GPU, the runtime only makes a copy of my data on the host and will really send the data to the GPU when a kernel needs it?
Then, if I understand correctly, I cannot measure the transfer of data from the host to the device in any way, since even if I wait on the event associated with enqueueWriteBuffer() I will only time the runtime's caching of the data?
It also means that I cannot transfer data "in advance" for later use on the GPU?
This is very interesting to know! In any case I guess it should be handled by the runtime and shouldn't "block" clEnqueueNDRangeKernel() in any way. I mean, even if there is some memory to manage before launching the kernel, this is only an asynchronous call that enqueues a kernel for a future launch.
Thanks !
If you use clEnqueueWriteBuffer, then it will depend on some factors. First, if the buffer is in host memory, then it will be copied immediately (if not currently in use). If the buffer is on the GPU, then we will schedule a copy. If the command is blocking, then it will happen immediately, but you may have to wait for previous commands to complete.
Again, I can't access your code right now, so I don't know exactly what you are doing. But it sounds to me like you are measuring some other cost, not the cost of the clEnqueueNDRangeKernel().
I don't know exactly what you mean by "not the cost of the clEnqueueNDRangeKernel()"?
Anyway, here is the relevant part of the code, showing that I create OpenCL buffers, call clEnqueueWriteBuffer() on a queue associated with the GPU (I tested both synchronous and asynchronous), then call clFlush() and clWaitForEvents(). Finally I time the call to clEnqueueNDRangeKernel().
// Buffers on the device
cl_mem a_dev = clCreateBuffer(context,
CL_MEM_READ_WRITE,
n * sizeof(int),
NULL,
&cl_error);
OpenCL_test_execution("Create Buffer",cl_error);
cl_mem b_dev = clCreateBuffer(context,
CL_MEM_READ_WRITE,
n * sizeof(int),
NULL,
&cl_error);
OpenCL_test_execution("Create Buffer",cl_error);
// 3 events are enough here
cl_event event1;
cl_event event2;
cl_event event3;
// Initialize buffer on the device
cl_error = clEnqueueWriteBuffer(queue,
a_dev,
CL_TRUE,
0,
n * sizeof(int),
a,
0,
NULL,
&event1);
OpenCL_test_execution("Write to Buffer",cl_error);
// Shouldn't be useful, I used a blocking write !
clFlush(queue);
clWaitForEvents(1,&event1);
// Arguments for the kernel
cl_error = clSetKernelArg(kernel,0,sizeof(a_dev), &a_dev);
OpenCL_test_execution("Set argument 0 ",cl_error);
cl_error = clSetKernelArg(kernel,1,sizeof(b_dev), &b_dev);
OpenCL_test_execution("Set argument 1",cl_error);
timer_start();
cl_error = clEnqueueNDRangeKernel(queue,
kernel,
1,
NULL,
&n,
NULL,
1,
&event1,
&event2);
timer_stop_display("Time for Enqueue");
If you make clEnqueueWriteBuffer() blocking, as you've done, then there's no need to wait for events. Similarly, there's no need to pass in that event as a dependency for clEnqueueNDRangeKernel.
In your case, b_dev won't be allocated until the kernel dispatch. You can avoid this by touching the resource with the device before you start your timer.
Warming up buffers/kernels is a common technique, as it removes one-time costs from the measurements.