
Archives Discussions

Journeyman III

Long time between enqueue and execute on GPU

The profiler is showing that it takes about 0.2 ms from the time clEnqueueNDRangeKernel returns until the actual kernel starts. This really adds up. I know you should queue up lots of kernels, but that often isn't an option. Is this going to be fixed anytime soon?

12 Replies

There will be a delay unless you flush the commands to the GPU immediately. That means issuing a blocking call such as clFinish(). The actual batch overhead is less than 60 µs on most systems. Per-dispatch overhead (i.e. for each clEnqueueNDRangeKernel call) is less than 8 µs.



Thank you for your answer. I have tried both with and without clFlush() immediately after the clEnqueueNDRangeKernel call, but there appears to be no noticeable improvement. We have been experiencing these slow launch times for a while on at least 5 different systems of varying types, from server and desktop hardware to embedded systems, and the times seem to be fairly consistently slow. Example programs do not demonstrate this time to launch.


clFlush() only wakes up the command processing thread; it doesn't break the batch. So if more commands come into the queue while the thread is active, it will keep adding commands and hold off flushing them to the GPU.
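A minimal host-side sketch of the difference, assuming `queue`, `kernel`, and `gws` are already-created objects (hypothetical names). clFlush() merely asks the driver to start processing, while clWaitForEvents() implies a flush and blocks until the dispatch has actually run:

```c
/* Sketch (assumed names: queue, kernel, gws).  clFlush() is a hint;
 * clWaitForEvents() is a hard barrier that forces submission. */
cl_event ev;
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &gws, NULL, 0, NULL, &ev);
if (err == CL_SUCCESS) {
    clFlush(queue);            /* wake the driver's command thread */
    clWaitForEvents(1, &ev);   /* implicit flush + wait for completion */
    clReleaseEvent(ev);
}
```

clFinish(queue) gives the same guarantee for every command in the queue, not just this one event.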

Adept I

Just to add to this, I face the same problem. I have a compute-intensive kernel. Previously I used to profile using the clock_gettime() function on Linux; then I started using clGetEventProfilingInfo(). There is a huge timing difference when I use CL_PROFILING_COMMAND_SUBMIT versus CL_PROFILING_COMMAND_START as the start time. The timing dump is as follows:


err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &startTime, NULL );

err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &stopTime, NULL );

Iteration 1 takes 8.625875 ms

Iteration 2 takes 10.488933 ms

Iteration 3 takes 8.605963 ms

Iteration 4 takes 8.602357 ms

Iteration 5 takes 12.417702 ms

Iteration 6 takes 8.600140 ms

Iteration 7 takes 8.576233 ms

Iteration 8 takes 12.780842 ms

Iteration 9 takes 8.640809 ms

Iteration 10 takes 8.582085 ms

Iteration 11 takes 14.690402 ms

Iteration 12 takes 8.643072 ms

Iteration 13 takes 8.581202 ms

Iteration 14 takes 14.774370 ms

Iteration 15 takes 8.648169 ms

Iteration 16 takes 17.066158 ms

Iteration 17 takes 8.671902 ms

Iteration 18 takes 8.562560 ms

Iteration 19 takes 14.703254 ms

Iteration 20 takes 8.661131 ms

Iteration 21 takes 8.571542 ms

Iteration 22 takes 13.148671 ms

Iteration 23 takes 8.639865 ms

Iteration 24 takes 8.591887 ms

Iteration 25 takes 14.783643 ms


err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL );

err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &stopTime, NULL );

Iteration 1 takes 7.775704 ms

Iteration 2 takes 7.769630 ms

Iteration 3 takes 7.784592 ms

Iteration 4 takes 7.757333 ms

Iteration 5 takes 7.775111 ms

Iteration 6 takes 7.769630 ms

Iteration 7 takes 7.754963 ms

Iteration 8 takes 7.765926 ms

Iteration 9 takes 7.762370 ms

Iteration 10 takes 7.785481 ms

Iteration 11 takes 7.768297 ms

Iteration 12 takes 7.769778 ms

Iteration 13 takes 7.781185 ms

Iteration 14 takes 7.763111 ms

Iteration 15 takes 7.776444 ms

Iteration 16 takes 7.775704 ms

Iteration 17 takes 7.766667 ms

Iteration 18 takes 7.767852 ms

Iteration 19 takes 7.769629 ms

Iteration 20 takes 7.762222 ms

Iteration 21 takes 7.760149 ms

Iteration 22 takes 7.775259 ms

Iteration 23 takes 7.747111 ms

Iteration 24 takes 7.766815 ms

Iteration 25 takes 7.770667 ms

I know there should be a delay in kernel setup, but I expected it to be in microseconds. Here the difference is almost a millisecond.

Also, what is the reason for the bursty timings in the first case?
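For reference, here is a self-contained sketch of that measurement (placeholder kernel; error checking abbreviated; requires an OpenCL runtime and GPU). The queue must be created with CL_QUEUE_PROFILING_ENABLE or clGetEventProfilingInfo() will fail:

```c
/* Sketch: compare queue->submit, submit->start (launch latency) and
 * start->end (pure execution) for one dispatch. */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void bump(__global float *a) {"
    "    a[get_global_id(0)] += 1.0f;"
    "}";

int main(void) {
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "bump", NULL);

    size_t n = 1 << 20;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &ev);
    clFinish(q);

    cl_ulong tq, ts, tb, te;   /* queued, submitted, started, ended (ns) */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof tq, &tq, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof ts, &ts, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof tb, &tb, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof te, &te, NULL);

    printf("queue->submit: %.3f ms\n", (ts - tq) / 1e6);
    printf("submit->start: %.3f ms\n", (tb - ts) / 1e6);
    printf("start->end:    %.3f ms\n", (te - tb) / 1e6);

    clReleaseEvent(ev); clReleaseMemObject(buf); clReleaseKernel(k);
    clReleaseProgram(prog); clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```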


You guys all seem to have totally unrealistic expectations here, which seem to derive from a lack of understanding of fundamental computer and operating-system architecture. E.g. the first timing set is bursty because you're not running on a real-time operating system and other events can get in the way (and, as Jeff enlightened us, clFlush() isn't even a synchronous call in AMD's driver).

And the whole reason these things have queues, buffers, and flush calls is that a complete round-trip on its own is known to be expensive, limited by the laws of physics (e.g. RC delays in switching) as well as operating-system overheads.

Timing 'enqueue' to 'complete' is like saying it takes 6 months (say) to make every car on the road because it takes that long to dig the oil, iron ore and coal out of the ground, make the steel, manufacture all the parts, and roll it off the production line.  But it doesn't really - it takes about 15 minutes (or whatever) to roll out each new one in turn because there's a whole lot of production 'in the pipeline'.

This pipelining is pervasive throughout every level of every system, in computers from memory fetch to instruction execution to I/O: OpenCL isn't the only API which exposes this to the programmer (and it appears in other complex systems too, from manufacturing to biological systems). And even with AMD's plans for unifying the processing space, it will always be somewhat more expensive to move jobs between devices - and the devices themselves will be faster too, so it may still be a significant portion of time. The problem is always there, even if you suddenly don't really care about it because it seems fast enough for your own workload.

A more telling metric is how long it takes to launch one kernel after the previous one has finished - when both were added to the queue before a flush or synchronised call. If that were in the milliseconds, I'd be worried ...
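A sketch of that measurement, assuming `q` and `k` are an existing profiling-enabled queue and kernel (hypothetical names):

```c
/* Sketch: enqueue two kernels back to back, then inspect the gap
 * between END of the first and START of the second.  Both dispatches
 * are batched before the single flush/finish. */
cl_event e1, e2;
size_t n = 1 << 20;
clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &e1);
clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &e2);
clFinish(q);

cl_ulong end1, start2;
clGetEventProfilingInfo(e1, CL_PROFILING_COMMAND_END,   sizeof end1, &end1, NULL);
clGetEventProfilingInfo(e2, CL_PROFILING_COMMAND_START, sizeof start2, &start2, NULL);
printf("inter-kernel gap: %.3f us\n", (start2 - end1) / 1e3);
clReleaseEvent(e1); clReleaseEvent(e2);
```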

That's why marketing material for any data-transfer mechanism (e.g. networking) always talks about 'bandwidth' and not 'latency' - latency is an issue in some algorithms, but in general bandwidth is much more important. Not every algorithm will fit the restricted execution model of the GPU efficiently, and latency is just one restriction software design must accommodate. Many algorithms can be changed to hide it, or other work can be executed concurrently to use the otherwise idle time (e.g. running the graphics on a shared device). But others are fully exposed to it.
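One common way to hide the latency is to keep the queue primed so that dispatch overhead overlaps with execution - a sketch, again with assumed names `q` and `k`:

```c
/* Sketch: enqueue all iterations up front, flush once, then wait on
 * each event in turn.  Kernel i+1's launch overhead overlaps with
 * kernel i's execution instead of being paid serially. */
#define N_ITER 16
cl_event ev[N_ITER];
size_t n = 1 << 20;
for (int i = 0; i < N_ITER; ++i)
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &ev[i]);
clFlush(q);                      /* one flush for the whole batch */
for (int i = 0; i < N_ITER; ++i) {
    clWaitForEvents(1, &ev[i]);  /* consume iteration i's results here */
    clReleaseEvent(ev[i]);
}
```

This only helps when the iterations are independent enough to be enqueued ahead of time, which (as the original poster notes) is not always an option.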

You can't just replace a single matlab routine with a call to opencl and expect miracles.


I have expectations based on other companies' drivers, which do not experience such issues. You seem to be going on a rant about my lack of understanding of the hardware involved rather than addressing the actual problem. I have since spoken with someone at AMD who acknowledged this as a real problem they experience internally, so it seems you are the one who needs to rethink your expectations.

Please do not insult my intelligence especially when you are simply wrong.


Well, I'm not the one having performance problems.


Well, according to you I don't have performance problems; it is just my expectations.

Now that I think about it, why try to optimize? I'll just do what you do and set really low expectations so I will always meet them.

It is funny, because I am questioning long times between two driver events (not any code I wrote), which I have no control over, yet you seem to be implying I am having performance problems.


"I know there should be a delay in kernel setup, but I expected it to be in microseconds. Here there is a difference of almost a millisecond.

Also, what is the reason for the bursty timings in the first case?"

The extra burst times are probably from the OS updating the video display, which happens every 30 ms.

I wonder if your extra 800 µs is for transferring data used by your kernel. It's quite uniform. Try using half the data and see what happens.

Journeyman III

Note for AMD: Why do posts get set to "Assumed Answered" when there hasn't been a single answer marked as helpful or correct? In fact, there hasn't been an attempt at answering the original issue, except to tell me to flush the queue, which was already happening.

Journeyman III

We experience similar issues with kernel startup time. We've tried AMD 7970 vs. GeForce 590 vs. GeForce 680 configurations and found that compute-intensive operations are way faster on the new AMD architecture (which is much more suitable for our tasks than Kepler). But all the time we've saved on computation is then wasted on startups of cheaper kernels. In our situation this means we have to invest in older NVIDIA Tesla cards instead of waiting for FirePro cards based on the new architecture. In our particular case we are talking about ~200 professional cards. Let that number be a lower-bound estimate of how much this problem really costs. Can we expect any concrete feedback from AMD developers on the issue?

Adept II

I can confirm it. Launch latency is about 200 µs and constant CPU overhead is about 20 µs per call.

Can we expect some progress with SI and the new runtime?