laughingrice
Journeyman III

kernel launch time way too long

I'm trying to convert some code to OpenCL on the E-350 (Brazos) architecture under Windows 7 64-bit (although the application itself is 32-bit).

Timing the following three lines of code takes about 1.2 ms:

size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
clFinish(queue);



I believe that kernel launch overhead should be more in the area of 30 us, so I'm guessing that I'm doing something wrong. The kernel itself is empty, so I should only be seeing the kernel launch overhead. The original C code runs in about 1 ms, so 1.2 ms of kernel launch overhead is unacceptable. Any ideas what I may be doing wrong?

 

Thanks

 


You can't base performance on a single launch, as there are a lot of one-time costs. For example, on Windows, buffers aren't allocated on the GPU until the GPU uses them, so the first time you touch a resource there is a higher cost.

Jeff


I know that; it's not only buffer allocation but can also be deferred final compilation. In this case, though, there are no buffers, and what I posted was the best run time out of 4; the other three were worse.


laughingrice,

I suggest you time the region enclosed between the lines below and compute the average time per launch.

clFinish(cmdQueue);
--------------------------------------------
for (int i = 0; i < 10000; i++)
{
    clEnqueueNDRangeKernel(...);
}
clFinish(cmdQueue);
---------------------------------------------

I hope you will see some improvement.

Thanks


Tried that

 

Ver1:
------------
for (int i = 0; i < 1000; i++)
{
    clEnqueueNDRangeKernel(...);
}
clFinish(cmdQueue);
------------

Ver2:
------------
for (int i = 0; i < 1000; i++)
{
    clEnqueueNDRangeKernel(...);
    clFinish(cmdQueue);
}
------------

 

Average run time per launch for version one is ~900 us. For version two it's down to 46 ms/1000 = 46 us, which is more acceptable, but only if I have a LOT of kernels that don't require synchronization.

 

Running the same code on an NVIDIA Tesla C1060 that I have lying around, Ver1 takes about 100 us per launch (compared to 40 us for CUDA) and Ver2 takes 36 ms/1000 = 36 us. Admittedly, though, it's a much more powerful machine.

 

Is anyone else seeing these launch times?

 

Thanks


Okay, so it seems you are getting acceptable values in the second case.

Although I was expecting a better result for version 1, where kernels would be dispatched in batches rather than one after another.

I hope you put a clFinish before starting the timer in the first version.

Also, I prefer to use reliable system timers or a profiler instead of querying times via cl_event objects, as the events themselves carry some overhead.

And improvement in kernel launch time is a known feature request; I think you will find it improved compared to previous AMD APP SDK versions, and you can expect further improvements in future releases.

Thanks
