I have not worked with JOCL, but here are a few basic pointers which can be helpful:
1. I have faced some issues with event profiling myself. Better to use some standard system timers for time measurement.
2. Do a clFinish(queue) call before and after the clEnqueueNDRangeKernel. Start the timer after the first clFinish and stop it after the second, so that only the kernel is measured.
Also try something similar to measure clEnqueueReadImage time.
3. 10 ms should mean a data transfer of about 40 MB at a decent rate of 4 GB/s (4 GB/s x 0.010 s = 40 MB). That might be the size of your image, in which case the profiler reports it correctly.
4. I do not see any kernel at the above link, but maybe the kernel is not very compute intensive. Please share the kernel too.
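The timing pattern from point 2 can be sketched roughly as below. This is only an illustration in plain Java, not JOCL code: FakeQueue, enqueue, finish and timeMillis are made-up stand-ins for this example. In real host code the two hooks would be your clFinish(queue) call and your clEnqueueNDRangeKernel call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class KernelTimingSketch {

    /** Hypothetical stand-in for an in-order command queue (not JOCL API). */
    static class FakeQueue implements AutoCloseable {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private Future<?> last = null;

        /** Like an enqueue call: returns immediately, the work runs later. */
        void enqueue(long workMillis) {
            last = worker.submit(() -> {
                try {
                    Thread.sleep(workMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        /** Like clFinish(queue): blocks until all queued work has completed. */
        void finish() {
            if (last == null) {
                return;
            }
            try {
                last.get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void close() {
            worker.shutdown();
        }
    }

    /** Wall-clock time, in milliseconds, of one asynchronous operation. */
    static long timeMillis(Runnable finish, Runnable enqueue) {
        finish.run();                  // drain earlier work so it is not billed to this call
        long start = System.nanoTime();
        enqueue.run();                 // returns as soon as the command is queued
        finish.run();                  // wait until the command has actually completed
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        try (FakeQueue queue = new FakeQueue()) {
            queue.enqueue(300);        // earlier asynchronous work, e.g. an image upload
            long ms = timeMillis(queue::finish, () -> queue.enqueue(5));
            System.out.println("Time spent in kernel: " + ms + " ms");
        }
    }
}
```

If the first finish is left out, the 300 ms of pending work is still in the queue when the timer starts, so it gets counted against the kernel.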
Hope it helps.
Thanks for the input, heman!
clFinish() certainly made a difference. After adding clFinish() after clEnqueueNDRangeKernel() I get the following times:
Time spent in clEnqueueNDRangeKernel: 82
Time spent in clEnqueueReadImage: 9
Which makes more sense, but the strange thing is that clEnqueueNDRangeKernel() takes ~75-80 ms even when the kernel body is empty.
The OpenCL runtime can choose to do memory copies and other activities under the covers, in a lazy fashion. This may be a reason for the bloated time for the clEnqueueNDRangeKernel() call. But this is just an educated guess.
To verify, consider passing the "BLOCKING" argument (for example, blocking_read set to CL_TRUE in clEnqueueReadImage) to all your memory copies and other async operations to make them synchronous, and try timing after that. That might help.
The kernel launch overhead is always there, but it is generally on the order of microseconds.
Nice to hear that execution times are now being reported in the expected manner.
Are you passing any arguments to the empty kernel? Can you please attach the files (modified host code + empty kernel you used)?
Thanks guys! I'll look into this tomorrow and report my findings. I suspect I will have to try clEnqueueMapImage & friends, get some proper benchmarks and determine what's going on.