I'm just getting started with OpenCL programming, so please be gentle!
I have written a simple image-resizing kernel using JOCL (the Java OpenCL bindings) and I'm running it on an AMD HD 7970 GHz with the latest Catalyst drivers (9.002-120928m-149042C-ATI). The code works, but it is fairly slow: ~100 ms to resize a 6299x4725 image to 1/4 of its size. After running JProfiler, I find that 99% of those 100 ms are spent in clEnqueueReadImage(). What could be causing this?
I have also tried enabling OpenCL event profiling and reading the results via CL_PROFILING_COMMAND_START/CL_PROFILING_COMMAND_END, but (end - start) gives me 9506814 ns (~9.5 ms), which does not match the JProfiler results at all. Could this indicate a bug in JOCL?
Screenshot from JProfiler:
I have not worked with JOCL, but here are a few basic pointers that may help:
1. I have run into issues with event profiling myself. It's better to use standard system timers (e.g. System.nanoTime()) for time measurement.
2. Call clFinish(queue) before starting the timer, and again after clEnqueueNDRangeKernel() before stopping the timer, so you measure the actual kernel execution rather than just the (asynchronous) enqueue. Try something similar to measure the clEnqueueReadImage() time as well.
3. At a decent transfer rate of 4 GB/s, ~10 ms corresponds to about 40 MB of data, which might well be the size of your image; in that case the profiler is reporting it correctly.
4. I do not see any kernel at the link above, but perhaps the kernel is simply not very compute-intensive. Please share the kernel too.
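The timing pattern in point 2 would look roughly like the sketch below. Since this sketch has to run without an OpenCL context, the clFinish()/clEnqueueNDRangeKernel() calls are only shown as comments and a plain Runnable stands in for the GPU work; only the host-timer bracketing is real.

```java
public class KernelTiming {

    // Wall-clock timing of one unit of (possibly asynchronous) work.
    // With JOCL this would be bracketed by clFinish(queue) calls:
    //   clFinish(queue);               // drain everything queued so far
    //   t0 = System.nanoTime();
    //   clEnqueueNDRangeKernel(...);   // asynchronous enqueue
    //   clFinish(queue);               // wait for the kernel to really finish
    //   elapsed = System.nanoTime() - t0;
    static long timeMs(Runnable work) {
        long t0 = System.nanoTime();
        work.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in for the kernel + clFinish pair: ~50 ms of "GPU" work.
        Runnable gpuWork = () -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        };
        System.out.println("Time spent in kernel: " + timeMs(gpuWork) + " ms");
    }
}
```

Measure the clEnqueueReadImage() call the same way; without the clFinish() barriers, whichever call happens to block first absorbs the time of all the work queued before it.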
Hope it helps.
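For reference, here is the arithmetic behind point 3, applied to the numbers in the original post. The assumptions (RGBA with 4 bytes per pixel, "1/4th its size" meaning a quarter of the pixel count, and a 4 GB/s transfer rate) are mine, not from the post:

```java
public class TransferEstimate {

    // Size in bytes of a w x h image, assuming 4 bytes per pixel (RGBA8).
    static long imageBytes(long w, long h) {
        return w * h * 4;
    }

    // Transfer time in milliseconds for 'bytes' at 'gbPerSec' gigabytes/second.
    static double transferMs(long bytes, double gbPerSec) {
        return bytes / (gbPerSec * 1e9) * 1e3;
    }

    public static void main(String[] args) {
        long full = imageBytes(6299, 4725);   // source image: ~119 MB
        long result = full / 4;               // quarter-area result: ~29.8 MB
        System.out.printf("result: %d bytes, ~%.1f ms at 4 GB/s%n",
                result, transferMs(result, 4.0));
    }
}
```

Reading back a ~30 MB result at 4 GB/s comes out at roughly 7.4 ms, which is in the same ballpark as the ~9.5 ms that event profiling reported for clEnqueueReadImage(), so that number looks plausible.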
Thanks for the input, heman!
clFinish() certainly made a difference. After adding a clFinish() call after clEnqueueNDRangeKernel(), I get the following times:
Time spent in clEnqueueNDRangeKernel: 82 ms
Time spent in clEnqueueReadImage: 9 ms
These numbers make more sense, but the strange thing is that clEnqueueNDRangeKernel() still takes ~75-80 ms even when the kernel body is empty. Why would that be?
The OpenCL runtime can choose to perform memory copies and other work under the covers, in a lazy fashion. This may be the reason for the bloated time of the clEnqueueNDRangeKernel() call, but that is just an educated guess.
To verify, pass a blocking flag (CL_TRUE) to all your memory copies and other asynchronous operations to make them synchronous, and try timing again after that. That might help.
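To make the point above concrete without any OpenCL dependency, here is a plain-Java analogy (the ExecutorService standing in for the OpenCL command queue is my analogy, not JOCL code): timing only the asynchronous "enqueue" measures almost nothing, while the first blocking wait absorbs the real cost.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncTimingIllusion {

    // Returns { enqueueMs, totalMs } for ~100 ms of background "work".
    static long[] measure() throws Exception {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        long t0 = System.nanoTime();

        // The "enqueue" returns immediately, like a non-blocking clEnqueue* call.
        Future<?> f = queue.submit(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        });
        long enqueueMs = (System.nanoTime() - t0) / 1_000_000;

        f.get();   // blocking wait, like clFinish() or a blocking read
        long totalMs = (System.nanoTime() - t0) / 1_000_000;

        queue.shutdown();
        return new long[] { enqueueMs, totalMs };
    }

    public static void main(String[] args) throws Exception {
        long[] r = measure();
        System.out.println("enqueue: " + r[0] + " ms, total: " + r[1] + " ms");
    }
}
```

In the same way, whichever OpenCL call blocks first gets billed for all the asynchronous work queued before it, which can make an enqueue call look far more expensive than it really is.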
Kernel launch overhead is always there, but it is generally in the microsecond range.
Nice to hear that the execution times are now being reported in the expected manner.
Are you passing any arguments to the empty kernel? Could you please attach the files (modified host code plus the empty kernel you used)?
Thanks guys! I'll look into this tomorrow and report my findings. I suspect I will have to try clEnqueueMapImage() & friends, set up some proper benchmarks, and figure out what's going on.