not the actual execution time, just the function call
Hello!
The clEnqueueNDRangeKernel() call itself takes a very long time to return when I use the GPU. Look at the code below. I create the command queue with CL_QUEUE_PROFILING_ENABLE to profile 'event', and I use (end - start) to time the function call itself.
On CPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.729372 ms
end - start = 0.003000 ms
On GPU:
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED = 0.797341 ms
end - start = 4.194000 ms
With large global sizes this extra time is negligible, and the GPU is faster than the CPU. But I want my implementation to be fast on the GPU even at small global sizes, and this long function call time prevents that. Am I doing something wrong?
clFinish(hd->command_queue);
gettimeofday(&start, NULL);
status = clEnqueueNDRangeKernel(hd->command_queue, hd->kernel, 1, NULL, &globalSize, &localSize, 0, NULL, event);
gettimeofday(&end, NULL);
clFinish(hd->command_queue);
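For reference, here is a minimal sketch of how the two kinds of numbers above can be converted to milliseconds. The helper names timeval_diff_ms and profiling_diff_ms are hypothetical (not part of OpenCL), and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers. It assumes the profiling counters come from clGetEventProfilingInfo(), which reports them in nanoseconds.

```c
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical helper: wall-clock time between two gettimeofday()
 * samples, in milliseconds. This is the "end - start" figure. */
static double timeval_diff_ms(struct timeval start, struct timeval end)
{
    double sec  = (double)(end.tv_sec - start.tv_sec);
    double usec = (double)(end.tv_usec - start.tv_usec); /* may be negative */
    return sec * 1000.0 + usec / 1000.0;
}

/* Hypothetical helper: difference between two OpenCL profiling counters
 * (e.g. CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_END), in
 * milliseconds. The counters are in nanoseconds; uint64_t is used here
 * as a stand-in for cl_ulong. */
static double profiling_diff_ms(uint64_t queued_ns, uint64_t end_ns)
{
    return (double)(end_ns - queued_ns) / 1.0e6;
}
```

In the code above, the counters would only be read after the final clFinish(), because the event's profiling information is not available until the command has completed.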