Hi everyone,
I am facing a weird problem with some OpenCL code that I am developing.
In short, I have a certain number of kernels that are called inside a loop. The problem is that the total time of the program is about two or three times the execution time of the kernels, as measured with event profiling. For example, on an AMD FX8350 CPU + AMD Radeon HD7970 GPU (OpenCL running on the GPU) I obtained:
total time : 3275117 ms
kernel time : 1415446 ms
I use events for profiling, and when I remove the clWaitForEvents() call I obtain:
total time : 1857250 ms
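A minimal sketch of how the two figures can be compared, in case it helps to see the arithmetic. It assumes the per-kernel start/end timestamps have already been read with clGetEventProfilingInfo() (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which are reported in nanoseconds) and that the total time is a host wall-clock measurement. The helper names `sum_kernel_time_ns` and `host_overhead_ratio` are hypothetical, chosen only for this illustration, and this is not the actual program:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum per-kernel durations from profiling timestamps.
 * start_ns[i] / end_ns[i] are assumed to hold the values returned for
 * CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END of the i-th event,
 * both in nanoseconds. Entries with end < start are skipped. */
uint64_t sum_kernel_time_ns(const uint64_t *start_ns,
                            const uint64_t *end_ns,
                            size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (end_ns[i] >= start_ns[i]) {
            total += end_ns[i] - start_ns[i];
        }
    }
    return total;
}

/* Hypothetical helper: ratio of host wall-clock time to summed kernel time.
 * Returns 0.0 when no kernel time was recorded, to avoid dividing by zero. */
double host_overhead_ratio(uint64_t wall_ns, uint64_t kernel_ns)
{
    if (kernel_ns == 0) {
        return 0.0;
    }
    return (double)wall_ns / (double)kernel_ns;
}
```

With the figures above, this ratio is about 2.3 with clWaitForEvents() and about 1.3 without it.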
I also tried moving the clSetKernelArg() calls out of the loop, but the time gain was minimal. The amount of data written to and read from the device is also minimal, so that should not be the source of the problem (I have tested it). In any case, this seems quite weird to me, as I have never seen such overhead from clWaitForEvents(). Moreover, if I run the program on an Intel CPU I do not see such a difference.
Any clue about this behavior? Thanks in advance for your help!