I have a problem with delays in kernel execution when I request callbacks from OpenCL.
In my application, I need to execute kernels at a "very" high rate (around 300 Hz), and I need a callback to my host application every time an execution has finished. However, I am seeing large kernel-to-kernel delays when I request these callbacks, even when another kernel is already waiting in the queue.
To investigate, I have created a test program that enqueues 100 kernels into a queue. In the CodeXL timeline trace, the kernels execute back to back, with a gap of only around 1 - 3 us between them.
However, when I request a callback on one of the events, the subsequent kernel has an execution delay of around 0.25 ms, even though the callback is completely empty (just a return statement).
The callback itself is executed immediately (within a few us), but the next kernel waits for some time before it starts. The attached image shows what happens when I request a callback for every 5th execution of my kernel:
I calculated that this delay can account for up to 25% of my GPU processing power when executing at 300 Hz.
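For reference, a minimal sketch of how such an estimate can be made. It assumes that every callback stalls the queue for a fixed time and that the number of callbacks per cycle is known. The function name `stall_fraction` and its parameters are illustrative and not taken from my application:

```c
#include <assert.h>

/* Fraction of wall-clock time the GPU sits idle because of callback stalls.
 *
 * stall_s             : idle time after one callback, in seconds (assumed constant)
 * callbacks_per_cycle : callbacks requested per processing cycle (assumption)
 * rate_hz             : processing cycles per second
 */
static double stall_fraction(double stall_s, double callbacks_per_cycle,
                             double rate_hz)
{
    /* Negative inputs make no physical sense here. */
    assert(stall_s >= 0.0 && callbacks_per_cycle >= 0.0 && rate_hz >= 0.0);

    /* Idle seconds per second of wall-clock time. */
    return stall_s * callbacks_per_cycle * rate_hz;
}
```

With a 0.25 ms stall and a single callback per cycle at 300 Hz, `stall_fraction(250e-6, 1.0, 300.0)` comes out at roughly 0.075, i.e. about 7.5%. Under this model the loss grows linearly with the number of callbacks per cycle.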
Can anybody shed some light on this?