I have a program that needs to execute several kernels in succession, about 10,000 times.
The execution time of the kernels themselves seems fine to me, but when I started profiling the events, I saw a large delay before each kernel even starts executing.
Normally this wouldn't bother me much, but since I need to do this many times, these "queue" times add up and make the runtime explode. (The "wasted" time actually shows up as system time.)
I have attached example code that shows how I handle the Enqueue calls and the profiling routine I use.
An output file is also attached that shows the (start - queued) and (end - start) times, in ms, for each of my kernels.
As you can see, the average "queued" time is longer than the execution time itself.
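For reference, the two intervals in the output can be derived from the event timestamps roughly as follows. This is a simplified sketch, not the attached code: the struct `event_times_ns` and the helper names are made up for illustration, and the four fields are assumed to hold the nanosecond values that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.

```c
#include <stdint.h>

/* Hypothetical container for the four profiling timestamps of one event,
 * in nanoseconds (assumed to be filled from clGetEventProfilingInfo). */
typedef struct {
    uint64_t queued;  /* command enqueued by the host        */
    uint64_t submit;  /* command submitted to the device     */
    uint64_t start;   /* kernel started executing            */
    uint64_t end;     /* kernel finished executing           */
} event_times_ns;

/* Convert the interval [from, to] from nanoseconds to milliseconds. */
static double ns_to_ms(uint64_t from, uint64_t to)
{
    return (double)(to - from) / 1.0e6;
}

/* (start - queued): the "queue" time reported in the output file. */
double queue_delay_ms(const event_times_ns *t)
{
    return ns_to_ms(t->queued, t->start);
}

/* (end - start): the actual kernel execution time. */
double exec_time_ms(const event_times_ns *t)
{
    return ns_to_ms(t->start, t->end);
}

/* (submit - queued): time the command spent on the host side. */
double host_delay_ms(const event_times_ns *t)
{
    return ns_to_ms(t->queued, t->submit);
}

/* (start - submit): time between submission and start on the device. */
double device_delay_ms(const event_times_ns *t)
{
    return ns_to_ms(t->submit, t->start);
}
```

The last two helpers are not part of my current output. Splitting (start - queued) at the submit timestamp might help show whether the delay is spent before the command reaches the device or after it.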
Note: The OpenMP parameters aren't active right now (NUM_THREAD_ID=0), so only one queue etc. is used.
2nd note: The clWaitForEvents calls are only there to ensure the kernels have finished before the profiling info is read. Removing the events and waits doesn't improve the wall time.
Is there anything I can change in the Enqueue calls, or elsewhere, to effectively reduce these "queue" times?