I need some help understanding the profiling information I have gathered so that I can perform some optimizations. The GPU is an HD 5970 and the OS is Linux. The commands in the experiment are enqueued in this order:
4 non-blocking clEnqueueWriteImage() calls
5 kernel enqueues (kernel1 to kernel5 in the output)
1 non-blocking clEnqueueReadImage() call
All commands are on a single command queue, and the event wait list is empty for every enqueue command. The output below lists the times reported by clGetEventProfilingInfo() for each stage of each command. The times are in nanoseconds, measured from the queueing of the first image write (0 ns).
1. Why does there appear to be a delay of the order of milliseconds between the queueing and submission of commands?
2. Why does there appear to be a similar delay between the submission and start of the commands?
3. Why do some of the commands further down in the list appear to start before commands above them, given that the command queue is in-order?
4. If there is a known issue with these numbers, is it somehow possible to infer the actual values?
COMMAND      QUEUED (ns)  SUBMIT (ns)  START (ns)  END (ns)
ImageWrite1            0       151730    26804937  26805292
ImageWrite2         7982      4357611    26804937  26805292
ImageWrite3        14109      8074416    26804937  26805292
ImageWrite4        20582     11748876    26804937  26805292
kernel1            45609     15382279    26714762  26805292
kernel2            53920     16298516    26350001  26805292
kernel3            61233     16988515    26348697  26805292
kernel4            68034     17475625    26347526  26805292
kernel5            88535     17634632    26615571  26805292
ImageRead1         90667     17790585    25952634  26805292
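
To make questions 1 to 3 concrete, here is a minimal sketch in plain C that computes the per-stage delays from the numbers above and counts how many commands report a START earlier than the command enqueued just before them. It makes no OpenCL calls. The struct and function names (ProfileRow, queue_to_submit, count_start_inversions and so on) are made up for illustration, and the values are simply copied from the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One row of the profiling output. All times are in nanoseconds,
   relative to the QUEUED time of the first image write.
   (Hypothetical type, used only for this illustration.) */
typedef struct {
    const char *name;
    uint64_t queued, submit, start, end;
} ProfileRow;

/* Values copied from the output above, in enqueue order. */
static const ProfileRow rows[] = {
    {"ImageWrite1",     0,   151730, 26804937, 26805292},
    {"ImageWrite2",  7982,  4357611, 26804937, 26805292},
    {"ImageWrite3", 14109,  8074416, 26804937, 26805292},
    {"ImageWrite4", 20582, 11748876, 26804937, 26805292},
    {"kernel1",     45609, 15382279, 26714762, 26805292},
    {"kernel2",     53920, 16298516, 26350001, 26805292},
    {"kernel3",     61233, 16988515, 26348697, 26805292},
    {"kernel4",     68034, 17475625, 26347526, 26805292},
    {"kernel5",     88535, 17634632, 26615571, 26805292},
    {"ImageRead1",  90667, 17790585, 25952634, 26805292},
};
static const size_t row_count = sizeof rows / sizeof rows[0];

/* Delay between consecutive stages of a single command. */
uint64_t queue_to_submit(const ProfileRow *r) { return r->submit - r->queued; }
uint64_t submit_to_start(const ProfileRow *r) { return r->start - r->submit; }
uint64_t start_to_end(const ProfileRow *r)    { return r->end - r->start; }

/* Count commands whose START is earlier than the START of the command
   enqueued immediately before them (unexpected on an in-order queue). */
size_t count_start_inversions(const ProfileRow *r, size_t n)
{
    assert(r != NULL && n > 0);
    size_t inversions = 0;
    for (size_t i = 1; i < n; ++i) {
        if (r[i].start < r[i - 1].start) {
            ++inversions;
        }
    }
    return inversions;
}

/* Print the three per-stage delays for every command. */
void print_report(const ProfileRow *r, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        printf("%-12s queue->submit %10llu ns  submit->start %10llu ns  start->end %8llu ns\n",
               r[i].name,
               (unsigned long long)queue_to_submit(&r[i]),
               (unsigned long long)submit_to_start(&r[i]),
               (unsigned long long)start_to_end(&r[i]));
    }
}
```

Calling print_report(rows, row_count) lists the three delays per command, and count_start_inversions(rows, row_count) gives the number of out-of-order START values that question 3 refers to. All ten commands also report the same END value, which is part of what question 4 is asking about.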