I'm seeing massive delays between a kernel being submitted to an AMD GPU and the kernel actually executing. My program uses blocking writes and reads (blocking = CL_TRUE) to make sure that I/O isn't interfering with the kernel. I then use clGetEventProfilingInfo to get the queued, submit, start and end times of each command. The data (and code) below show that the kernel spends about 3 seconds in the submitted state and then about 3 seconds running. In general, the submitted time seems to scale with the running time. I've read a number of forum posts about delays in kernel execution (for instance, http://devgurus.amd.com/thread/166587), but none of them seems to reach a resolution. I've checked that the GPU is not in low-power mode. Has anyone else seen this, or does anyone have suggestions on how to diagnose it?
All times are in ms, relative to the write's queued timestamp:
write: queued 0.000000 submit 0.023312 start 3296.444778 end 3335.371268 | submitted 3296.421466 running 38.926490
exec: queued 0.021067 submit 78.494703 start 3335.371268 end 6529.140138 | submitted 3256.876565 running 3193.768870
read: queued 0.024849 submit 79.085042 start 6529.140158 end 6578.664028 | submitted 6450.055116 running 49.523870
Overall 6583.000000 ms
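For anyone trying to reproduce these numbers: the "submitted" and "running" columns above match start - submit and end - start for each command. Below is a minimal sketch of that bookkeeping, assuming the four timestamps have already been read with clGetEventProfilingInfo from a queue created with CL_QUEUE_PROFILING_ENABLE. The struct and helper names (event_times, submitted_ns, running_ns, print_event) are hypothetical and not part of OpenCL. uint64_t stands in for cl_ulong (a 64-bit unsigned type) so the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four profiling timestamps (nanoseconds).
 * With OpenCL they would be filled roughly like this (assumed usage):
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
 *                           sizeof(cl_ulong), &t.queued, NULL);
 * and likewise for CL_PROFILING_COMMAND_SUBMIT, _START and _END. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} event_times;

/* Time the command spent between SUBMIT and START (the "submitted" column). */
uint64_t submitted_ns(const event_times *t) {
    assert(t->start >= t->submit); /* guard against unsigned wrap-around */
    return t->start - t->submit;
}

/* Time the command spent between START and END (the "running" column). */
uint64_t running_ns(const event_times *t) {
    assert(t->end >= t->start); /* guard against unsigned wrap-around */
    return t->end - t->start;
}

/* Print one line in the same layout as the log, converting to ms relative to
 * `base` (e.g. the QUEUED time of the first command). Assumes every
 * timestamp is >= base. */
void print_event(const char *label, const event_times *t, uint64_t base) {
    printf("%s: queued %f submit %f start %f end %f | submitted %f running %f\n",
           label,
           (double)(t->queued - base) / 1e6,
           (double)(t->submit - base) / 1e6,
           (double)(t->start - base) / 1e6,
           (double)(t->end - base) / 1e6,
           (double)submitted_ns(t) / 1e6,
           (double)running_ns(t) / 1e6);
}
```

Working in integer nanoseconds and converting to ms only when printing avoids rounding in the subtraction; the assertions simply stop a driver that reports out-of-order timestamps from producing a huge wrapped value.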