It appears that the delay is caused by reading the results back from the GPU. I have removed the read and now the kernel execution only has about 0.2ms delay (which still stinks, but it is much better than 14ms!). How can I get my reads back without any delays?
There are two examples in the SDK, AsyncDataTransfer and TransferOverlap. You should look into them.
I don't think readbacks can be done without delays. But depending on how you can arrange kernel executions, you could do a readback that runs concurrently with a kernel execution.
Say you have kernels A and B.
Execution is A->B->A->B...
B produces results that are read back on the CPU using clEnqueueReadBuffer with blockingRead set to true. Instead, blockingRead can be set to false, which lets you enqueue the read without blocking on the clEnqueueReadBuffer line. A clFlush() may be required just before the next time B is run so that the read is submitted before the buffer is changed again by B. Note that clFlush() only submits the queued commands; to be sure the read has actually finished, wait on the read's event or call clFinish().
I use the above option with separate queues: one for executing kernels and another for reads/writes, with some well-placed clFinish/clFlush calls. That might help the GPU overlap some kernel executions with reads, saving some time.
Yeah, I suppose I should have said that I have three command queues. One is for inputs, one is for kernel execution and one is for outputs.
At the beginning I load many jobs into the input, kernel, and output queues. The inputs have events that the kernels wait on, and each output waits on "an iteration" of the kernels. So let's say I queue up 21 kernels, with a read after every 7th kernel execution, i.e. the reads are queued after the 7th, 14th, and 21st kernels. After the 7th kernel has completed, the read in the output queue starts only after ~7-8 ms. After a period where nothing happens in the kernel queue (e.g., another 7-8 ms), the 8th kernel starts running. The gap between the 7th kernel and the 8th kernel is about 10-14 ms in total. Also, once the 8th kernel finally starts, the output read and the 8th kernel execute at the same time.
Keep in mind that the 8th kernel is queued up at the beginning of this process, so it was enqueued maybe 50 ms earlier. So, in short, I believe I am doing everything that I need to do for concurrent transfers.
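For comparison, here is a rough sketch in plain C (no OpenCL calls, so it runs anywhere) of the timeline one would expect from the setup described above if the queues behaved ideally: each in-order queue starts a command as soon as the previous command in that queue and the event it waits on have both completed. The names `ideal_schedule`, `Span`, `kernel_ms`, and `read_ms` are made up for this illustration. A single fixed duration is assumed for every kernel and every read, and the input transfers are assumed to be finished already. It is only a model to compare measured timestamps against, not a description of what any driver actually does.

```c
#include <assert.h>

#define NUM_KERNELS 21   /* from the post: 21 kernels queued up front */
#define READ_PERIOD 7    /* from the post: one read after every 7th kernel */
#define NUM_READS (NUM_KERNELS / READ_PERIOD)

/* Start and end time of one command, in milliseconds. */
typedef struct {
    double start_ms;
    double end_ms;
} Span;

static double max2(double a, double b) { return a > b ? a : b; }

/*
 * Hypothetical ideal model (an assumption, not measured behavior):
 * a command starts when the previous command in its own queue is done
 * AND the event it waits on is complete. Kernels depend only on the
 * kernel queue; each read waits on the kernel that ends its iteration.
 */
static void ideal_schedule(double kernel_ms, double read_ms,
                           Span kernels[NUM_KERNELS], Span reads[NUM_READS])
{
    double kernel_queue_free = 0.0;  /* when the kernel queue is next idle */
    double output_queue_free = 0.0;  /* when the output queue is next idle */

    assert(kernel_ms >= 0.0 && read_ms >= 0.0);

    for (int i = 0; i < NUM_KERNELS; ++i) {
        kernels[i].start_ms = kernel_queue_free;
        kernels[i].end_ms = kernels[i].start_ms + kernel_ms;
        kernel_queue_free = kernels[i].end_ms;

        /* After every READ_PERIOD-th kernel, a read is released. */
        if ((i + 1) % READ_PERIOD == 0) {
            int r = i / READ_PERIOD;
            reads[r].start_ms = max2(output_queue_free, kernels[i].end_ms);
            reads[r].end_ms = reads[r].start_ms + read_ms;
            output_queue_free = reads[r].end_ms;
        }
    }
}
```

With, say, `kernel_ms = 1.0` and `read_ms = 2.0`, the model puts the start of the 8th kernel exactly at the end of the 7th, with the first read running alongside it. Any gap in the measured timings beyond that is coming from somewhere outside this ideal picture.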
Here are some bonus details: kernel 7 and kernel 8 are different kernels. When running kernel 7 repeatedly with the readback, the delay seems to be somewhat smaller and more consistent. Is there some delay when switching between kernels or something?
Sorry, this reply is probably quite late, but are you still facing the problem? If so, could you please share a reproducible test case?