When running the application timeline trace (via CodeXL 1.5.xxx) it shows that there are huge gaps between the kernel executions (e.g., 10-12 ms). I am working on a real-time application where we're streaming data in and out of the GPU and these gaps are causing the processing to go slower than real-time.
To give you some more information, we're using the Firepro S9150 in a Linux x64 environment. There is a series of 7 kernel executions, the first six provide inputs to the final "stage". After the final stage, the process repeats; this is where the gap usually occurs. The inputs have been copied concurrently during the previous iteration's execution and I can see that the input has completed properly.
The kernel executions are chained together using events. For example, each kernel can't execute before its inputs are complete. The final stage is added to the same queue as the rest of the kernels, so it is the last to execute (thus, all of its inputs are prepared and in memory by the time it starts).
I have looked at the profiler and I don't see any reason why the kernel is failing to launch (or, rather, anything changing that could trigger the launch).
It appears that the delay is caused by reading the results back from the GPU. I have removed the read, and now there is only about a 0.2 ms delay between kernel executions (which still stinks, but it is much better than 14 ms!). How can I get my reads back without any delays?
I don't think read-backs can be done without any delay. But depending on how you arrange your kernel executions, you can overlap a read-back with a kernel execution.
say you have A, and B kernels.
Execution is A->B->A->B.....
B produces results that are read back on the CPU using clEnqueueReadBuffer with blocking_read = CL_TRUE. If blocking_read is set to CL_FALSE instead, the call returns immediately and you don't stall on the clEnqueueReadBuffer line. A clFinish() (or a wait on the read's event) may be required just before the next time B runs, so that the buffer is fully read before B changes it again.
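A minimal sketch of that non-blocking read, assuming `queue`, `kernelB`, `resultBuf`, `gsize`, `nbytes`, and `host_ptr` already exist (all hypothetical names; error checking omitted):

```c
cl_event read_done;

/* Run B, then enqueue a NON-blocking read of its output. */
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, resultBuf, CL_FALSE /* blocking_read */, 0,
                    nbytes, host_ptr, 0, NULL, &read_done);

/* ... enqueue other work that does not touch resultBuf ... */

/* Before B runs again and overwrites resultBuf, make sure the
 * read has actually completed (clFlush alone only submits work,
 * it does not guarantee completion). */
clWaitForEvents(1, &read_done);
clReleaseEvent(read_done);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);
```

The key point is that the host only blocks at the last possible moment, right before the buffer would be overwritten, rather than at the read itself.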
I use the above option with separate queues, one for executing kernels and another for reads/writes, plus some well-placed clFinish/clFlush calls. That can help the GPU overlap kernel executions with reads, saving some time.
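A sketch of the two-queue arrangement, assuming `ctx`, `dev`, `kernelA`, `kernelB`, `resultBuf`, `gsize`, `nbytes`, and `host_ptr` exist (hypothetical names; error checking omitted):

```c
cl_int err;
cl_command_queue kq = clCreateCommandQueue(ctx, dev, 0, &err); /* kernels   */
cl_command_queue tq = clCreateCommandQueue(ctx, dev, 0, &err); /* transfers */

cl_event b_done, read_done;

/* Run B on the kernel queue, signalling b_done when finished. */
clEnqueueNDRangeKernel(kq, kernelB, 1, NULL, &gsize, NULL, 0, NULL, &b_done);
clFlush(kq); /* submit the batch to the device now */

/* Read B's output on the transfer queue, gated on b_done, while the
 * kernel queue keeps working on the next iteration. */
clEnqueueReadBuffer(tq, resultBuf, CL_FALSE, 0, nbytes, host_ptr,
                    1, &b_done, &read_done);
clFlush(tq);

/* Next iteration's kernel overlaps with the transfer above. */
clEnqueueNDRangeKernel(kq, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);

/* Only block when the host actually needs the data. */
clWaitForEvents(1, &read_done);
clReleaseEvent(read_done);
clReleaseEvent(b_done);
```

Because the two in-order queues are independent, the event dependency (rather than a host-side clFinish) is what orders the read after B, which is what allows the overlap.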
Yeah, I suppose I should have said that I have three command queues: one for inputs, one for kernel execution, and one for outputs.
At the beginning I load many jobs into the input, kernel, and output queues. The inputs have events that cause the kernels to wait, and each output waits on one full "iteration" of the kernels. So let's say I queue up 21 kernels and read after every 7th kernel execution; the reads are queued after the 7th, 14th, and 21st kernel executions. After the 7th kernel has completed, the read starts about 7-8 ms later in the output queue. Then, after another period where nothing happens in the kernel queue (another 7-8 ms), the 8th kernel starts running. The gap between the 7th and 8th kernels is about 10-14 ms in total. Also, once the 8th kernel finally starts, the output transfer and the 8th kernel do execute at the same time.
Keep in mind that the 8th kernel was queued up at the beginning of this process, so maybe 50 ms earlier. So, in short, I believe I am doing everything I need to do for concurrent transfers.
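For reference, a sketch of one iteration in the three-queue setup described above, with event dependencies (rather than host-side waits) linking the queues. All names (`ctx`, `dev`, `stage[]`, `finalStage`, buffers, sizes) are hypothetical and error checking is omitted:

```c
cl_int err;
cl_command_queue inq  = clCreateCommandQueue(ctx, dev, 0, &err); /* inputs  */
cl_command_queue kq   = clCreateCommandQueue(ctx, dev, 0, &err); /* kernels */
cl_command_queue outq = clCreateCommandQueue(ctx, dev, 0, &err); /* outputs */

cl_event in_done, iter_done;

/* Input for this iteration: non-blocking write, signals in_done. */
clEnqueueWriteBuffer(inq, inBuf, CL_FALSE, 0, in_bytes, in_ptr,
                     0, NULL, &in_done);

/* The first six kernels wait on the input; the final stage is last
 * in the same in-order queue, so its inputs are ready when it runs. */
for (int i = 0; i < 6; ++i)
    clEnqueueNDRangeKernel(kq, stage[i], 1, NULL, &gsize, NULL,
                           1, &in_done, NULL);
clEnqueueNDRangeKernel(kq, finalStage, 1, NULL, &gsize, NULL,
                       0, NULL, &iter_done);

/* The read-back waits only on the iteration's last kernel, so the
 * next iteration's kernels can start in kq while outq transfers. */
clEnqueueReadBuffer(outq, outBuf, CL_FALSE, 0, out_bytes, out_ptr,
                    1, &iter_done, NULL);

/* Flush every queue so nothing sits unsubmitted on the host side. */
clFlush(inq); clFlush(kq); clFlush(outq);
```

If a setup like this still shows a multi-millisecond gap between the last kernel of one iteration and the first kernel of the next, the stall is likely happening in the runtime or driver rather than in the enqueue ordering.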
Here are some bonus details: kernel 7 and kernel 8 are different kernels. When I run kernel 7 repeatedly with the read-back, the delay seems somewhat smaller and more consistent. Is there some delay when switching between kernels?
Sorry, this reply might be quite late, but I just want to know whether you're still facing this problem. If so, could you please share a reproducible test case?