I am using OpenCL to perform some real-time data processing. Each iteration, I copy the new data to the GPU, queue a series of kernels, and copy the results back to the CPU when the kernels are finished. However, there is a large delay between the kernels being added to the queue and the kernels starting to run on the GPU. This delay is killing any real-time performance.
I am using a Radeon HD 6320 Fusion running on 64-bit Ubuntu 12.04. As my code is tightly integrated with a large existing software base, I can't easily share it, but I have quickly put together a simple example that has roughly the same structure as my real code and that shows the delays. Note that the example doesn't do anything meaningful; the kernels are just doing busy work. The example code is attached to the post.
A screenshot from CodeXL showing what is happening is below:
The blocking call to clEnqueueReadBuffer takes 27.8 ms to return, yet the kernels take only 5 ms in total to run, with 18 ms elapsing between the call to clEnqueueReadBuffer and the first kernel starting execution.
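For anyone wanting to reproduce these numbers without a profiler, here is a minimal sketch of how the same breakdown could be computed from OpenCL event profiling timestamps. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the four values were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END, all in nanoseconds). The struct and helper names are hypothetical and not part of my attached example, and the code deliberately uses plain integers so it builds without the OpenCL headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four per-command timestamps, in
 * nanoseconds, as clGetEventProfilingInfo would report them. */
typedef struct {
    uint64_t queued;  /* command enqueued by the host        */
    uint64_t submit;  /* command submitted to the device     */
    uint64_t start;   /* command started executing on device */
    uint64_t end;     /* command finished executing          */
} cmd_times_ns;

/* Time the command sat in the host-side queue before submission. */
static double host_queue_ms(const cmd_times_ns *t)
{
    return (double)(t->submit - t->queued) / 1e6;
}

/* Total delay from enqueue until the device actually started work. */
static double launch_delay_ms(const cmd_times_ns *t)
{
    return (double)(t->start - t->queued) / 1e6;
}

/* Time the command spent executing on the device. */
static double exec_ms(const cmd_times_ns *t)
{
    return (double)(t->end - t->start) / 1e6;
}

/* Print a one-line breakdown for a single command. */
static void print_breakdown(const char *label, const cmd_times_ns *t)
{
    printf("%s: host queue %.3f ms, launch delay %.3f ms, exec %.3f ms\n",
           label, host_queue_ms(t), launch_delay_ms(t), exec_ms(t));
}
```

Calling print_breakdown once per kernel event would show whether the time is lost before submission (QUEUED to SUBMIT) or between submission and execution (SUBMIT to START), which the single 18 ms figure above does not distinguish.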
I have also profiled this code using an Nvidia NVS 300 GPU, which is approximately equivalent in power. The kernels themselves take around the same time to run as on the ATI card (as expected), but there are almost no delays in the queue on the Nvidia chip, as can be seen in the screenshot from the Nvidia profiler below:
As it is, the delay on the ATI chip prevents it from being used for any kind of real-time processing, whereas the Nvidia card has no problems at all. Is there any way to prevent this large delay from occurring?