I am using OpenCL to perform some real time data processing. Each iteration, I copy the new data to the GPU, queue a series of kernels, and copy the results back to the CPU when the kernels are finished. However, I am having a problem: there is a large delay between the kernels being added to the queue and the kernels starting to run on the GPU. This delay is killing any real time performance.
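To be concrete, each iteration has roughly this structure (a minimal sketch with hypothetical names — queue, kernels, and buffers are created once during setup; error checking omitted):

```c
/* Upload this iteration's data (non-blocking). */
clEnqueueWriteBuffer(queue, d_input, CL_FALSE, 0, nbytes, h_input, 0, NULL, NULL);

/* Queue the processing kernels back to back. */
clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, &global_size, NULL, 0, NULL, NULL);

/* Blocking read: returns only once all preceding commands have finished. */
clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, nbytes, h_output, 0, NULL, NULL);
```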
I am using a Radeon HD 6320 Fusion running on 64 bit Ubuntu 12.04. As my code is tightly integrated with a large existing software base, I can't easily share it, but I have very quickly put together a simple example that has roughly the same structure as my real code and that shows the delays (note that the example doesn't actually do anything at all meaningful and the kernels are just doing busy work). The example code is attached to the post.
A screenshot from CodeXL showing what is happening is below:
The blocking call to clEnqueueReadBuffer takes 27.8ms to run, yet the kernels only take 5ms in total to run, with 18ms elapsing between the call to clEnqueueReadBuffer and the first kernel starting execution.
I have also profiled this code using an Nvidia nvs300 GPU, which is approximately equivalent in power. The kernels themselves take around the same time to run as on the ATI card (as expected), but there are almost no delays in the queue on the Nvidia chip, as can be seen in the screenshot from the Nvidia profiler below:
As it is, the delay on the ATI chip prevents it being used for any kind of real time processing, when the Nvidia card has no problems at all. Is there any way to prevent this large delay from occurring?
Use non-blocking reads/writes. Because you use a blocking write, the call must synchronize with the GPU, which is an expensive operation. If you remove the blocking reads/writes, you should get rid of the delays.
Thanks. The write shouldn't be blocking, but changing it makes no difference to the delay. My problem, though, is that I need the results of the computation copied back to the CPU before I can proceed with the next iteration. This means that I either need to use a blocking read, or a non-blocking read followed by clFinish. Either way, the delay is unchanged.
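For reference, the two equivalent patterns I mean are (sketch, hypothetical names; both make the host wait for the results before the next iteration):

```c
/* Option 1: blocking read. */
clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, nbytes, h_output, 0, NULL, NULL);

/* Option 2: non-blocking read, then wait for the whole queue to drain. */
clEnqueueReadBuffer(queue, d_output, CL_FALSE, 0, nbytes, h_output, 0, NULL, NULL);
clFinish(queue);
```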
18ms is an absolutely enormous amount of time for synchronization on the ATI card (sometimes significantly longer if the kernels are more complicated - the delay seems to have some relationship with the execution time of the kernels), especially as the same operation on a similar Nvidia card is at least 100 times faster. If synchronization really is this expensive, then that is extremely poor.
I increased the number of iterations in my sample code up to 1000 and profiled it once with blocking reads/writes, and once with non-blocking reads/writes (ignoring the incorrect data being copied in the non-blocking case).
The profiling results for blocking reads/writes are:
Zoomed in, there is a delay of approximately 18ms between each iteration (as before). The program takes 22 seconds from the first item being queued to the last item in the queue being run. The profiling results when the reads/writes are changed to non-blocking are:
There is now no delay between kernels once the first one starts executing, but it takes a whopping 12 seconds (!!!!) between the first kernel being queued and the first kernel being run. Overall, the program takes 22.7 seconds from the first kernel being queued to the last one in the queue being run. Therefore, removing the blocking reads/writes changes where the delays are, but doesn't change the overall magnitude of the delays. Again, these delays simply don't exist on Nvidia hardware.
I was able to get a significant improvement in the overall runtime by placing a clFlush() at the end of each iteration (the overall runtime was 12 seconds, compared to 22 seconds before). This caused the first kernel to start executing reasonably quickly, but introduced delays between each iteration, even with non-blocking reads/writes. Unfortunately, the delays between each iteration while the queue was still being changed were around 20ms (despite non-blocking reads/writes), which makes this no better than the original blocking calls. Only when the queue wasn't being changed did the delays between each iteration decrease to around 6ms.
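The clFlush variant I tried looks roughly like this (sketch with hypothetical names; clFlush pushes the queued batch to the device without blocking the host, and a single clFinish at the end waits for everything):

```c
for (int i = 0; i < iterations; ++i) {
    clEnqueueWriteBuffer(queue, d_input, CL_FALSE, 0, nbytes, h_input, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, d_output, CL_FALSE, 0, nbytes, h_output, 0, NULL, NULL);
    clFlush(queue);   /* submit this iteration's batch without waiting */
}
clFinish(queue);      /* wait once for the whole queue to drain */
```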
Why do you think that incorrect data will be copied? If you enqueue the copy/write before the kernel, it will copy before the kernel executes (unless you enable an out-of-order queue, but AMD doesn't support that).
Also, what you see is most likely a bug in the profiler. I have found the profiler to be unreliable. For example, I found that with two queues on a single GPU it shows parallel execution of kernels, even though the GPU can't execute them in parallel. I have seen this double-time issue too. I recommend forgetting about profiling and measuring execution times with normal timers. Profiling introduces tons of weird bugs.
That user profiled submit-to-start latencies and found that the 13.4 driver produces much better times.
You can probably run that user's sample code and see what the latencies are on your system.
That will make a good test case.
What is the actual complaint here? Is it start up time, execution time, or copy time?
Looking at your second profiling run, something seems broken as it's not possible for all the kernel executions and copies to happen in clReleaseCommandQueue if you are using blocking copies.
When I run your test in Windows, I see about 500ms of startup time before the first kernel is executed. Gaps between batches are around 100us. Also, the example code I am looking at doesn't use blocking copies, but does have clFinish() after each clEnqueueReadBuffer(), so there's no real difference between that and a blocking copy except for the extra API call.
There are some other problems here. You are calling clEnqueueWriteBuffer and clEnqueueReadBuffer without using pre-pinned host memory. This means the runtime has to pin and unpin the host memory each time. (We optimize to only pin and unpin a chunk of memory once per batch.) For small transfers we avoid the extra pin/unpin but it means doing two copies. I recommend that you change your upload/download paths to use the prepinned copy path as documented in the SDK docs. The basic gist is that you create a buffer with CL_MEM_ALLOC_HOST_PTR and use CL_MEM_READ_ONLY or CL_MEM_WRITE_ONLY for readback and upload buffers respectively. Then you can Map() the buffer for free and use that host ptr for your clEnqueueReadBuffer and clEnqueueWriteBuffer commands. You can even update the memory directly in that buffer (for the upload), avoiding the need for a memcpy to it.
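A sketch of that pre-pinned path (hypothetical names; error checking omitted - the staging buffer is created with CL_MEM_ALLOC_HOST_PTR, mapped once, and its host pointer is then reused for every transfer):

```c
/* One-time setup: host-visible staging buffer for uploads. */
cl_int err;
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                               nbytes, NULL, &err);
void *host_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                    0, nbytes, 0, NULL, NULL, &err);

/* Each iteration: write the new data directly into the pinned memory
 * (no extra memcpy needed if it is produced in place), then upload. */
memcpy(host_ptr, new_data, nbytes);
clEnqueueWriteBuffer(queue, d_input, CL_FALSE, 0, nbytes, host_ptr,
                     0, NULL, NULL);
```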
Thank you for the reply. I was using 13.3. Unfortunately though, updating to 13.4 made no difference to the latencies.
I tried the sample code from the link on my system and got:
total time   queue->submit   submit->start   start->end
0.853 ms     0.020 ms        0.772 ms        0.009 ms
0.251 ms     0.010 ms        0.198 ms        0.004 ms
0.237 ms     0.019 ms        0.183 ms        0.004 ms
From my own testing, the submit->start time increases significantly with the number of kernels in the queue and the complexity of the kernels.
Thank you for the reply.
My problem is the delay between kernels being added to the queue, and the kernels being executed.
My second profiling run used non-blocking copies the entire time.
I'm not concerned about the initial startup time, but I am concerned about the gap between batches. On my system this gap is around 18ms (as opposed to your 100us). 100us is similar to what I get on my Nvidia card and is perfectly reasonable. 18ms is far too long, however. This leads me to think that there might be a bug in the runtime environment with my chip and Ubuntu.
I have tried all the various ways to allocate and copy memory, including what you suggest, but this makes no noticeable difference to the delays - any slight speedup in the copying of memory is dwarfed by the submit->start delay.