I am using OpenCL to perform some real time data processing. Each iteration, I copy the new data to the GPU, queue a series of kernels, and copy the results back to the CPU when the kernels are finished. However, I am having problems with there being a large delay between the kernels being added to the queue, and the kernels starting to run on the GPU. This delay is killing any real time performance.
I am using a Radeon HD 6320 Fusion running on 64 bit Ubuntu 12.04. As my code is tightly integrated with a large existing software base, I can't easily share it, but I have very quickly put together a simple example that has roughly the same structure as my real code and that shows the delays (note that the example doesn't actually do anything at all meaningful and the kernels are just doing busy work). The example code is attached to the post.
A screenshot from CodeXL showing what is happening is below:
The blocking call to clEnqueueReadBuffer takes 27.8ms to run, yet the kernels only take 5ms in total to run, with 18ms elapsing between the call to clEnqueueReadBuffer and the first kernel starting execution.
I have also profiled this code using an Nvidia nvs300 GPU, which is approximately equivalent in power. The kernels themselves take around the same time to run as on the ATI card (as expected), but there are almost no delays in the queue on the Nvidia chip, as can be seen in the screenshot from the Nvidia profiler below:
As it is, the delay on the ATI chip prevents it being used for any kind of real time processing, when the Nvidia card has no problems at all. Is there any way to prevent this large delay from occurring?
use unblocking read/write. as you use blocking write call it must synchronize GPU which is expensive operation. if you remove blocking read/writes you get rid of delays.
I increased the number of iterations in my sample code up to 1000 and profiled it once with blocking read/writes, and once with non blocking read/writes (ignoring incorrect data being copied in the case of the non blocking read/writes).
The profiling results for blocking reads/writes are:
Zoomed up, there is a delay of approx 18ms between each iteration (as before). The program takes 22 seconds from the first item being queued to the last item in the queue being run. The profiling results when the read/writes are changed to non-blocking are:
There is now no delay between kernels once the first one start executing, but it takes a whooping 12 seconds (!!!!) between the first kernel being queued and the first kernel being run. Overall, the program takes 22.7 seconds from the first kernel being queued to the last one int he queue being run. Therefore, removing the blocking read/writes changes where the delays are, but doesn't change the overall magnitude of the delays. Again, these delays simply don't exist on Nvidia hardware.
I was able to get a slight improvement in the overall runtime by placing a clFush() at the end of each iteration (the overall runtime was 12 seconds, compared to 22 seconds before). This caused the first kernel to start executing reasonably quickly, but introduced delays between each iteration, even with non-blocking reads/writes. Unfortunately, the delays between each iteration while the queue was still being changed were around 20ms (despite non blocking reads/writes), which makes this no better than the original blocking calls. Only when the queue wasn't being changed did the delays between each iteration decrease to around 6ms.
Why do you think that incorrect data will be copied? If you enqueue copy/write before kernel it will copy before kernel execute. (unless you enable out of order queue but AMD doesn'ŧ support that)
also what you see is most likely bug in profiler. I found that profiler is unreliable. For example I found out that with two queues on single GPU it shows parallel execution of kernels even though GPU can't execute them in parallel. I saw this double time issue too. I recommend forget about profiling and measure execution times with normal timers. Profiling introduce tons of weird bugs.
nou has pointed out the right thing. You may also want to check
http://devgurus.amd.com/message/1296694#1296694 to get better performance while data transfer and kernel execution.
Thanks. The write shoudn't be blocking, but changing it makes no difference to the delay. My problem is though that I need the results of the computation copied back to the CPU before I can proceed with the next iteration. This means that I either need to use a blocking read or use an unblocking read and clFinish. Either way, the delay is unchanged.
18ms is an absolutely ginormous amount of time needed for synchronization on the ATI card (sometimes if the kernels are more complicated, significantly longer - the delay seems to perhaps have some relationship with the execution time of the kernels), especially as the same operation on a similar Nvidia card is at least 100 times faster. If this is true, then that is extremely poor.
Check this http://devgurus.amd.com/message/1296798#1296798
The user has profiled SuBmit to Start Latencies and found that 13.4 driver produces much better time.
You can probably run the sample code of that user and see what latencies are there in your system.
That will make a good test case.
Thank you for the reply. I was using 13.3. Unfortunately though, updating to 13.4 made no difference to the latencies.
I tried the sample code from the link on my system and got:
total time queue->submit submit->start start->end
0.853 ms 0.020 ms 0.772 ms 0.009 ms
0.251 ms 0.010 ms 0.198 ms 0.004 ms
0.237 ms 0.019 ms 0.183 ms 0.004 ms
From my own testing, the submit->start time increases significantly with the number of kernels in the queue and the complexity of the kernels.
Submit-start latency is bound to increase if your kernels keep waiting in the queue for the GPU because some other kernel is using the GPU... What we should look for is the submit-start latency of 1 kernel launch....
From the numbers, you have posted, I see it is around 200 microsecs. Quite on the high side.
In the thread I referred above, we are getting around 60 to 70 microsecs for HD 7770.
However, you seem to be using an APU model. *HD 6320 Fusion" - Is this a laptop/netbook (or) a desktop?
I am not sure if this is an APU specific issue.
Can you confirm so that I can test this here?
When I say the submit->start latency increases with the number of kernels in the queue, I mean the latency from the first kernel being submitted to that first kernel being started on the GPU increases if several kernels are added to the queue at once.
I am using a laptop APU (AMD G-T56N)
Hmm...Interesting... So, How many CPU cores does your platform have? A "clFlush()" at end of loop may help your timing when multiple kernels are queued up,
Also, A quick search on your platform landed me on AMD Embedded G-series page: http://www.amd.com/us/products/embedded/processors/Pages/g-series.aspx
These CPUs are meant for industrial automation/thin clients/Point of sale kiosks. Not sure, who has manufactured a laptop using that.
Can you tell me the laptop model and the manufacturer name?
I am developing a hunch that it is the low-power embedded single-core system that is the reason for the slowness.
It has two CPU cores. A clFlush at the end of the loops only helps when no more kernels are being added to the queue.
I have seen the Fusion line in low cost/low power laptops. The GPUs fitted on the chip are comparable, or at least should be comparable, to standard integrated graphics chips. The particular model I have is: http://www.fit-pc.com/web/fit-pc/fit-pc3-info/#Specifications
Are you sure your laptop is running in full power mode? I hope some power-saving is not slowing things down.
I hope your tests are run when the laptop is connected to the AC outlet.
Yes, it is connected to AC and power saving is turned off
Just to confirm that APU A6-4455M with HD 7500G GPU integrated observed the same phenomenon.
The delay between submit->start is extremely high, even longer than the kernel execution time itself under Kubuntu 12.10.
However, I have a windows port of my OpenCL kernel, I also test it under windows, and find that this large delay is greatly mitigated under windows 7.
Inserted Image is the codeXL snapshot under windows.
The submit->start delay for my kernel(green part) is almost 2 ms, which is also a lot. While the equivalence delay under kubuntu can increase to 7 ms.
Another thing confused me is that the same kernel has around 2 times performance difference between kubuntu and windows7. the kernel execution time in the figure is 10 ms
while the one under kubuntu is only 6ms.
I check the frequency of my GPU running under kubuntu and windows 7, both of them will boost to 423MHz. The driver of these two systems differ so much?
What is the actual complaint here? Is it start up time, execution time, or copy time?
Looking at your second profiling run, something seems broken as it's not possible for all the kernel executions and copies to happen in clReleaseCommandQueue if you are using blocking copies.
When I run your test in Windows, I see about 500ms of startup time before the first kernel is executed. Gaps between batches are around 100us. Also, the example code I am looking at doesn't use blocking copies, but does have clFinish() after each clEnqueueReadBuffer(), so there's no real difference between that and a blocking copy except for the extra API call.
There are some other problems here. You are calling clEnqueueWriteBuffer and clEnqueueReadBuffer without using pre-pinned host memory. This means the runtime has to pin and unpin the host memory each time. (We optimize to only pin and unpin a chunk of memory once per batch.) For small transfers we avoid the extra pin/unpin but it means doing two copies. I recommend that you change your upload/download paths to use the prepinned copy path as documented in the SDK docs. The basic gist is that you create a buffer with CL_MEM_ALLOC_HOST_PTR and use CL_MEM_READ_ONLY or CL_MEM_WRITE_ONLY for readback and upload buffers respectively. Then you can Map() the buffer for free and use that host ptr for your clEnqueueReadBuffer and clEnqueueWriteBuffer commands. You can even update the memory directly in that buffer (for the upload), avoiding the need for a memcpy to it.
Thank you for the reply.
My problem is the delay between kernels being added to the queue, and the kernels being executed.
My second profiling run used unblocking copies the entire time.
I'm not concerned about the initial startup time, but I am concerned about the gap between batches. On my system this gap is around 18ms (as opposed to your 100us). 100us is similar to what I get on my Nvidia card is is perfectly reasonable. 18ms is far too long however. This leads me to think that there might be a bug in the runtime environment with my chip and Ubuntu.
I have tried all the various ways to allocate and copy memory, including what you suggest, but this makes no noticeable difference to the delays - any slight speedup in the copying of memory is dwarfed by the submit->start delay.
Are you just looking at the submit->start delay for the first batch? Nothing will be executed on the GPU until the command buffer gets flushed. Otherwise, it looks like you are not flushing the commands to the GPU and they are just sitting in a buffer.
The submit->start delay is for every batch, as described in previous posts. The 18ms delay I am talking about is when I am using blocking writes and reads. As I understand the OpenCL specs, blocking writes/reads should flush the buffer. In any case, adding clFlushes makes no difference. I described what happens with non blocking reads/writes and flushing the buffer in an above post.
Thank you very much for your help! The output from your code on my machine is:
Average Launch Time for WriteBuffer Command(ms):0.722881
Average Launch Time for kernel1 Command(ms):8.09211
Average Launch Time for kernel2 Command(ms):23.3399
Average Launch Time for ReadBuffer Command(ms):0.577266
For a comparison, I also ran your changes on my machine with a comparable very low end nvidia card and got:
Average Launch Time for WriteBuffer Command(ms):0.135899
Average Launch Time for kernel1 Command(ms):0.150439
Average Launch Time for kernel2 Command(ms):0.208915
Average Launch Time for ReadBuffer Command(ms):0.144498
well this was interesting. Looks like these unusual delay is happening only with your APU. I have seen numbers better than nvidia's at my end, for discrete GPUs.
Also I checked http://www.fit-pc.com/web/fit-pc/fit-pc3-info/#Specifications, and it looks something in the lines of raspberry pi processors. I am not sure, if i will be able to arrange it here. Anyways will try.
You can probably check with the fit-pc, in case they can help. Another suggestion can be to try the performance with a older driver (probably 12.10). Note these are just guesses.
Thank you again for your assistance. I would expect numbers better than my nvidia results as my nvidia card is about the worst card nvidia makes that supports OpenCL. I tried several previous drivers including 12.10 and there was no difference at all in the delay. I have asked fit-pc to see if they can help.
I am interested to find out if this is a problem with some AMD APUs in general or if it is something specific about the fit-pc hardware that is causing it.
I will try to connect to some people. But IMHO such complaints should actually reach AMD from the manufacturer (fit-pc). Anyways, will let you know if i hear something useful. Thanks