I have recently moved an OpenCL application from an NVIDIA GPU to a Radeon HD 6320 Fusion running on Ubuntu 12.04, and it is unexpectedly running significantly slower.
My program copies a very large data structure to the GPU on setup (this data structure is never read or accessed by the CPU again); then, each frame, it copies new input data to an existing buffer, queues several kernels, and reads back the result.
After profiling both GPUs, the delay on the ATI GPU seems to come entirely from the gap between the first kernel being added to the queue (CL_PROFILING_COMMAND_QUEUED) and the first kernel starting execution (CL_PROFILING_COMMAND_START). On the NVIDIA GPU, this takes a few microseconds each iteration. On the ATI GPU, it takes around 20ms each iteration, which is far too long for my use.
Is there any reason why I could be getting this large delay?
Would you mind posting your code?
Posted in reply to yurtesen
Let me take a look.
Things take slightly longer when you have OpenCL profiling enabled. Did you try disabling the profiling and measuring speed using wall time? I guess the best way to work out what is going on is to see some sample code; there are just so many ways to create and copy buffers, etc.
Also, in the past I had problems where the OpenCL profiling timers were not as accurate as advertised, so I wouldn't completely trust them. You might want to try a tool like AMD CodeXL as well and see whether you get similar results from all your tests with different tools.
I have disabled profiling - it only makes a small difference. Unfortunately, the AMD CodeXL tool looks like it only supports 64-bit Linux.
It is a bit hard to post my code as it is fairly tightly embedded in a much larger project, so I have copied the small part of it that contains the buffer creation and copying, and the kernel queuing. The updateTrack function gets called at around 30 FPS to process new data. It copies the new data to an existing buffer, then queues several kernels and a buffer read. My delay is between the first of the kernels being added to the queue and the first kernel starting to execute. This delay occurs on each iteration of the while loop.
You could just use CL_TRUE with the final read and remove the queue finish() - although I doubt that would make any measurable difference.
FWIW, I've seen excessively long queueing overheads with AMD hardware, much more so when any synchronisation occurs. I haven't found a reason or solution for this.
Kernel overlapping doesn't work with profiling on. You should disable profiling and rerun.
Disabling profiling makes no difference unfortunately.
I am surprised that you are getting correct results, since you don't seem to be waiting for your kernels to finish before reading the results with clEnqueueReadBuffer. There might be kernel-related reasons for it running slower as well: kernels optimised for NVIDIA GPUs often run slower on AMD GPUs, and vice versa. As you can imagine, it is difficult to say what is going on if we can't get our hands on a working test program. Perhaps you might try to produce a small test case which can reproduce the problem?
I didn't add the flag to allow out-of-order execution when the command queue was created, so as I understand it, the memory read should happen after the kernels are finished, since the clEnqueueReadBuffer is added to the queue last.
The slowness isn't from the kernels themselves - it is between the kernels being added to the queue and the first kernel executing. Each kernel takes around 0.5-1ms to execute, but it takes around 20ms for the first kernel to start running.
It is not out-of-order execution if you start executing a kernel and then start reading values from the GPU before waiting for the kernel to finish. The events will still be executed in the same order.
Kernel execution and data transfers can overlap even when the operations are done in order; asynchronous data transfer and kernel execution can still happen with an in-order queue (AFAIK).
Well, I can't really comment on why there is slowness with the information you have given. As I mentioned, OpenCL profiling information is not always reliable; I wouldn't trust it.
That is directly the opposite of what the OpenCL specification states - unless the out of order flag is set, one command must finish before the next one in the queue starts. http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html
If ATI's profiler is inaccurate enough to mislead me about what causes the slowdown, then it is totally useless. Surely the profiler can't report each kernel taking around 0.5ms, and a 20ms setup time for the first kernel, when in reality each kernel is taking 5ms.
I'm not sure how the delay could be caused by anything not in the code I supplied. My suspicion is that it has something to do with the large buffer created at the start being copied or moved or something whenever kernels are added to the queue. Are there any differences in how AMD stores buffers when they are not being used, particularly on the Fusion chips, compared to NVIDIA?
adrianr wrote:
That is directly the opposite of what the OpenCL specification states - unless the out of order flag is set, one command must finish before the next one in the queue starts. http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html
You might be right - maybe I remembered it wrong. Sorry about that.
Actually, I accidentally stumbled upon something related to this: see page 1-20.
Also, the example in the OpenCL pages talks about objects shared between two kernels, not one kernel plus an enqueued read (although I am not sure if that was intentional or just a coincidence).
So, once more, I am not sure whether you are right or I am, but it might be good to be on the safe side...
Profiling is quite unreliable - see http://devgurus.amd.com/thread/159926
Yeah - not so useful: