
adrianr
Journeyman III

Delay between OpenCL kernel being queued and being run

I have recently moved an OpenCL application from an NVIDIA GPU to a Radeon HD 6320 Fusion running on Ubuntu 12.04, and it is unexpectedly running significantly slower.

On setup, my program copies a very large data structure to the GPU (this data structure is never read or accessed by the CPU again), and then it:

  1. Queues several kernels and a read buffer (to copy a very small data structure back to main memory).
  2. Calls clFinish to wait for the kernels and the read buffer to complete.
  3. Repeats this continually, occasionally copying some extra data depending on what the read buffer returns (so the read buffer has to complete before the next round of kernels can be added to the queue).

After profiling both GPUs, the delay on the ATI GPU seems to come entirely from the time between the first kernel being added to the queue (CL_PROFILING_COMMAND_QUEUED) and the first kernel starting execution (CL_PROFILING_COMMAND_START). On the NVIDIA GPU, this takes a few microseconds each iteration. On the ATI GPU, this takes around 20ms each iteration, which is far too long for my use.
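
For readers who want to split that delay further, here is a minimal sketch in C, not the code from my project. It assumes the four nanosecond timestamps have already been read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END) from an event on a queue created with CL_QUEUE_PROFILING_ENABLE. The type and function names (event_times_ns, event_breakdown_ms, breakdown) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four per-event profiling timestamps,
   in nanoseconds, as reported by clGetEventProfilingInfo. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times_ns;

/* The three intervals between those timestamps, in milliseconds. */
typedef struct {
    double queued_to_submit_ms;
    double submit_to_start_ms;
    double start_to_end_ms;
} event_breakdown_ms;

static double ns_to_ms(uint64_t ns) { return (double)ns / 1e6; }

/* Splits one event's lifetime into its three phases. */
static event_breakdown_ms breakdown(event_times_ns t) {
    /* The timestamps are expected to be non-decreasing; if this fires,
       the profiling data itself is suspect. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    event_breakdown_ms b;
    b.queued_to_submit_ms = ns_to_ms(t.submit - t.queued);
    b.submit_to_start_ms  = ns_to_ms(t.start - t.submit);
    b.start_to_end_ms     = ns_to_ms(t.end - t.start);
    return b;
}
```

A large queued-to-submit interval would suggest the time is spent on the host side before the command reaches the device. A large submit-to-start interval would suggest the command is waiting on the device side. This is only as trustworthy as the timestamps themselves.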

Is there any reason why I could be getting this large delay?

0 Likes
16 Replies
binying
Challenger

Would you mind posting your code?

0 Likes

Posted in reply to yurtesen

0 Likes

Let me take a look.

0 Likes
yurtesen
Miniboss

Things take slightly longer when you have OpenCL profiling enabled. Did you try disabling profiling and measuring speed using wall time? I guess the best way to determine what is going on is to see some sample code. There are just so many ways to create and copy buffers, etc.

Also, in the past I had problems where the OpenCL profiling timers were not as accurate as advertised, so I wouldn't completely trust them. You might want to try a tool like AMD CodeXL as well and see whether you get similar results from the different tools.
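
As a rough illustration of the wall-time approach, here is a minimal sketch in C. It uses the C11 function timespec_get; on Linux, clock_gettime with CLOCK_MONOTONIC would be a reasonable alternative. The names elapsed_ms, time_iteration_ms and noop_iteration are hypothetical, and the callback stands in for whatever enqueues the kernels and the read and then calls clFinish.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference between two timestamps, in milliseconds. */
static double elapsed_ms(struct timespec begin, struct timespec end) {
    return (double)(end.tv_sec - begin.tv_sec) * 1e3
         + (double)(end.tv_nsec - begin.tv_nsec) / 1e6;
}

/* Times a single call of `iteration` with the host wall clock.
   In the real program the callback would enqueue the kernels and the
   read buffer and then block (e.g. with clFinish), so the measurement
   covers the whole round trip without relying on OpenCL profiling. */
static double time_iteration_ms(void (*iteration)(void *), void *arg) {
    struct timespec begin, end;
    assert(iteration != NULL);

    timespec_get(&begin, TIME_UTC);
    iteration(arg);
    timespec_get(&end, TIME_UTC);
    return elapsed_ms(begin, end);
}

/* Placeholder workload so the sketch compiles on its own. */
static void noop_iteration(void *arg) { (void)arg; }
```

Calling time_iteration_ms once per frame with profiling disabled, and comparing the result against the sum of the per-kernel times, should show whether the extra time is really there.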

0 Likes

I have disabled profiling - it only makes a small difference. Unfortunately it looks like the AMD CodeXL tool only supports 64-bit Linux.

It is a bit hard to post my code as it is fairly tightly embedded in a much larger project, so I have copied the small part of it that contains the buffer creation and copying, and the kernel queuing. The updateTrack function gets called at around 30 FPS to process new data. It copies the new data to an existing buffer, then queues several kernels and a read buffer. My delay is between the first of the kernels being added to the queue and the first kernel starting to execute. This delay occurs on each iteration of the while loop.

0 Likes

You could just use CL_TRUE for the final read and remove the queue finish() - although I doubt that would make any measurable difference.

FWIW I've seen excessively long queueing overheads with AMD hardware, much more so when any synchronisation occurs. I haven't found a reason or solution for this.

0 Likes
mosheg
Journeyman III

Kernel overlapping doesn't work with profiling on. You should disable profiling and rerun.

0 Likes

Disabling profiling makes no difference unfortunately.

0 Likes
yurtesen
Miniboss

I am surprised that you are getting the right results, since you don't seem to be waiting for your kernels to finish running before reading the results with clEnqueueReadBuffer. There might be kernel-related reasons for it running slower as well. Often kernels optimized for NVIDIA GPUs run slower on AMD GPUs, and vice versa. As you can imagine, it is difficult to say what is going on if we can't get our hands on a working test program. Perhaps you could try to produce a small test case that reproduces the problem?

0 Likes

I didn't add the flag to allow out-of-order execution when the command queue was created, so as I understand it the memory read should happen after the kernels have finished, since the clEnqueueReadBuffer is added to the queue last.

The slowness isn't from the kernels themselves - it is between the kernels being added to the queue and the first kernel executing. Each kernel is taking around 0.5-1ms to execute, but it takes around 20ms for the first kernel to run.

0 Likes

It is not out-of-order execution if you start executing a kernel and then start reading values from the GPU (before waiting for the kernel to finish). The events will still be executed in the same order.

Kernel execution and data transfers can overlap even when the operations are done in order. Asynchronous data transfer and kernel execution can still happen with in-order execution (AFAIK).

Well, I can't really comment on why there is slowness with the information you have given. As I mentioned, OpenCL profiling information is not always reliable. I wouldn't trust it.

0 Likes

That is directly the opposite of what the OpenCL specification states - unless the out of order flag is set, one command must finish before the next one in the queue starts. http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html

If ATI's profiler is inaccurate enough to be misleading me on what causes the slowdown, then it is totally useless. Surely the profiler can't report each kernel taking around 0.5ms and a setup time for the first kernel of 20ms when in reality each kernel is taking 5ms.

I'm not sure how the delay could be caused by anything not in the code I supplied. My suspicion is that it has something to do with the large buffer created at the start being copied or moved or something whenever kernels are added to the queue. Are there any differences in how AMD stores buffers when they are not being used, particularly on the Fusion chips, when compared to nvidia?

0 Likes

adrianr wrote:

That is directly the opposite of what the OpenCL specification states - unless the out of order flag is set, one command must finish before the next one in the queue starts. http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html

You might be right, maybe I remembered it wrong. Sorry about that.

0 Likes

Actually, I accidentally stumbled upon something related to this: see page 1-20

http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/AMD_Accelerated...

Also, the example in the OpenCL pages talks about objects shared between two kernels, not one kernel plus an enqueued read (although I am not sure if that was intentional or just a coincidence).

So, once more I am not sure if you are right or I am right but it might be good to be on the safe side....

0 Likes

Profiling is quite unreliable; see http://devgurus.amd.com/thread/159926

0 Likes

Yeah - not so useful:

Unauthorized

Access to this place or content is restricted. If you think this is a mistake, please contact your administrator or the person who directed you here.
0 Likes