
Zoltan_Maric
Journeyman III

clFlush efficiency

I recently learned that enqueuing non-blocking commands requires an explicit call to clFlush.

My question is: is it more efficient to load the queue with all the commands I wish to dispatch and flush once at the end, or is it better to flush after each command? The commands are (a rough sketch follows the list):

  • write to the input buffer (225280 B)
  • execute 2 kernels working on the same read-only buffer
  • read from the 2 result buffers (880 B and 14080 B)
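
Roughly, the sequence looks like the sketch below. This is not the actual code from the application: the queue, the two kernels (with their arguments already set), the buffers and the host pointers are assumed to exist, and every identifier is a placeholder. The point of interest is the single clFlush at the end.

```c
/* Sketch only: context, queue, kernels and buffers are assumed to have been
 * created elsewhere, and kernel arguments are assumed to be set already.
 * All names here are hypothetical. */
#include <CL/cl.h>

#define INPUT_BYTES    225280   /* input buffer size from the post */
#define RESULT_A_BYTES    880
#define RESULT_B_BYTES  14080

cl_int enqueue_batch(cl_command_queue queue,
                     cl_kernel kernel_a, cl_kernel kernel_b,
                     cl_mem input_buf, cl_mem result_a, cl_mem result_b,
                     const void *host_in, void *host_a, void *host_b,
                     size_t global_a, size_t global_b,
                     cl_event *done)           /* event of the last read */
{
    cl_int err;

    /* 1. Non-blocking write of the input buffer. */
    err = clEnqueueWriteBuffer(queue, input_buf, CL_FALSE, 0,
                               INPUT_BYTES, host_in, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* 2. Two kernels working on the same read-only input buffer. */
    err = clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL,
                                 &global_a, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;
    err = clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL,
                                 &global_b, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* 3. Non-blocking reads of the two result buffers. */
    err = clEnqueueReadBuffer(queue, result_a, CL_FALSE, 0,
                              RESULT_A_BYTES, host_a, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;
    err = clEnqueueReadBuffer(queue, result_b, CL_FALSE, 0,
                              RESULT_B_BYTES, host_b, 0, NULL, done);
    if (err != CL_SUCCESS) return err;

    /* 4. One clFlush for the whole batch instead of one per command. */
    return clFlush(queue);
}
```

The alternative being asked about would simply call clFlush(queue) after each of the five enqueue calls above.
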
nou
Exemplar

IMHO, in the current implementation clFlush() == clFinish().

But I think that calling clFlush() after a block of commands is better.


Originally posted by: nou IMHO, in the current implementation clFlush() == clFinish(), but I think that calling clFlush() after a block of commands is better.

Yes, it seems that NVIDIA is currently the only implementation where clFlush does not block; on the others, clFinish is effectively just a wrapper around clFlush. OSX behaves the same as ATI. IBM's current behavior is unknown.

In your situation, there is probably not a big difference, but it seems like just trying both ways would not be that hard.

The big impact of the ATI and OSX implementations is that a portable multi-GPU application must use a thread-per-device approach; otherwise it will just toggle between blocking on each device.
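
A rough sketch of the thread-per-device layout, assuming each device already has its own in-order command queue and a kernel with its arguments set. device_job, device_thread and run_on_all_devices are made-up names, and error checking is omitted for brevity.

```c
/* Sketch only: one thread per device, so that a blocking clFlush/clFinish on
 * one device's queue only stalls that thread, not submissions to the others. */
#include <CL/cl.h>
#include <pthread.h>

typedef struct {
    cl_command_queue queue;   /* one in-order queue per device */
    cl_kernel        kernel;  /* arguments assumed to be set already */
    size_t           global;  /* global work size for this device's share */
} device_job;

static void *device_thread(void *arg)
{
    device_job *job = (device_job *)arg;

    /* Enqueue this device's commands, then submit and wait on its own queue.
     * If clFlush blocks on this platform, it only blocks this thread. */
    clEnqueueNDRangeKernel(job->queue, job->kernel, 1, NULL,
                           &job->global, NULL, 0, NULL, NULL);
    clFlush(job->queue);
    clFinish(job->queue);
    return NULL;
}

void run_on_all_devices(device_job *jobs, unsigned num_devices)
{
    pthread_t threads[16];   /* assumes num_devices <= 16 for brevity */
    unsigned i;

    for (i = 0; i < num_devices; ++i)
        pthread_create(&threads[i], NULL, device_thread, &jobs[i]);
    for (i = 0; i < num_devices; ++i)
        pthread_join(threads[i], NULL);
}
```
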

Originally posted by: nou IMHO, in the current implementation clFlush() == clFinish(), but I think that calling clFlush() after a block of commands is better.

So let me get this straight: enqueuing non-blocking commands does not start them, and clFlush starts them but also blocks? How are you supposed to implement a non-blocking function then?

Also, my tests have shown that what you said may not be true. I called clFlush after queuing all the commands I needed, did some work in the host program, and then called clWaitForEvents just before I needed the results.

I timed the duration of the clWaitForEvents call, and it turned out to be around 4-5 ms. When I removed the flush at the end of enqueuing, clWaitForEvents blocked for 14-15 ms. So you could say I managed to do what I intended.
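
For reference, the pattern described above would look roughly like the sketch below. It is only a sketch: last_read_event stands for the event returned by the final clEnqueueReadBuffer, and do_host_work is a placeholder for the host-side computation that overlaps the device work.

```c
/* Sketch of the overlap pattern: flush once, do host work, then wait on the
 * event of the final read just before the results are needed. */
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

static void do_host_work(void)
{
    /* placeholder for whatever host-side computation overlaps the device work */
}

void overlap_example(cl_command_queue queue, cl_event last_read_event)
{
    struct timespec t0, t1;

    clFlush(queue);        /* submit the queued commands; observed non-blocking */

    do_host_work();        /* host work overlaps the device work */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    clWaitForEvents(1, &last_read_event);  /* block only when results are needed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("clWaitForEvents blocked for %.1f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
}
```

Timing only the wait, as above, is what isolates how much of the work was already submitted and completed thanks to the earlier clFlush.
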

Now I've also timed the duration of clFlush and clFinish at the end of enqueuing:

  • clFlush: 0-1 ms
  • clFinish: 15-16 ms

I believe I have presented enough evidence to show that clFlush is NOT a wrapper for clFinish, or vice versa.


BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.

And to answer my own question: calling one clFlush after all the enqueuing is much faster than calling clFlush after each command.


Originally posted by: Zoltan.Maric BTW, I am working with the newest ATI implementation, on a Radeon HD 5830. And to answer my own question: calling one clFlush after all the enqueuing is much faster than calling clFlush after each command.

I am glad you got your answer, and that clFlush was not observed blocking. For things that are easy to test experimentally, it is usually a good idea to do so yourself. Asking on a forum is also good, but it is not always definitive, and sometimes the info is stale (especially when searching old threads). For tough problems, though, forum feedback is hard to do without.

I went to double-check my NetBeans CPU profiling results for this on OSX, but could not find them. (NetBeans has a great profiler that can find hotspots and hierarchically track CPU time and call counts by method and by thread. It displays the hierarchies in a tree/table format, which is great for drilling down. The only problem is that you cannot give the results a name on save unless you save them externally to the project, so I do not keep them long.)

Retrying, I did not observe OSX's clFlush effectively blocking. No doubt few on this forum care what that platform does, but I like to correct myself as necessary, and a search engine will still pick this up.
