Archives Discussions

Zoltan_Maric · ‎06-09-2010

I recently learned that enqueuing non-blocking commands requires an explicit call to clFlush.

My question is: is it more efficient to load the queue with all the commands I wish to dispatch and flush once at the end, or is it better to flush after each command? The commands are:

write to the input buffer (size 225280 B)
execute 2 kernels working on the same read-only buffer
read from the 2 result buffers (sizes 880 B and 14080 B)

nou · ‎06-09-2010

IMHO, in the current implementation clFlush() == clFinish().

But I think that calling clFlush() after a block of commands is better.

jcpalmer · ‎06-09-2010

Originally posted by: nou IMHO, in the current implementation clFlush() == clFinish().

But I think that calling clFlush() after a block of commands is better.

Yes, it seems that NVIDIA's is currently the only implementation where clFlush does not block; on the others, clFinish is effectively just a wrapper around clFlush. OS X behaves the same as ATI. IBM's current behavior is unknown.

In your situation there is probably not a big difference, but trying both ways should not be that hard.

The big impact of the ATI and OS X implementations is that a portable multi-GPU application must use a thread-per-device approach; otherwise it will just alternate between blocking on each device.

Zoltan_Maric · ‎06-09-2010

Originally posted by: nou IMHO, in the current implementation clFlush() == clFinish().

But I think that calling clFlush() after a block of commands is better.

So let me get this straight: enqueuing non-blocking functions does not set them off; clFlush sets them off, but also blocks? How are you supposed to implement a non-blocking function, then?

Also, my tests have shown that what you said may not be true. I called clFlush after queuing all the commands needed, did some work in the host program, and then called clWaitForEvents just before I needed the results.

I timed the duration of the clWaitForEvents call, and it turned out to be around 4-5 ms. When I removed the flush at the end of enqueuing, clWaitForEvents blocked for 14-15 ms. So the flush did start the commands while the host kept working, which is what I intended.

Now I've also timed the duration of clFlush and clFinish at the end of enqueuing:

clFlush: 0-1 ms
clFinish: 15-16 ms
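
The thread does not say how these timings were taken. One possible way to take such measurements on a POSIX system is sketched below; the names `time_call_ms` and `sleep_20ms` are hypothetical, and `sleep_20ms` is only a stand-in workload so the helper can be tried without a device.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <stddef.h>
#include <time.h>

/* Returns the wall-clock time in milliseconds that fn(arg) takes,
 * measured with the monotonic clock. */
static double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Stand-in workload (an assumption for illustration): sleeps about 20 ms.
 * In a real test this would be a wrapper that calls clFlush or clFinish. */
static void sleep_20ms(void *unused)
{
    struct timespec d = { 0, 20 * 1000 * 1000 };
    (void)unused;
    nanosleep(&d, NULL);
}
```

To time clFlush itself, one would pass a small wrapper with the same signature that casts `arg` back to `cl_command_queue` and calls `clFlush` on it. Averaging several runs is advisable, since single measurements in the 0-1 ms range are close to timer and scheduling noise.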

I believe I have presented enough evidence to show that clFlush is NOT a wrapper for clFinish, or vice versa.

Zoltan_Maric · ‎06-09-2010

BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.

And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.

jcpalmer · ‎06-10-2010

Originally posted by: Zoltan.Maric BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.

And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.

I am glad you got your answer and that clFlush was not observed blocking. For things that are easy to test experimentally, it is usually a good idea to do so yourself. Asking on a forum is also good, but the answers are not always definitive, and sometimes the information is stale (especially when searching old threads). For tough problems, though, forum feedback is difficult to ignore.

I went to double-check my NetBeans CPU profiling results for this on OS X, but could not find them. (NetBeans has a great profiler that can find hotspots and hierarchically track CPU time and number of calls by method and by thread. It displays the hierarchies in a tree/table format, which is great for drilling down. The only problem is that you cannot name the results on save unless you save them outside the project, so I do not keep them long.)

Retrying, I did not observe clFlush effectively blocking on OS X. No doubt few on this forum care what that platform does, but I like to correct myself as necessary. A search engine would still pick this up.

clFlush efficiency