I recently learned that enqueuing non-blocking commands requires explicit calling of clFlush.
My question is: Is it more efficient to load the queue with all the commands I wish to dispatch - and flush at the end, or is it better to flush after each command? The commands being
IMHO, in the current implementation clFlush() == clFinish(),
but I think that calling clFlush() after a block of commands is better.
Originally posted by: nou
IMHO, in the current implementation clFlush() == clFinish(),
but I think that calling clFlush() after a block of commands is better.
Yes, it seems that NVIDIA's is currently the only implementation where clFlush does not block; on the others, clFinish is effectively just a clFlush wrapper. OSX behaves the same as ATI. IBM's current behavior is unknown.
In your situation, there is probably not a big difference, but it seems like just trying both ways would not be that hard.
The big impact of the ATI and OSX implementations is that a portable multi-GPU application must use a thread-per-device approach; otherwise it is just going to toggle between blocking on each device.
Originally posted by: nou
IMHO, in the current implementation clFlush() == clFinish(),
but I think that calling clFlush() after a block of commands is better.
So let me get this straight: enqueuing non-blocking functions does not set them off, clFlush sets them off, but also blocks? How are you supposed to implement a non-blocking function then?
Also, my tests have shown that what you said may not be true. I called clFlush after queuing all the commands needed. Did some work in the host program, and then called clWaitForEvents just before I needed the results.
I timed the duration of the clWaitForEvents call, and it turned out to be around 4-5 ms. When I remove flushing at the end of enqueuing, clWaitForEvents blocked for 14-15 ms. So you could say I managed to do what I intended.
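One rough way to read these two numbers is a back-of-the-envelope model, sketched below in Python. It is only an illustration, not a measurement: it assumes the device starts working as soon as the queue is flushed, that clWaitForEvents flushes the queue itself if nothing did so earlier, and that host and device work overlap perfectly. The function name `wait_ms` and the 10 ms of host work are hypothetical; only the roughly 15 ms and 5 ms waits come from the timings above.

```python
def wait_ms(device_ms, host_ms, flushed_early):
    """Estimated time clWaitForEvents blocks, in a toy overlap model.

    Assumptions (not verified in this thread): the device begins work at
    the flush, an unflushed queue is only submitted by the wait itself,
    and host work overlaps device work perfectly.
    """
    if flushed_early:
        # The device has already been running while the host was busy.
        return max(0.0, device_ms - host_ms)
    # Nothing was submitted before the wait, so all device work remains.
    return float(device_ms)


DEVICE_MS = 15.0  # roughly the wait observed without clFlush
HOST_MS = 10.0    # hypothetical host-side work; not stated in the thread

print(wait_ms(DEVICE_MS, HOST_MS, flushed_early=True))   # 5.0
print(wait_ms(DEVICE_MS, HOST_MS, flushed_early=False))  # 15.0
```

Under those assumptions, the 10 ms difference between the two waits would simply be the host work that ran while the device was already busy, which is consistent with clFlush returning without blocking.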
Now I've also timed the duration of clFlush and clFinish at the end of enqueuing:
I believe I have presented enough evidence to prove that clFlush is NOT a wrapper for clFinish, or vice versa.
BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.
And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.
Originally posted by: Zoltan.Maric
BTW, I am working with the newest ATI implementation, on a Radeon HD 5830.
And to answer my own question: calling clFlush once after all the enqueuing is much faster than calling it after enqueuing each command.
I am glad you got your answer, and that clFlush was not observed blocking. For anything that is easy to test experimentally, it is usually a good idea to do so yourself. Asking on a forum is also good, but the answers are not always definitive, and sometimes the info is stale (especially when searching old threads). For tough problems, though, forum feedback is difficult to ignore.
I went to double-check whether I still had my NetBeans CPU profiling results for this on OSX, but could not find them. (NetBeans has a great profiler that can find hotspots and hierarchically track CPU time and number of calls by method and by thread. It displays the hierarchies in a tree/table format, which is great for drilling down. The only problem is that you cannot give the results a name on save unless you are saving outside the project, so I do not keep them long.)
Re-trying, I did not observe OSX clFlush effectively blocking. No doubt few on this forum care what that platform does, but I like to correct myself as necessary. A search engine would still pick this up.