If you call clFlush, the commands in that command queue are forced to start then and there, and control returns to the main program.
So I think clFlush is enough, and you do not need clFinish to run the program in parallel.
Anyhow, I think you are using clFinish as a barrier to make sure the command queues have finished executing before moving forward. But this prevents us from using the CPU during that time.
Thanks, I'll use that to see if I can speed up my program with two GPUs.
From the ATI Stream SDK OpenCL Programming Guide (rev. 1.05), page 4-44:
The AMD OpenCL implementation spawns a new thread to manage each
command queue. Thus, the OpenCL host code is free to manage multiple
devices from a single host thread. Note that clFinish is a blocking operation;
the thread that calls clFinish blocks until all commands in the specified
command-queue have been processed and completed. If the host thread is
managing multiple devices, it is important to call clFlush for each command-
queue before calling clFinish, so that the commands are flushed and execute in
parallel on the devices. Otherwise, the first call to clFinish blocks, the
commands on the other devices are not flushed, and the devices appear to
execute serially rather than in parallel.
However, the standard is rather unclear on whether this is necessarily going to be the behaviour. It only states that clFlush guarantees that all previously queued commands are issued to the device. It does not guarantee that clFlush will not block (like clFinish does).
Also, the standard states that commands like clEnqueueWriteBuffer and similar functions perform an implicit clFlush if the blocking parameter is true. However, it seems to me that what they really do is issue clFinish, since they actually block until the command is completed, not just until it is issued to the device. This seems a bit inconsistent to me.
Also my experience with clFlush on a previous SDK was that it actually took just as long to return as clFinish (i.e. clFlush seemed to be blocking). I haven't tried this on the current SDK though, so perhaps this behaviour has changed (or something funky was happening on my system).
If clFlush does work as expected for you, please let us know.
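For reference, the ordering the guide describes (flush every command queue first, then wait on each) could be sketched roughly like this. This is only a model of the call order, not real host code: so that it compiles without the OpenCL headers, the two queue operations are passed in as function pointers. The names flush_all_then_finish, queue_op, demo_flush and demo_finish are made up for this sketch. In a real program the callbacks would be thin wrappers around clFlush and clFinish, and a status of 0 corresponds to CL_SUCCESS.

```c
#include <stddef.h>

/* Hypothetical signature shared by both queue operations. In real host code
   these would be small wrappers, e.g. one that casts the pointer to
   cl_command_queue and returns the status of clFlush or clFinish. */
typedef int (*queue_op)(void *queue);

/* Flush every queue first so that all devices can start working, and only
   then wait on each queue in turn. Returns 0 on success, or the first
   non-zero status reported by a callback. */
int flush_all_then_finish(void *const *queues, size_t count,
                          queue_op flush, queue_op finish)
{
    for (size_t i = 0; i < count; ++i) {
        int status = flush(queues[i]);
        if (status != 0)
            return status;
    }
    for (size_t i = 0; i < count; ++i) {
        int status = finish(queues[i]);
        if (status != 0)
            return status;
    }
    return 0;
}

/* Demo stand-ins (assumptions, not OpenCL): they only record the call order,
   'F' for a flush and 'W' for a blocking wait. */
static char call_log[64];
static size_t call_len = 0;

static int demo_flush(void *queue)
{
    (void)queue;
    if (call_len < sizeof call_log)
        call_log[call_len++] = 'F';
    return 0;
}

static int demo_finish(void *queue)
{
    (void)queue;
    if (call_len < sizeof call_log)
        call_log[call_len++] = 'W';
    return 0;
}
```

With two queues and the demo stand-ins, the recorded order is both flushes followed by both waits. Calling the finish operation on the first queue before the second has been flushed is the pattern the guide warns can make the devices appear to run serially.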
I enqueued a heavy kernel 30 times in a command queue. Then I checked two cases while debugging:
1. Calling clFlush, which returns almost immediately.
2. Calling clFinish, which takes about 4-5 seconds to return.
So I think they are working as expected.
Just a note: clEnqueueNDRangeKernel() is indeed "lazy", meaning that it does not start execution until you call clFlush/clFinish.