Archives Discussions

ThomasUCF · 12-01-2010

Device-level parallelism

Hi all,

I have two GPUs, and I have assigned each one its own context and command queue.

I want them to run in parallel, i.e. device-level parallelism. Will the following

    clFlush(gpu0);   /* gpu0, gpu1: the command queues of the two GPUs */
    clFlush(gpu1);
    clFinish(gpu0);
    clFinish(gpu1);

give me parallel execution on the two GPUs, or will gpu1 only start after gpu0 has finished?

Thanks in advance.

himanshu_gautam · 12-02-2010

ThomasUCF,

If you call clFlush, the commands in that command queue are issued to the device right away, and control returns to the main program.

So I think clFlush is enough; you do not need clFinish to get the two GPUs running in parallel.

Anyhow, I think you are using clFinish as a barrier to make sure both command queues have completed before moving on. But clFinish blocks the host thread, so the CPU cannot be used for other work during that time.

ThomasUCF · 12-02-2010

Hi Himanshu:

Thanks, I'll use that to see if I can speed up my program with two GPUs.

ThomasUCF

dravisher · 12-02-2010

From the ATI Stream SDK OpenCL Programming Guide (rev. 1.05), page 4-44:

The AMD OpenCL implementation spawns a new thread to manage each command queue. Thus, the OpenCL host code is free to manage multiple devices from a single host thread. Note that clFinish is a blocking operation; the thread that calls clFinish blocks until all commands in the specified command-queue have been processed and completed. If the host thread is managing multiple devices, it is important to call clFlush for each command-queue before calling clFinish, so that the commands are flushed and execute in parallel on the devices. Otherwise, the first call to clFinish blocks, the commands on the other devices are not flushed, and the devices appear to execute serially rather than in parallel.

However, the standard is somewhat unclear on whether this behaviour is guaranteed. It only states that queued commands are guaranteed to be issued to the device; it does not guarantee that clFlush will not block the way clFinish does.

Also, the standard states that clEnqueueWriteBuffer and similar functions perform an implicit clFlush when the blocking parameter is true. However, it seems to me that what they really do is closer to a clFinish, since they actually block until the command has completed, not just until it has been issued to the device. This seems a bit inconsistent to me.

Also, my experience with clFlush on a previous SDK was that it took just as long to return as clFinish (i.e. clFlush seemed to be blocking). I haven't tried this on the current SDK, though, so perhaps this behaviour has changed (or something funky was happening on my system).

If clFlush does work as expected for you, please let us know.

himanshu_gautam · 12-05-2010

I enqueued a heavy kernel 30 times in one command queue, then checked two cases while debugging:

1. Calling clFlush returns almost immediately.

2. Calling clFinish takes about 4-5 seconds to return.

So I think they are working as expected.

nou · 12-05-2010

Just a note: clEnqueueNDRangeKernel() is indeed "lazy", which means the kernel does not start executing until you call clFlush or clFinish.
