I have the following problem:
My application does not scale across multiple GPUs; it is always slightly slower on more GPUs than on fewer.
I was able to narrow it down to a cl::Buffer object. I use the buffer as follows:
First I create an ordinary array with malloc() that holds 20 elements (they are filled later):
int* pOverlap_region = (int*) malloc(80); // 80 bytes = 20 * sizeof(int) on platforms with 4-byte int
After it is filled, I create the buffer object:
cl::Buffer overlap_region(
    this->context.getOpenCLContext(), CL_MEM_COPY_HOST_PTR, 80, pOverlap_region, &err);
this->context.getOpenCLContext() returns the context.
Then it is set as a kernel argument (the cast is unnecessary, since overlap_region already is a cl::Buffer):
err |= kernel.setArg(3, overlap_region);
If this buffer is created but *not* set as a kernel argument, the application scales across multiple GPUs.
Does anybody know why it behaves like this?
Thanks for your replies.
Yes, there are two more buffers. They are created like this:
err = context.getCommandQueue(i).enqueueWriteBuffer(*devicePtrs, CL_FALSE,
i is the device ID given by the context, and hostPtr is a pointer to the data.
Two buffers are created this way.
Do you think this has something to do with "my" buffer?
The other buffers are split, and n/d elements are passed to each device (n being the number of input elements, d the number of devices).
"My" Buffer has different data for every GPU and is passed to every GPU (it is a Buffer, in which adjacent data which resides on another GPU is stored).
Are you trying to send the same buffer to all the GPUs, or are you sending corresponding sub-buffers to each GPU?
It would be nice if you could post a test case and your system information: CPU, GPU, SDK, driver, OS.