Hello,
I have the following problem:
My application does not scale across multiple GPUs. It is always a bit slower on more GPUs than on fewer.
I figured out that a cl::Buffer object is causing this. I use the buffer as follows:
First I create an ordinary array of 20 elements with malloc() (they are filled in later):
int* pOverlap_region = (int*) malloc(80);
After it is filled, I create the buffer object:
cl::Buffer overlap_region = cl::Buffer::Buffer(
this->context.getOpenCLContext(), CL_MEM_COPY_HOST_PTR, 80, pOverlap_region, &err);
this->context.getOpenCLContext() returns the context.
Then it is set as an argument for the kernel:
err |= kernel.setArg(3, (cl::Buffer) overlap_region);
If this buffer is created but *not* set as a kernel argument, the application scales across multiple GPUs.
Does anybody know why it behaves like this?
Thanks for your replies
Try adding CL_MEM_READ_ONLY to the flags if you only read from this buffer. Otherwise the runtime may synchronize the buffer across the GPUs, which serializes the work.
Hey nou,
thanks for the fast reply.
I added the CL_MEM_READ_ONLY flag, but there is no change in behavior...
Do you use any other buffers in the kernel?
Yes, there are two more buffers. They are created like this:
err = context.getCommandQueue(i).enqueueWriteBuffer(*devicePtrs, CL_FALSE,
0,
sizePerDevice,
(void*)(((char*)hostPtr)+offset) );
i is the device ID given by the context, and hostPtr is a pointer to the data.
Two buffers are set up as shown above.
Do you think this has something to do with "my" buffer?
The question is whether these other buffers are used per GPU or shared across all GPUs. BTW, is that buffer of yours shared across all GPUs?
The other buffers are split, and n/d elements are passed to each device (n being the number of input elements, d the number of devices).
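For readers following along, the n/d split described here could be computed on the host roughly as in the sketch below. It is only an illustration under assumptions: DeviceSlice and splitEvenly are hypothetical names that do not come from the original application, the handling of a remainder (n not divisible by d) is a guess, and no OpenCL calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the slice of host data that one device receives.
struct DeviceSlice {
    std::size_t offset;        // byte offset into the host data
    std::size_t sizePerDevice; // number of bytes sent to this device
};

// Split n elements of elemSize bytes over d devices. The first (n % d)
// devices receive one extra element so that every element is assigned.
std::vector<DeviceSlice> splitEvenly(std::size_t n, std::size_t d,
                                     std::size_t elemSize) {
    assert(d > 0); // at least one device is required
    std::vector<DeviceSlice> slices;
    const std::size_t base = n / d;
    const std::size_t rest = n % d;
    std::size_t first = 0; // index of the first element of the current slice
    for (std::size_t i = 0; i < d; ++i) {
        const std::size_t count = base + (i < rest ? 1 : 0);
        slices.push_back({first * elemSize, count * elemSize});
        first += count;
    }
    return slices;
}
```

Each offset/sizePerDevice pair could then be handed to a call like the enqueueWriteBuffer shown earlier, one per device.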
"My" Buffer has different data for every GPU and is passed to every GPU (it is a Buffer, in which adjacent data which resides on another GPU is stored).
Are you trying to send the same buffer to all the GPUs, or are you sending corresponding sub-buffers to each GPU?
It would be nice if you could post a test case and your system information: CPU, GPU, SDK, driver, OS.
Thanks
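If the sub-buffer route is tried, each device's part of the parent buffer is described by an origin and a size in bytes, and OpenCL requires the origin to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN (which is reported in bits). A minimal host-side sketch of computing such regions follows. Region is a hypothetical stand-in that mirrors the origin/size fields of cl_buffer_region, alignedRegions is a made-up helper, and the alignment is assumed to be queried elsewhere and passed in as bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cl_buffer_region (same two fields, in bytes).
struct Region {
    std::size_t origin;
    std::size_t size;
};

// Split totalBytes into at most `devices` regions whose origins are
// multiples of alignBytes (assumed: CL_DEVICE_MEM_BASE_ADDR_ALIGN / 8).
std::vector<Region> alignedRegions(std::size_t totalBytes, std::size_t devices,
                                   std::size_t alignBytes) {
    assert(devices > 0 && alignBytes > 0);
    // Even share per device, rounded up to the next multiple of alignBytes.
    std::size_t chunk = (totalBytes + devices - 1) / devices;
    chunk = (chunk + alignBytes - 1) / alignBytes * alignBytes;
    std::vector<Region> regions;
    for (std::size_t origin = 0;
         origin < totalBytes && regions.size() < devices; origin += chunk) {
        const std::size_t remaining = totalBytes - origin;
        regions.push_back({origin, remaining < chunk ? remaining : chunk});
    }
    return regions;
}
```

With small buffers the rounding can leave fewer regions than devices, so a buffer as small as the 80-byte one in the question would likely end up as a single region.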