EnqueueWriteBuffer for multiple Devices

Discussion created by centershocksb12 on Apr 2, 2011
Latest reply on Apr 4, 2011 by Jawed

In a multi-GPU environment, I am having problems with the enqueueWriteBuffer method. The situation is as follows:

In my data-preparation method, I create one buffer for each device in my context. (Context is a class of mine that creates the OpenCL context and a CommandQueue for each device; device is the device ID returned by the Context class.) I am only posting the parts of the code that I think cause the problem.
cl::vector<cl::Buffer*> overlap_regions;
for (device = 0; device < participatingDevices; device++) {
      // one read-only buffer per device, holding 2 * overlap_range elements of type T
      overlap_regions[device] = new cl::Buffer(this->context.getOpenCLContext(), CL_MEM_READ_ONLY, sizeof(T) * overlap_range * 2, NULL, &err);
}

This only allocates device memory.
After the for-loop, I create the data I want to pass to the devices through the buffers above. I use an array of size 2*overlap_range*participatingDevices*sizeof(T). This array is meant to be split, since each device needs only part of the data: the first 2*overlap_range elements are needed on the first device, the next 2*overlap_range elements on the second device, and so on.
So I call enqueueWriteBuffer for each device as follows:
for (device = 0; device < participatingDevices; device++) {
      size_t size = 2 * overlap_range * sizeof(T);
      offset = device * 2 * overlap_range * sizeof(T);
      err = this->context.getCommandQueue(device).enqueueWriteBuffer(
            *overlap_regions[device], CL_FALSE, 0, size,
            (void*) (pOverlap_region + offset), NULL, NULL);

Every enqueueWriteBuffer call returns CL_SUCCESS (I check this in my code, but left the check out here).
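
To make the intended slicing explicit, here is a minimal sketch of the offset arithmetic for the per-device slices. It is only an illustration under assumptions: the helper names (slice_element_offset, slice_byte_offset, slice_pointer) are hypothetical and not part of my code, and the host data is assumed to be a typed array of T. The declaration of pOverlap_region is not shown in this post; if it is a T*, pointer arithmetic already counts in elements, so adding a byte offset to it would land past the intended slice for every device except the first, which may be worth checking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the first element of the slice for `device`, when the host array
// holds 2 * overlap_range elements per device (hypothetical helper).
inline std::size_t slice_element_offset(std::size_t device, std::size_t overlap_range) {
    return device * 2 * overlap_range;
}

// The same position expressed in bytes. This is the unit to use with a
// char* / void* base pointer, not with a typed T* pointer.
template <typename T>
std::size_t slice_byte_offset(std::size_t device, std::size_t overlap_range) {
    return slice_element_offset(device, overlap_range) * sizeof(T);
}

// Host pointer to the slice for `device`. Arithmetic on a typed pointer
// advances in elements, so the element offset is added, not the byte offset.
template <typename T>
const T* slice_pointer(const std::vector<T>& host, std::size_t device, std::size_t overlap_range) {
    const std::size_t first = slice_element_offset(device, overlap_range);
    assert(first + 2 * overlap_range <= host.size());  // slice must lie inside the host array
    return host.data() + first;
}
```

With helpers like these, the host pointer handed to enqueueWriteBuffer for a given device would be slice_pointer(host, device, overlap_range), while the size argument stays 2 * overlap_range * sizeof(T) bytes.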
In the executeKernel(device) method that gets called, the kernel is actually executed on the given device. The buffers created above are set as a kernel argument as follows (the other arguments are omitted):
err |= kernel.setArg(3, *(this->overlap_regions[device]));

When I run the program after compilation, it works correctly for one device. But when I use two or more devices, it seems that the enqueueWriteBuffer calls have no effect for the second and subsequent devices. The calculation on the first device is still correct.
I also tried making enqueueWriteBuffer blocking with the CL_TRUE flag, and waiting for the CommandQueue to finish after the call. Neither helped.
I cannot figure out what causes the problem. I can give additional information if needed. The behaviour has only been tested on an NVIDIA Tesla platform, since it is the only one I have access to that has multiple devices (4). It will most likely occur on other platforms too. I would appreciate any hints or help...