4 Replies Latest reply on Apr 4, 2011 7:36 AM by Jawed

    EnqueueWriteBuffer for multiple Devices

    centershocksb12
      Hello,

      In a multi-GPU environment, I am having problems with the enqueueWriteBuffer method. The situation is as follows:

      In the method that prepares the data, I create one buffer for each device in my context. (Context is my own class, which creates the OpenCL context and one command queue per device; device is the device index returned by that class.) I only post the parts of the code that I think cause the problem.
      Code:
      cl::vector<cl::Buffer*> overlap_regions;
      for (device = 0; device < participatingDevices; device++) {
         overlap_regions[device] = new cl::Buffer(this->context.getOpenCLContext(), CL_MEM_READ_ONLY,
                                                  sizeof(T) * overlap_range * 2, NULL, &err);
      }

      This loop only allocates device memory.
      After the loop, I create the data I want to pass to the devices through these buffers, in a single host array of 2*overlap_range*participatingDevices*sizeof(T) bytes. Each device needs only part of this array: the first 2*overlap_range elements go to the first device, the next 2*overlap_range elements to the second device, and so on.
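      The layout just described (device d's slice starting at element d*2*overlap_range of one contiguous host array) can be sketched with a few helpers. These are hypothetical names, not part of the OpenCL API or of my code, and they assume the host array is a plain contiguous array of T. The point they illustrate is the unit of each quantity: pointer arithmetic on a T* counts elements, while enqueueWriteBuffer's size argument counts bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers sketching the host-array layout.
// Assumption: the host array is a contiguous array of T holding
// 2 * overlap_range elements per device, device 0 first.

// Element index at which the slice for `device` begins.
inline std::size_t slice_first_element(std::size_t device, std::size_t overlap_range) {
    return device * 2 * overlap_range;
}

// Pointer to the first element of the slice for `device`.
// Pointer arithmetic on a T* is scaled by sizeof(T) automatically,
// so the value added here is an element count, not a byte count.
template <typename T>
const T* slice_ptr(const T* base, std::size_t device, std::size_t overlap_range) {
    assert(base != nullptr);
    return base + slice_first_element(device, overlap_range);
}

// Size of one slice in bytes: the unit enqueueWriteBuffer expects for `size`.
template <typename T>
std::size_t slice_bytes(std::size_t overlap_range) {
    return 2 * overlap_range * sizeof(T);
}
```

      For example, with overlap_range = 3 and T = double, device 1's slice starts at element 6 and is 48 bytes long; adding the byte offset 48 directly to a double* would point at element 48 instead.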
      I then call enqueueWriteBuffer for each device as follows:
      Code:
      for (device = 0; device < participatingDevices; device++) {
         size_t size = 2 * overlap_range * sizeof(T);
         offset = device * 2 * overlap_range * sizeof(T);
         err = this->context.getCommandQueue(device).enqueueWriteBuffer(
               *overlap_regions[device], CL_FALSE, 0, size,
               (void*) (pOverlap_region + offset), NULL, NULL);
         executeKernel(device);
      }
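      In C++, pointer arithmetic on a T* counts elements, not bytes. Assuming pOverlap_region is declared as a T*, pOverlap_region + offset with the byte offset computed in the loop moves offset * sizeof(T) bytes into the array, so every device except device 0 (offset 0) reads from the wrong place or past the end of the array. The host-only sketch below keeps the byte offset but applies it to an unsigned char view of the array. write_slices, the std::vector stand-ins for the device buffers, and std::memcpy in place of enqueueWriteBuffer are all hypothetical; no OpenCL is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for the per-device write loop.
// Each inner vector plays the role of one device's buffer, and
// std::memcpy plays the role of enqueueWriteBuffer(buffer, ..., 0, size, ptr).
template <typename T>
std::vector<std::vector<T>> write_slices(const T* pOverlap_region,
                                         std::size_t participatingDevices,
                                         std::size_t overlap_range) {
    assert(pOverlap_region != nullptr && overlap_range > 0);
    std::vector<std::vector<T>> device_buffers(
        participatingDevices, std::vector<T>(2 * overlap_range));

    // View the host array as raw bytes so that a BYTE offset is valid.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(pOverlap_region);

    for (std::size_t device = 0; device < participatingDevices; ++device) {
        std::size_t size = 2 * overlap_range * sizeof(T);  // bytes per device
        std::size_t offset = device * size;                // byte offset of this slice
        std::memcpy(device_buffers[device].data(), bytes + offset, size);
    }
    return device_buffers;
}
```

      Keeping the T* and using an element offset (device * 2 * overlap_range) instead of a byte offset would work equally well; what matters is that the offset's unit matches the pointer type.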

      Every enqueueWriteBuffer call returns CL_SUCCESS (the check is in my code, but I left it out here).
      executeKernel(device) then runs the kernel on the given device. The buffers created above are set as a kernel argument as follows (the other arguments are omitted):
      Code:
      err |= kernel.setArg(3, *(this->overlap_regions[device]));

      The program works correctly with one device. With two or more devices, however, enqueueWriteBuffer seems to have no effect on the second and subsequent devices, while the result on the first device is still correct.
      I also tried making enqueueWriteBuffer blocking with the CL_TRUE flag, and waiting for the command queue to finish after the call. Neither helped.
      I cannot figure out what causes the problem, and I can give more information if needed. I have only tested this on an NVIDIA Tesla platform, since it is the only system I can access with multiple devices (4); I expect it would occur on other platforms too. I'd appreciate any hints or help.