
centershocksb12
Journeyman III

EnqueueWriteBuffer for multiple Devices

Hello,

In a multi-GPU environment, I am having problems with the enqueueWriteBuffer method. The situation is as follows:

In my method that prepares the data, I create as many buffers as there are devices in my context (Context is a class in which the OpenCL context and a CommandQueue for each device are created; device is the device ID returned by the Context class). I am only posting the parts of the code that I think cause the problem.
Code:
cl::vector<cl::Buffer*> overlap_regions;
for (device = 0; device < participatingDevices; device++) {
      overlap_regions[device] = new cl::Buffer(this->context.getOpenCLContext(), CL_MEM_READ_ONLY, sizeof(T) * overlap_range * 2, NULL, &err);
   }

This is simply done to allocate device memory.
After the for-loop, I create the data I want to pass to the devices through the buffers above. I use an array of size 2*overlap_range*participatingDevices*sizeof(T). This array is supposed to be split, since each device only needs part of the data (the first 2*overlap_range elements are needed on the first device, the next 2*overlap_range elements on the second device, and so on).
So I call enqueueWriteBuffer for each device as follows:
Code:
for (device = 0; device < participatingDevices; device++) {
      size_t size = 2 * overlap_range * sizeof(T);
      offset = device * 2 * overlap_range * sizeof(T);
      err = this->context.getCommandQueue(device).enqueueWriteBuffer(
            *overlap_regions[device], CL_FALSE, 0, size,
            (void*) (pOverlap_region + offset), NULL, NULL);
      executeKernel(device);
   }

The enqueueWriteBuffer calls return CL_SUCCESS every time (the check is in my code, but I skipped it here).
In the executeKernel(device) method, the kernel is actually executed on the given device. The buffer created above for that device is set as a kernel argument as follows (the other arguments are skipped):
Code:
err |= kernel.setArg(3, *(this->overlap_regions[device]));

When I run the program after compilation, it works correctly for one device. But when I use two or more devices, it seems that the enqueueWriteBuffer calls do not work for the second and following devices. The calculation on the first device is still correct.
I also tried making enqueueWriteBuffer blocking with the CL_TRUE flag, and waiting for the CommandQueue to finish after the call. Neither worked.
I cannot figure out what causes the problem. I can give additional information if needed. The behaviour has only been tested on an NVIDIA Tesla platform, since it is the only one I can access that has multiple devices (4). It will most likely occur on other platforms too. I appreciate any hints or help...
Jawed
Adept II

First, try running your code for the 1st device (not the 0th), e.g. change the loop for copying and execution to start at 1 instead of 0. Or change the loop so that it only iterates once.


I think you will have better luck on the NVIDIA forums for this.

You seem to be passing *(this->overlap_regions[device]) instead of overlap_regions[device].


Hi,

I tried both your approaches (also combined):

himanshu.gautam's approach gave me a CL_INVALID_MEM_OBJECT error when calling kernel.setArg.

Jawed's approach:

I started the loop from device=1. With one device, nothing was done on the GPU since the kernel was never started. In the multi-GPU case, the buffers still do not seem to be copied onto the appropriate devices.

Any other ideas? Thanks!


Why didn't the kernel start?

I suggest you debug the pointers. It seems to me that your 0th device is working when it shouldn't, by pure luck. Any time stuff only works for the zeroth case should make you look at the pointers.

In other words, I think this is a problem in your class, not an OpenCL problem.
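
To make the pointer check concrete: one thing that would fit the symptom (device 0 correct, later devices wrong) is the host pointer expression in the write loop. This is only a guess, since the declaration of pOverlap_region is not shown. If it is a T*, then pOverlap_region + offset advances by offset * sizeof(T) bytes, and offset has already been multiplied by sizeof(T). For device 0 the offset is 0, so nothing goes wrong there. Below is a minimal sketch in plain C++ (no OpenCL needed); all helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: index of the first element of `device`'s slice,
// counted in elements of T. This is the unit that arithmetic on a T* uses.
inline std::size_t slice_offset_elements(std::size_t device,
                                         std::size_t overlap_range) {
    return device * 2 * overlap_range;
}

// The offset as computed in the posted loop: the same quantity in bytes.
template <typename T>
std::size_t slice_offset_bytes(std::size_t device, std::size_t overlap_range) {
    return slice_offset_elements(device, overlap_range) * sizeof(T);
}

// Hypothetical check: does a slice of 2 * overlap_range elements starting at
// `offset` (in elements) fit in a host array of `total_elements` elements?
inline bool slice_in_bounds(std::size_t offset, std::size_t overlap_range,
                            std::size_t total_elements) {
    return offset <= total_elements &&
           2 * overlap_range <= total_elements - offset;
}

// Hypothetical debugging aid: assert that every device's slice fits when the
// offset is expressed in elements.
inline void check_slices(std::size_t devices, std::size_t overlap_range,
                         std::size_t total_elements) {
    for (std::size_t d = 0; d < devices; ++d) {
        assert(slice_in_bounds(slice_offset_elements(d, overlap_range),
                               overlap_range, total_elements));
    }
}
```

With overlap_range = 8 and 4 devices, a host array of 2 * 8 * 4 = 64 elements passes check_slices. Device 1's slice starts at element 16, but its byte offset for a 4-byte T is 64, and adding that to a T* already lands at the end of the array. If pOverlap_region really is a T*, passing pOverlap_region + slice_offset_elements(device, overlap_range) as the host pointer, while keeping size in bytes, would be the corresponding change. If it is a char* or void*, this explanation does not apply.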
