cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

cadorino
Journeyman III

Initialize buffers within a context with multiple devices

Hi.

I want to multiply two matrices, doing C = A * B.
Now, I've three OpenCL devices in my context (an AMD CPU and 2 AMD GPUs, one integrated and one discrete) and I'd like to make the computation/data to be splitted among all of them.
The target is that each device uses the whole matrix B and a slice (some rows of A) to compute a slice of C.

While writing the code, I discovered that clCreateBuffer creates a buffer which is shared among all those devices. Anyway, I've two question I was not able to answer.

1) How allocation on a particular device is correlated with the call to clCreateBufferFunction? Does the allocation happen as soon as the function is called of when the buffer is initialized or, again, when the content of the buffer is firstly used by the kernel? Since I'm using a discrete GPU and an integrated one, it would be nice to allocate a buffer for B on the pinned host memory, so that a copy is (or may be) allocated on the discrete card, but no allocation is made on the integrated GPU (would be better to rely on memory sharing to access B without copy). Are the current driver implementing some smart strategies like this even in a context with integrated and discrete multiple GPUs?

2) What is the behaviour of buffer initialization? Do I have to do clEnqueueWriteBuffer (or clEnqueueMapBuffer + write) for each device or it is enough to do it once (and all the devices will see the content put into the buffer)?

Thank you very much!

0 Likes
23 Replies
nou
Exemplar

Initialize buffers within a context with multiple devices

buffers is created on device when it is first accesed with kernel. where exactly is stored is staded in AMD OpenCL programing guide. you need enqueue only one write. also you need synchronize acces to buffer with events if kernels change content of buffer.

0 Likes
cadorino
Journeyman III

Re: Initialize buffers within a context with multiple devices

Thank you very much for your answer.

Anyway, I think that the OpenCL API is actually misleading: why the enqueue/map buffer functions require to specify a device id if the buffer is replicated/shared among all the devices in the context? I think that those functions should have a signature similare to createBuffer, where only the context is required. Am I wrong?

0 Likes
cadorino
Journeyman III

Re: Initialize buffers within a context with multiple devices

Ok, probably if a device changes the content of the buffer I can't retrieve that content through another device (without synchronization). In this case the device id to read/write a buffer makes sense.

0 Likes
yurtesen
Miniboss

Re: Initialize buffers within a context with multiple devices

0 Likes
cadorino
Journeyman III

Re: Initialize buffers within a context with multiple devices

Sorry, I meant the command queue that, if I'm not wrong, is uniquvocally associated to a device. Therefore, if I have a buffer shared between two devices and I want the host to initialize it, it makes no difference the command queue (the one associated to first devices, or that associated to the second) that I use to enqueue a buffer write. Right?

0 Likes
yurtesen
Miniboss

Re: Initialize buffers within a context with multiple devices

I dont know for sure. From the OpenCL documentation, as far as I understand, it looks like you should queue map/write commands to both queues. See:
http://www.codeproject.com/Articles/201258/Part-5-OpenCL-Buffers-and-Memory-Affinity

0 Likes
nou
Exemplar

Re: Initialize buffers within a context with multiple devices

yes you need enqueue only one read/write when you share buffer. just be sure to set proper CL_MEM_READ/WRITE flag as OpenCL use this flag to determine which buffers is need to synchronise across devices.

so when is buffer shared use any queue which device will use this buffer. when is buffer device specific use queue that device. also there is cl_ext_migrate_memobject where you can explicit state on which device you want buffer. http://www.khronos.org/registry/cl/extensions/ext/cl_ext_migrate_memobject.txt

0 Likes
cadorino
Journeyman III

Initialize buffers within a context with multiple devices

Ok, I confirm that one only write has to be done by the host. Anyway, for some memory flags combinations the initialization is not propagated among the devices.
For example, if the two devices are the integrated GPU (AMD 6550D) and the discrete one (HD 5870) only with ALLOC_HOST_PTR and buffer map/unmap I obtain the initialization sharing.

If the two devices are a CPU and a GPU, also CL_MEM_USE_PERSISTENT_MEM_AMD works.


I'll check if there some bugs into my programs.
By now I ask you: there are any requirements in buffer memory flags to allow buffer writes (by the host) to be propagated to all the devices in the shared context?

Thank you very much!

0 Likes
cadorino
Journeyman III

Re: Initialize buffers within a context with multiple devices

I'm performing a vector addition (y = ax + y) and I'm trying to allocate x and y so that they are shared between two devices.
I verified that only using ALLOC_HOST_PTR the buffer sharing works. In other cases (no flags, CL_USE_PERSISTENT_MEM_AMD) it works only for some vector sizes.

Really strange.

Maybe I'm doing something wrong in synchronizing the executions on devices.
I show you the relevant part of the host code:

cl_command_queue default_queue = pf_data.device_settings[0]->Queue()->Queue();

cl_int err;

//alloc vectors

if(!map_src)

     x_pointer = new float[whole_data_size];

if(!map_dst)

    y_pointer = new float[whole_data_size];

//context is shared, src_flags and dst_flags are the mem flags for buffer x and y

x_buffer = clCreateBuffer(context, src_flags, whole_data_size * sizeof(float), NULL, &err);

y_buffer = clCreateBuffer(context, dst_flags, whole_data_size * sizeof(float), NULL, &err);

//init data (default queue is the command queue of the first device)

if(map_src)

    x_pointer = (float*)clEnqueueMapBuffer(default_queue, x_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);

if(map_dst)

    y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);

//Init vector pair initializes the content of the two vectors in a fixed way

initVectorPair(x_pointer, y_pointer, whole_data_size);

if(map_src)

     err |= clEnqueueUnmapMemObject(default_queue, x_buffer, x_pointer, 0, NULL, NULL);

else

    err |= clEnqueueWriteBuffer(default_queue, x_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), x_pointer, 0, NULL, NULL);

if(map_dst)

    err |= clEnqueueUnmapMemObject(default_queue, y_buffer, y_pointer, 0, NULL, NULL);

else

    err |= clEnqueueWriteBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);

//set arguments

unsigned int offset = 0;

for(unsigned int i = 0; i < devices.size(); i++) {

     //I use a data structure to store the global and local sizes for each device

    unsigned int global_size = devices->GlobalSizes()[0];

    unsigned int local_size = devices->LocalSizes()[0];

    cl_kernel kernel = devices->Kernel()->Kernel();

    err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&x_buffer);

    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&y_buffer);

    err |= clSetKernelArg(kernel, 2, sizeof(float), (void*)&a);

    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), (void*)&offset);

    err |= clSetKernelArg(kernel, 4, sizeof(unsigned int), (void*)&global_size);

    err |= clEnqueueNDRangeKernel(devices->Queue()->Queue(), kernel, 1, NULL, &global_size, &local_size, 0, NULL, &completion_events);

    err |= clFlush(devices->Queue()->Queue());

     //the offset of the next device is the first element not i the range of the elements computed by the current device

    offset += data_sizes;

}

for(unsigned int i = 0; i < devices.size(); i++)

    err |= clFinish(devices->Queue()->Queue());

if(map_dst)

    y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_READ, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);

else

    err |= clEnqueueReadBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);

0 Likes