
Initialize buffers within a context with multiple devices


I want to multiply two matrices, doing C = A * B.
Now, I have three OpenCL devices in my context (an AMD CPU and 2 AMD GPUs, one integrated and one discrete) and I'd like to split the computation/data among all of them.
The target is that each device uses the whole matrix B and a slice (some rows of A) to compute a slice of C.

While writing the code, I discovered that clCreateBuffer creates a buffer which is shared among all those devices. However, I have two questions I was not able to answer.

1) How is allocation on a particular device correlated with the call to clCreateBuffer? Does the allocation happen as soon as the function is called, when the buffer is initialized, or when the content of the buffer is first used by a kernel? Since I'm using a discrete GPU and an integrated one, it would be nice to allocate a buffer for B in pinned host memory, so that a copy is (or may be) allocated on the discrete card, but no allocation is made on the integrated GPU (it would be better to rely on memory sharing to access B without a copy). Do the current drivers implement smart strategies like this even in a context with both integrated and discrete GPUs?

2) What is the behaviour of buffer initialization? Do I have to call clEnqueueWriteBuffer (or clEnqueueMapBuffer + write) for each device, or is it enough to do it once (so that all the devices will see the content put into the buffer)?
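For reference, the pinned-host-memory setup described in question 1 is typically requested like this. This is only a sketch; all identifiers (ctx, queue0, host_B, n) are placeholders, error handling is omitted, and whether a single write suffices for all devices is exactly the question being asked.

```c
#include <CL/cl.h>

/* Create one context-wide buffer for matrix B in pinned host memory and
   write it once from the host (ctx, queue0, host_B, n are placeholders). */
cl_mem create_shared_B(cl_context ctx, cl_command_queue queue0,
                       const float *host_B, size_t n)
{
    cl_int err;
    /* CL_MEM_ALLOC_HOST_PTR asks the runtime for pinned host memory,
       which integrated devices can often access with zero copy. */
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                              n * sizeof(float), NULL, &err);
    /* One blocking write through any queue of the context. */
    err = clEnqueueWriteBuffer(queue0, b, CL_TRUE, 0, n * sizeof(float),
                               host_B, 0, NULL, NULL);
    return b;
}
```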

Thank you very much!

23 Replies

A buffer is created on a device when it is first accessed by a kernel. Where exactly it is stored is stated in the AMD OpenCL Programming Guide. You need to enqueue only one write. You also need to synchronize access to the buffer with events if kernels change its content.


Thank you very much for your answer.

Anyway, I think the OpenCL API is actually misleading: why do the enqueue/map buffer functions require a device id if the buffer is replicated/shared among all the devices in the context? I think those functions should have a signature similar to clCreateBuffer, where only the context is required. Am I wrong?


OK, probably if a device changes the content of the buffer I can't retrieve that content through another device (without synchronization). In that case the device id for reading/writing a buffer makes sense.


Sorry, I meant the command queue which, if I'm not wrong, is uniquely associated with a device. Therefore, if I have a buffer shared between two devices and I want the host to initialize it, it makes no difference which command queue (the one associated with the first device, or the one associated with the second) I use to enqueue a buffer write. Right?


I don't know for sure. From the OpenCL documentation, as far as I understand, it looks like you should enqueue map/write commands to both queues. See:


Yes, you need to enqueue only one read/write when you share a buffer. Just be sure to set the proper CL_MEM_READ/WRITE flag, as OpenCL uses this flag to determine which buffers need to be synchronized across devices.

So when a buffer is shared, use any queue whose device will use that buffer. When a buffer is device-specific, use that device's queue. There is also cl_ext_migrate_memobject, with which you can explicitly state on which device you want the buffer.
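The cl_ext_migrate_memobject extension mentioned above can be sketched roughly as follows. This is an assumption-laden illustration: queue_gpu1 and buf are placeholders, the extension must actually be supported by the platform, and extension entry points are normally fetched at run time via clGetExtensionFunctionAddress rather than linked directly.

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Explicitly migrate a buffer to the device behind queue_gpu1
   (placeholders; availability of the extension is assumed). */
void migrate_to(cl_command_queue queue_gpu1, cl_mem buf)
{
    /* Flags 0 = migrate to the queue's own device. OpenCL 1.2 later made
       this core as clEnqueueMigrateMemObjects with the same overall shape. */
    clEnqueueMigrateMemObjectEXT(queue_gpu1, 1, &buf, 0, 0, NULL, NULL);
}
```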


nou wrote:

Yes, you need to enqueue only one read/write when you share a buffer. Just be sure to set the proper CL_MEM_READ/WRITE flag, as OpenCL uses this flag to determine which buffers need to be synchronized across devices.

So when a buffer is shared, use any queue whose device will use that buffer. When a buffer is device-specific, use that device's queue. There is also cl_ext_migrate_memobject, with which you can explicitly state on which device you want the buffer.

I wonder, why is the example in the SDK creating a queue and enqueueing buffers for each device then?

SimpleMultiDevice.cpp, see the loops: for(int i = 0; i < numGPUDevices; i++)

Also the map function is used per device in BinomialOptionMultiGPU example.

Perhaps it is a bug if a single enqueue command causes buffers to be written to all the devices. Otherwise it makes no sense to create a "queue" per device and enqueue a buffer copy to each "queue" if all the operations are executed on all devices anyway...

or am I missing something?


OK, I confirm that only one write has to be done by the host. However, for some memory-flag combinations the initialization is not propagated among the devices.
For example, if the two devices are the integrated GPU (AMD 6550D) and the discrete one (HD 5870), I obtain initialization sharing only with ALLOC_HOST_PTR and buffer map/unmap.

If the two devices are a CPU and a GPU, also CL_MEM_USE_PERSISTENT_MEM_AMD works.

I'll check whether there are bugs in my program.
For now, let me ask: are there any requirements on buffer memory flags for buffer writes (by the host) to be propagated to all the devices in the shared context?

Thank you very much!


I'm performing a vector addition (y = ax + y) and I'm trying to allocate x and y so that they are shared between two devices.
I verified that buffer sharing works only when using ALLOC_HOST_PTR. In the other cases (no flags, CL_MEM_USE_PERSISTENT_MEM_AMD) it works only for some vector sizes.

Really strange.

Maybe I'm doing something wrong in synchronizing the executions on the devices.
Here is the relevant part of the host code:

cl_command_queue default_queue = pf_data.device_settings[0]->Queue()->Queue();
cl_int err;

// alloc vectors
x_pointer = new float[whole_data_size];
y_pointer = new float[whole_data_size];

// context is shared; src_flags and dst_flags are the mem flags for buffers x and y
x_buffer = clCreateBuffer(context, src_flags, whole_data_size * sizeof(float), NULL, &err);
y_buffer = clCreateBuffer(context, dst_flags, whole_data_size * sizeof(float), NULL, &err);

// init data (default_queue is the command queue of the first device)
x_pointer = (float*)clEnqueueMapBuffer(default_queue, x_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);
y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);

// initVectorPair initializes the content of the two vectors in a fixed way
initVectorPair(x_pointer, y_pointer, whole_data_size);

err |= clEnqueueUnmapMemObject(default_queue, x_buffer, x_pointer, 0, NULL, NULL);
err |= clEnqueueUnmapMemObject(default_queue, y_buffer, y_pointer, 0, NULL, NULL);

// (alternative to map/unmap: explicit writes)
err |= clEnqueueWriteBuffer(default_queue, x_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), x_pointer, 0, NULL, NULL);
err |= clEnqueueWriteBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);

// set arguments and launch a slice on each device
unsigned int offset = 0;

for(unsigned int i = 0; i < devices.size(); i++) {

    // a data structure stores the global and local sizes for each device
    unsigned int global_size = devices[i]->GlobalSizes()[0];
    unsigned int local_size = devices[i]->LocalSizes()[0];
    cl_kernel kernel = devices[i]->Kernel()->Kernel();

    err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&x_buffer);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&y_buffer);
    err |= clSetKernelArg(kernel, 2, sizeof(float), (void*)&a);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), (void*)&offset);
    err |= clSetKernelArg(kernel, 4, sizeof(unsigned int), (void*)&global_size);

    err |= clEnqueueNDRangeKernel(devices[i]->Queue()->Queue(), kernel, 1, NULL, &global_size, &local_size, 0, NULL, &completion_events[i]);
    err |= clFlush(devices[i]->Queue()->Queue());

    // the offset of the next device is the first element not in the range computed by the current device
    offset += data_sizes[i];
}

for(unsigned int i = 0; i < devices.size(); i++)
    err |= clFinish(devices[i]->Queue()->Queue());

// read back the result (again, either map or an explicit read)
y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_READ, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);
err |= clEnqueueReadBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);


It seems I'm getting some coherency problems similar to those reported here:


cadorino wrote:

It seems I'm getting some coherency problems similar to those reported here:

See the examples BinomialOptionMultiGPU and SimpleMultiDevice in the SDK samples. There is a queue per device and buffers are copied per device...


See Appendix A of the OpenCL specification.

You must get an event object from the queue which modifies the buffer, call clFlush/clFinish, and then in the second queue pass the event from the first queue as a wait event.
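The cross-queue dependency described above can be sketched like this. All names (queue_a, queue_b, kernel_a, kernel_b, gsz) are placeholders and error handling is omitted; this only illustrates the event/flush pattern, not a complete program.

```c
#include <CL/cl.h>

/* Device A produces into a shared buffer; device B consumes it only
   after A's kernel completes (queues/kernels are placeholders). */
void produce_then_consume(cl_command_queue queue_a, cl_command_queue queue_b,
                          cl_kernel kernel_a, cl_kernel kernel_b, size_t gsz)
{
    cl_event done_a;
    /* Device A writes the shared buffer and hands back an event. */
    clEnqueueNDRangeKernel(queue_a, kernel_a, 1, NULL, &gsz, NULL,
                           0, NULL, &done_a);
    clFlush(queue_a);  /* ensure the command is actually submitted */
    /* Device B's kernel waits on A's event before reading the buffer. */
    clEnqueueNDRangeKernel(queue_b, kernel_b, 1, NULL, &gsz, NULL,
                           1, &done_a, NULL);
    clFinish(queue_b);
    clReleaseEvent(done_a);
}
```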


Event objects only contain information about the status of queued commands. clWaitForEvents is not associated with queues; it simply waits until the event(s) are completed. (I don't see how that helps with sharing data, except from a synchronization standpoint.) The reference manual says events should be used for synchronizing operations which 'modify' the buffers (so you won't be writing data from one device to a buffer while reading from another). Actually, the appendix says that you don't only have to enqueue reads/writes, you also have to make sure that they are synchronized in a non-conflicting way.

Appendix A says you can share the memory object, but it does not say that it will be automatically copied to all the devices. A memory object (buffer) is independent of any queue; it is associated with the context (and therefore can be shared within the same context). But you still have to "share" the contents of the memory between devices by enqueuing the right read/write commands.

So, to wrap it up: when you create a shared buffer object with clCreateBuffer, you can use the same object in multiple command queues, but you must still issue clEnqueue[Read|Write]Buffer calls on each queue (device), because data won't be copied/shared automatically between all devices within the same context. This is exactly what the OpenCL SDK samples do as well...


Well, I saw some examples, but I thought that was not the only way to use shared contexts and buffers.
I tried to write a vector addition algorithm with one context, one program, and multiple queues, but only one initialization/result read (using the queue associated with the first device).

Result: this works, but only with some memory flags for the buffers. ALLOC_HOST_PTR is the only reliable flag.
No flags (allocation on the device) works for 2 GPUs but not for CPU + GPU combinations.
The flag CL_MEM_USE_PERSISTENT_MEM_AMD works "randomly" (only for some vector sizes) on CPU + GPU, but it doesn't work at all for 2 GPUs.

OK, buffers can be initialized and read per device of the context, but that is a really strange usage: if I have to enqueue a write for each device to initialize shared buffers, and enqueue a read for each device to retrieve results stored in shared buffers, why are these buffers called "shared"? I could do the same with separate contexts and buffers for each device. So I wonder: what is the key role of contexts created by specifying multiple devices?


As far as I understand, this is because the buffer object is shared within the context, so you can synchronize reads/writes to it (see page 15 on contexts in the pdf mentioned below). When you have separate contexts, you can't synchronize memory operations, for example (event objects are not shared...).

Let's look at an example: think about an object with read/write access by the devices. The 1st device updates value X and the 2nd device updates value X too. How do you propose OpenCL automatically synchronize these operations? How would OpenCL know whether you wanted the 1st device to work on X first and then the 2nd device to work on the updated X, or the other way around?

"The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."

So you should use events to make sure that your queues do not read/write at the same time, unless that is what you want. First you queue read/write commands to the 1st device and tell the command in the queue of the 2nd device to wait for the events of the 1st device to finish. Then you can read the updated value on the 2nd device and work on it, etc.

I suspect that when you map a memory area to a device, for certain memory types, the device copies the results back even before you queue the command to write them (or it updates values in place if zero-copy is involved), so that is why it sometimes appears to function with certain types. I myself had a similar problem where an OpenCL program worked fine on the GPU but not on the CPU, and, well, I had wrong stuff in my code.


Thank you for spending so much time on a detailed answer.

Still, someone (like you) tells me that initialization must be done for each device that shares the buffer, while someone else, like nou, tells me that one map/enqueueWrite is enough.

I don't know the right answer. What I know is that vector addition (and also matrix multiplication) works fine with only one map/enqueueWrite if I allocate the shared buffers with ALLOC_HOST_PTR. If it were not intended for the initialization to be propagated among the context devices, the correctness of these computations would be a sort of miracle.

Access mode (read/write) is an interesting suggestion. Right now the multiplied matrices are READ_ONLY and the computed matrix is WRITE_ONLY. This afternoon I'll try giving the kernels read/write access to all the buffers and check what happens.


The solution is simple: there is a tie breaker. Follow the examples in the SDK. The MultiGPU examples enqueue a copy of the data per device. At least I would hope that the people who wrote the examples know something about OpenCL and did not perform unnecessary operations.

About your program: it would not be a miracle if it worked even though it is not correct. But did you check that you are getting exactly the same results every time you run it? (I would run it several times and compare "all" the results.) The program merely executing without crashing does not mean that it is giving right answers.

It is not a miracle because, if you are using a CPU with host pointer or when you use CL_MEM_USE_PERSISTENT_MEM_AMD, the data is not copied anyway.  See:

See 4-11

This means that if you enqueue reads/writes, they will do nothing anyway (so in a sense this gives you the illusion that they are not necessary), but you still have to make sure that multiple devices do not write to the same area at the same time by using event objects, which means you should still enqueue them. Otherwise your program will work only on some devices, using some SDKs.

So, technically speaking, your program might accidentally work for now; this does not mean that it is portable or compatible with other devices and SDKs. When you run it on another SDK which does something differently, it will simply not work, crash, or give wrong results (if it is not giving wrong results already), or it will stop working with the next version of the SDK.


There is a bug when you mix CPU and GPU devices.

When you specify a data dependency between kernel invocations, you should get coherent data.


I don't understand your point. It is not a confirmed bug; it is possible that there is an error in his program as well.

Can you provide code which shows this behavior?


My point is that the AMD guy said it may be a bug, so he is expecting that buffers are synchronized between devices. Also, cadorino said that it worked for two GPUs with a buffer created without flags. No flags means CL_MEM_READ_WRITE, so the buffer resides in device memory and it must be intentionally copied to return correct results.

try this code:

#include <iostream>
#include <fstream>
#include <cstring>
#include <cstdlib>
#include <vector>
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <CL/cl.hpp>

using namespace std;

const char * helloStr  = "//#pragma OPENCL EXTENSION cl_ext_atomic_counters_32 : enable\n"
                         "__kernel void hello(__global int *a,\n"
                         "                    __global int *b,\n"
                         "                    __global int *c)\n"
                         "{\n"
                         "    size_t id = get_global_id(0);\n"
                         "    c[id] = a[id] + b[id];\n"
                         "}\n";

int main()
{
    cl_int err;
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    if (platforms.size() == 0)
    {
        std::cout << "Platform size 0\n";
        return -1;
    }

    cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
    cl::Context context(CL_DEVICE_TYPE_ALL, properties);
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    for(auto i = devices.begin(); i != devices.end(); i++)
        cout << i->getInfo<CL_DEVICE_NAME>() << endl;

    cl::Program::Sources source(1, std::make_pair(helloStr, strlen(helloStr)));
    cl::Program program_ = cl::Program(context, source);
    program_.build(devices);
    cl::Kernel kernel(program_, "hello", &err);

    int *dataA  = new int[100000];
    int *dataB  = new int[100000];
    int *dataB2 = new int[100000];
    int *dataC  = new int[100000];

    for(int i = 0; i < 100000; i++) dataA[i] = rand() % 10000;
    cl::Buffer bufA(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataA);
    for(int i = 0; i < 100000; i++) dataB[i] = rand() % 10000;
    cl::Buffer bufB(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataB);
    for(int i = 0; i < 100000; i++) dataB2[i] = rand() % 10000;
    cl::Buffer bufB2(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataB2);
    cl::Buffer bufC(context, CL_MEM_READ_WRITE, 400000);
    cl::Buffer bufC2(context, CL_MEM_READ_WRITE, 400000);

    // reference result computed on the host
    for(int i = 0; i < 100000; i++) dataC[i] = dataA[i] + dataB[i] + dataB2[i];

    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);

    cl::Event event, event2;
    cl_command_queue_properties queue_prop = 0;
    cl::CommandQueue queue1(context, devices[0], queue_prop, &err);
    cl::CommandQueue queue2(context, devices[1], queue_prop, &err);

    // first kernel on device 0: bufC = bufA + bufB
    queue1.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(100000), cl::NullRange, NULL, &event);
    queue1.flush();

    // second kernel on device 1 waits on the first: bufC2 = bufC + bufB2
    vector<cl::Event> event_list;
    event_list.push_back(event);
    kernel.setArg(0, bufC);
    kernel.setArg(1, bufB2);
    kernel.setArg(2, bufC2);
    queue2.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(100000), cl::NullRange, &event_list, &event2);

    // read the final result back, waiting on the second kernel
    event_list.clear();
    event_list.push_back(event2);
    queue1.enqueueReadBuffer(bufC2, CL_TRUE, 0, 400000, dataA, &event_list);

    for(int i = 0; i < 100000; i++)
    {
        if(dataA[i] != dataC[i])
            cout << "ERROR " << i << " it should be " << dataC[i] << " result is " << dataA[i] << endl;
    }

    /*for(int i = 0; i < 10; i++)
        cout << i << " " << dataC[i] << " " << dataA[i] << endl;*/

    return 0;
}


It runs two kernels on two devices, and it works correctly on my system.


You are right, it works... even if one uses CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR... even when bufC and bufC2 are set as READ_ONLY. Everything works... but I am not convinced that this is a feature, or that synchronization is guaranteed 100%.

Thanks for the code. It is very interesting!!!