23 Replies Latest reply on Mar 9, 2012 12:42 PM by nou

    Initialize buffers within a context with multiple devices

    cadorino

      Hi.

      I want to multiply two matrices, doing C = A * B.
      Now, I have three OpenCL devices in my context (an AMD CPU and two AMD GPUs, one integrated and one discrete) and I'd like to split the computation and data among all of them.
      The target is that each device uses the whole matrix B and a slice (some rows of A) to compute a slice of C.

       

      While writing the code, I discovered that clCreateBuffer creates a buffer which is shared among all those devices. However, there are two questions I was not able to answer.

       

      1) How is allocation on a particular device correlated with the call to clCreateBuffer? Does the allocation happen as soon as the function is called, when the buffer is initialized, or only when the content of the buffer is first used by a kernel? Since I'm using a discrete GPU and an integrated one, it would be nice to allocate the buffer for B in pinned host memory, so that a copy is (or may be) allocated on the discrete card, but no allocation is made on the integrated GPU (it would be better to rely on memory sharing to access B without a copy). Do the current drivers implement smart strategies like this even in a context with both integrated and discrete GPUs?

       

      2) What is the behaviour of buffer initialization? Do I have to do clEnqueueWriteBuffer (or clEnqueueMapBuffer + write) for each device, or is it enough to do it once (so that all the devices will see the content put into the buffer)?

       

      Thank you very much!

        • Initialize buffers within a context with multiple devices
          nou

          Buffers are created on a device when they are first accessed by a kernel. Where exactly a buffer is stored is stated in the AMD OpenCL Programming Guide. You only need to enqueue one write. You also need to synchronize access to a buffer with events if kernels change its content.

          • Initialize buffers within a context with multiple devices
            cadorino

            Ok, I confirm that only one write has to be done by the host. However, for some memory-flag combinations the initialization is not propagated among the devices.
            For example, if the two devices are the integrated GPU (AMD 6550D) and the discrete one (HD 5870), I obtain the initialization sharing only with CL_MEM_ALLOC_HOST_PTR and buffer map/unmap.

            If the two devices are a CPU and a GPU, also CL_MEM_USE_PERSISTENT_MEM_AMD works.


            I'll check whether there are some bugs in my programs.
            For now I ask you: are there any requirements on buffer memory flags to allow buffer writes (by the host) to be propagated to all the devices in the shared context?

             

            Thank you very much!

              • Re: Initialize buffers within a context with multiple devices
                cadorino

                I'm performing a vector addition (y = a*x + y) and I'm trying to allocate x and y so that they are shared between two devices.
                I verified that buffer sharing works only when using CL_MEM_ALLOC_HOST_PTR. In the other cases (no flags, CL_MEM_USE_PERSISTENT_MEM_AMD) it works only for some vector sizes.

                Really strange.

                 

                Maybe I'm doing something wrong in synchronizing the executions on devices.
                I show you the relevant part of the host code:

                 

                cl_command_queue default_queue = pf_data.device_settings[0]->Queue()->Queue();
                cl_int err;

                // allocate host-side vectors
                if(!map_src)
                    x_pointer = new float[whole_data_size];
                if(!map_dst)
                    y_pointer = new float[whole_data_size];

                // the context is shared; src_flags and dst_flags are the mem flags for buffers x and y
                x_buffer = clCreateBuffer(context, src_flags, whole_data_size * sizeof(float), NULL, &err);
                y_buffer = clCreateBuffer(context, dst_flags, whole_data_size * sizeof(float), NULL, &err);

                // initialize data (default_queue is the command queue of the first device)
                if(map_src)
                    x_pointer = (float*)clEnqueueMapBuffer(default_queue, x_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);
                if(map_dst)
                    y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_WRITE, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);

                // initVectorPair initializes the content of the two vectors in a fixed way
                initVectorPair(x_pointer, y_pointer, whole_data_size);

                if(map_src)
                    err |= clEnqueueUnmapMemObject(default_queue, x_buffer, x_pointer, 0, NULL, NULL);
                else
                    err |= clEnqueueWriteBuffer(default_queue, x_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), x_pointer, 0, NULL, NULL);
                if(map_dst)
                    err |= clEnqueueUnmapMemObject(default_queue, y_buffer, y_pointer, 0, NULL, NULL);
                else
                    err |= clEnqueueWriteBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);

                // set arguments and launch one kernel per device
                unsigned int offset = 0;
                for(unsigned int i = 0; i < devices.size(); i++) {
                    // a data structure stores the global and local sizes for each device
                    unsigned int global_size = devices[i]->GlobalSizes()[0];
                    unsigned int local_size = devices[i]->LocalSizes()[0];

                    cl_kernel kernel = devices[i]->Kernel()->Kernel();
                    err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&x_buffer);
                    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&y_buffer);
                    err |= clSetKernelArg(kernel, 2, sizeof(float), (void*)&a);
                    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), (void*)&offset);
                    err |= clSetKernelArg(kernel, 4, sizeof(unsigned int), (void*)&global_size);

                    // clEnqueueNDRangeKernel expects size_t*, not unsigned int*
                    size_t gsize = global_size, lsize = local_size;
                    err |= clEnqueueNDRangeKernel(devices[i]->Queue()->Queue(), kernel, 1, NULL, &gsize, &lsize, 0, NULL, &completion_events[i]);
                    err |= clFlush(devices[i]->Queue()->Queue());

                    // the offset of the next device is the first element not in the range computed by the current device
                    offset += data_sizes[i];
                }

                for(unsigned int i = 0; i < devices.size(); i++)
                    err |= clFinish(devices[i]->Queue()->Queue());

                // read back the result
                if(map_dst)
                    y_pointer = (float*)clEnqueueMapBuffer(default_queue, y_buffer, CL_TRUE, CL_MAP_READ, 0, sizeof(float) * whole_data_size, 0, NULL, NULL, &err);
                else
                    err |= clEnqueueReadBuffer(default_queue, y_buffer, CL_TRUE, 0, whole_data_size * sizeof(float), y_pointer, 0, NULL, NULL);

              • Re: Initialize buffers within a context with multiple devices
                cadorino

                It seems I'm getting some coherency problems similar to those reported here:

                http://devgurus.amd.com/thread/158515

                  • Re: Initialize buffers within a context with multiple devices
                    yurtesen

                    cadorino wrote:

                     

                    It seems I'm getting some coherency problems similar to those reported here:

                    http://devgurus.amd.com/thread/158515

                    See the examples BinomialOptionMultiGPU and SimpleMultiDevice in the SDK samples. There is a queue per device and buffers are copied per device...

                      • Re: Initialize buffers within a context with multiple devices
                        nou

                        See Appendix A of the OpenCL specification.

                        You must get an event object from the queue which modifies the buffer and call clFlush/clFinish, and then in the second queue you must pass the event from the first queue as a wait event.

                          • Re: Initialize buffers within a context with multiple devices
                            yurtesen

                            Event objects only contain information about the status of queued commands. clWaitForEvents is not associated with queues; it simply waits until the event(s) are completed. (I don't see how that helps with sharing data, beyond synchronization.) The reference manual says events should be used for synchronizing operations which 'modify' the buffers (so you won't be writing data to a buffer from one device while reading it from another). Actually, the appendix says that you not only have to enqueue reads/writes, you also have to make sure they are synchronized in a non-conflicting way.

                             

                            Appendix A says you can share the memory object, but it does not say that it will be automatically copied to all the devices. A memory object (buffer) is independent of any queue; it is associated with the context (and can therefore be shared within the same context). But you still have to "share" the contents of the memory between devices by enqueuing the right read/write commands.

                             

                            So, to wrap it up: when you create a shared buffer object with clCreateBuffer, you can use the same object in multiple command queues, but you must still use clEnqueue[Read|Write]Buffer calls on each queue (device), because data won't be copied/shared automatically between all devices within the same context. This is exactly what the OpenCL SDK samples do as well...

                          • Re: Initialize buffers within a context with multiple devices
                            cadorino

                            Well, I saw some examples, but I thought that was not the only way to use shared contexts and buffers.
                            I tried to write a vector addition algorithm with one context, one program, and multiple queues, but only one initialization/result read (using the queue associated with the first device).

                            Result: this works, but only with some memory flags for the buffers. CL_MEM_ALLOC_HOST_PTR is the only reliable flag.
                            No flags (allocation on the device) works for two GPUs but not for CPU + GPU combinations.
                            The flag CL_MEM_USE_PERSISTENT_MEM_AMD works "randomly" (only for some vector sizes) on CPU + GPU, but it doesn't work at all for two GPUs.

                             

                            Ok, buffers can be initialized and read for each device of the context, but that is a really strange usage: if I have to enqueue a write for each device to initialize shared buffers, and enqueue a read for each device to retrieve results stored in shared buffers, why are these buffers called "shared"? I would do the same if I had separate contexts and buffers for each device. Therefore I wonder: what is the key role of contexts created by specifying multiple devices?

                              • Re: Initialize buffers within a context with multiple devices
                                yurtesen

                                As far as I understand, this is because the buffer object is shared within the context, so you can synchronize reads/writes to it (see page 15, on contexts, in the PDF mentioned below). When you have separate contexts, you can't synchronize memory operations, for example (event objects are not shared...).

                                 

                                Let's look at an example. Think about an object with read/write access by the devices: the 1st device updates value X and the 2nd device updates value X too. How do you propose OpenCL should automatically synchronize these operations? How will OpenCL know whether you wanted the 1st device to work on X first, with the 2nd device then working on the updated X, or the other way around?

                                 

                                http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf

                                "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."

                                 

                                So you should use events to make sure that your queues do not read/write at the same time, unless that is what you want. First you enqueue the read/write commands on the 1st device, and tell the command in the queue of the 2nd device to wait for the events of the 1st device to finish. Then you can read the updated value on the 2nd device and work on it, etc.

                                 

                                I suspect that when you map a memory area to a device, for certain memory types, the device copies the results back even before you queue the command to write them (or it updates values in place if zero-copy is involved), so that is why it sometimes appears to function with certain types. I myself had a similar problem where an OpenCL program worked fine on the GPU but not on the CPU, and, well, I had wrong stuff in my code.

                                  • Re: Initialize buffers within a context with multiple devices
                                    cadorino

                                    Thank you for spending so much time on a detailed answer.

                                    Anyway, someone (like you) tells me that initialization must be done for each device that shares the buffer, while someone else, like nou, says that one map/enqueueWrite is enough.

                                    I don't know the right answer. What I know is that vector addition (and also matrix multiplication) works fine with only one map/enqueueWrite if I allocate the shared buffers with CL_MEM_ALLOC_HOST_PTR. If initialization were not meant to be propagated among the context devices, the correctness of these computations would be a sort of miracle.

                                     

                                    Access mode (read/write) is an interesting suggestion. Right now the multiplied matrices are READ_ONLY and the computed matrix is WRITE_ONLY. In the afternoon I'll try to give the kernels read/write access to all the buffers and check what happens.

                                      • Re: Initialize buffers within a context with multiple devices
                                        yurtesen

                                        The solution is simple: there is a tie-breaker. Follow the examples in the SDK. The MultiGPU examples enqueue a copy of the data per device. At least I would hope that the people who wrote the examples know something about OpenCL and did not do unnecessary operations.

                                         

                                        About your program: it would not be a miracle that it works even if it is not correct. But did you check that you are getting the exact same results every time you run it? (I would run it several times and compare "all" the results.) The program simply executing without crashing does not mean that it is giving right answers.

                                         

                                        It is not a miracle because, if you are using a CPU with a host pointer, or when you use CL_MEM_USE_PERSISTENT_MEM_AMD, the data is not copied anyway. See section 4-11 of:

                                        http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf

                                        This means that if you enqueue reads/writes, they will do nothing anyway (so, in a sense, this gives you the illusion that they are not necessary), but you still have to make sure that multiple devices do not write to the same area at the same time by using event objects, which means you should still enqueue them. Otherwise your program will work only on some devices, using some SDKs.

                                         

                                        So, technically speaking, your program might accidentally work for now; this does not mean that it is portable or compatible with other devices and SDKs. When you run it on another SDK which does something differently, it will simply not work, crash, or give wrong results (if it is not giving wrong results already), or it will stop working with the next version of the SDK.

                                          • Re: Initialize buffers within a context with multiple devices
                                            nou

                                            There is a bug when you mix CPU and GPU devices: http://devgurus.amd.com/thread/158515

                                            When you specify a data dependency between kernel invocations, you should get coherent data.

                                              • Re: Initialize buffers within a context with multiple devices
                                                yurtesen

                                                I don't understand your point. It is not a confirmed bug; it is possible that there is an error in his program as well.

                                                Can you provide code which shows this behavior?

                                                  • Re: Initialize buffers within a context with multiple devices
                                                    nou

                                                    My point is that the AMD guy said it may be a bug, so he too expects that buffers are synchronized between devices. Also, cadorino said that it worked for two GPUs with a buffer created without flags. With no flags (which default to CL_MEM_READ_WRITE) the buffer resides in device memory, so it must be intentionally copied to return correct results.

                                                    try this code:

                                                    #include <iostream>
                                                    #include <fstream>
                                                    #include <vector>
                                                    #include <cstring>   // strlen
                                                    #include <cstdlib>   // rand
                                                    #include <CL/cl.h>
                                                    #include <CL/cl_ext.h>
                                                    #include <CL/cl.hpp>
                                                    
                                                    using namespace std;
                                                    
                                                    const char * helloStr  = "//#pragma OPENCL EXTENSION cl_ext_atomic_counters_32 : enable\n"
                                                                              "__kernel void "
                                                                              "hello(__global int *a, "
                                                                              " __global int *b, "
                                                                              "__global int *c"
                                                                              ")\n "
                                                                              "{ "
                                                                              "size_t id = get_global_id(0);"
                                                                              "c[id] = a[id]+b[id]; \n"
                                                                              "} ";
                                                    
                                                    int main()
                                                    {
                                                        cl_int err;
                                                        std::vector<cl::Platform> platforms;
                                                        cl::Platform::get(&platforms);
                                                        if (platforms.size() == 0)
                                                        {
                                                            std::cout << "Platform size 0\n";
                                                            return -1;
                                                        }
                                                    
                                                        cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0};
                                                    
                                                        cl::Context context(CL_DEVICE_TYPE_ALL, properties);
                                                    
                                                    
                                                        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
                                                    
                                                        for(auto i = devices.begin(); i!=devices.end();i++)
                                                        cout << i->getInfo<CL_DEVICE_NAME>() << endl;
                                                    
                                                        cl::Program::Sources source(1, std::make_pair(helloStr, strlen(helloStr)));
                                                        cl::Program program_ = cl::Program(context, source);
                                                        program_.build(devices);
                                                    
                                                        cl::Kernel kernel(program_, "hello", &err);
                                                    
                                                        int *dataA = new int[100000];
                                                        int *dataB = new int[100000];
                                                        int *dataB2 = new int[100000];
                                                        int *dataC = new int[100000];
                                                        for(int i=0;i<100000;i++)dataA[i] = rand()%10000;
                                                        cl::Buffer bufA(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataA);
                                                        for(int i=0;i<100000;i++)dataB[i] = rand()%10000;
                                                        cl::Buffer bufB(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataB);
                                                        for(int i=0;i<100000;i++)dataB2[i] = rand()%10000;
                                                        cl::Buffer bufB2(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 400000, dataB2);
                                                        cl::Buffer bufC(context, CL_MEM_READ_WRITE, 400000);
                                                        cl::Buffer bufC2(context, CL_MEM_READ_WRITE, 400000);
                                                    
                                                        for(int i=0;i<100000;i++)dataC[i] = dataA[i]+dataB[i]+dataB2[i];
                                                    
                                                        kernel.setArg(0, bufA);
                                                        kernel.setArg(1, bufB);
                                                        kernel.setArg(2, bufC);
                                                    
                                                        cl::Event event, event2;
                                                        cl_command_queue_properties queue_prop = 0;
                                                        cl::CommandQueue queue1(context, devices[0], queue_prop, &err);
                                                        cl::CommandQueue queue2(context, devices[1], queue_prop, &err);
                                                        queue1.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(100000), cl::NullRange, NULL, &event);
                                                    
                                                        vector<cl::Event> event_list;
                                                        event_list.push_back(event);
                                                        kernel.setArg(0, bufC);
                                                        kernel.setArg(1, bufB2);
                                                        kernel.setArg(2, bufC2);
                                                        queue2.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(100000), cl::NullRange, &event_list, &event2);
                                                    
                                                        event_list.clear();
                                                        event_list.push_back(event2);
                                                        queue1.enqueueReadBuffer(bufC2, CL_TRUE, 0, 400000, dataA, &event_list);
                                                        //queue1.finish();
                                                    
                                                        for(int i=0;i<100000;i++)
                                                        {
                                                            if(dataC[i]!=dataA[i])
                                                            {
                                                                cout<<"ERROR " << i << " it should be " << dataC[i] << " result is "<< dataA[i] << endl;
                                                                break;
                                                            }
                                                        }
                                                    
                                                        /*for(int i=0;i<10;i++)
                                                        {
                                                                cout<< i << " " << dataC[i] << " " << dataA[i] << endl;
                                                        }*/
                                                        return 0;
                                                    }
                                                    
                                                    

                                                    It runs two kernels on two devices, and it works correctly on my system.