6 Replies Latest reply on Mar 26, 2014 11:48 PM by amd_support

    AMD Driver, OpenCL Memory Model

    stevenovakov

      Hi All,

       

      I'm wondering if someone from the driver team would be able to clear something up for me:

       

           - What is the sequence of events from instantiating a buffer host-side, to copying data to that buffer, to it being used for computations on the device, to it being read, from a "where is the data" perspective?

       

      I have read Section 3.3 of CL 1.2, but I'm still not clear on it. I'll highlight what I think is happening, and this will show what I mean by "where is the data":

           1) I have some data in a container host-side

            2) I create a buffer. This buffer is instantiated host-side, (in host memory)

            3) I call enqueueWriteBuffer
      // THIS IS WHAT I"M UNSURE ABOUT

           4) enqueueWriteBuffer literally just does a mem copy HOST SIDE?

           5) I call enqueueNDRangeKernel, after kernel->setArg(that buffer)

            6) THE GRAPHICS DRIVER?   handles sending the instruction set, and copying all "cl::Buffer" objects in host memory to available blocks on the device

            7) Kernel Executes

            8) Kernel finishes executing and the GRAPHICS DRIVER copies the driver side buffers back host side (is this what is considered "done", for a CommandQueue:finish block?)

      //

           9) We call enqueueReadBuffer on all of the buffers and copy back to the original containers.


      See relevant discussion:
      c++ - OpenCL Buffer Instantiation in a Multi Device Environment - Stack Overflow

       

      If it hasn't been already, and I've just missed it, would it maybe be possible to highlight this somewhere in a nice little picture or something on your developer site? Thanks!

       

      Steve

        • Re: AMD Driver, OpenCL Memory Model
          Dithermaster

          When you allocate the buffer (step 2) it is allocating it on the GPU.

          When you call clEnqueueWriteBuffer you are asking the OpenCL runtime to copy host memory to device memory. This is a queued action and you can't free or re-use the host memory until the action completes (clFinish, or use an event, use a blocking write, or wait for a blocking command to finish).

          When you call clEnqueueNDRangeKernel you are asking the runtime to run the kernel. It won't start until the WriteBuffer command finishes, because the command queue is in-order.

          When you call clEnqueueReadBuffer you are asking the OpenCL runtime to copy device memory to host memory. This is a queued action and you shouldn't use the host memory until the action completes (clFinish, or use an event, use a blocking write, or wait for a blocking command to finish). The ReadBuffer won't start until the kernel is done because the command queue is in-order.

           

          Does that improve your understanding?

            • Re: AMD Driver, OpenCL Memory Model
              stevenovakov

              So all of that was my original understanding as well from reading the opencl spec way back when (year+ ago). And yet, It does not explain the following:

               

                this->buffer = new

                  cl::Buffer(

                    this->ocl_context,

                    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,

                    mem_size,

                    this->vector.data(),

                    NULL

                  );


              No command queue was specified here, no enqueueWriteBuffer is ever called on this buffer.  If it's as you say it is, then when I enqueueNDRangeKernel, my kernel would simply operate on garbage data in the allocated space, and yet, it does not. Somewhere, this data was copied to all devices. Also, I'm fairly sure the graphics driver "allocates on first use", and never before.

                • Re: AMD Driver, OpenCL Memory Model
                  igor420

                  Cheers stevenovakov, this helped me loads. I checked a bunch of AMD sites and i was beginning to pull my hair out.

                  • Re: AMD Driver, OpenCL Memory Model
                    amd_support

                    Hi Stave,

                        Your understanding that the buffer is "allocated on first use" is correct.

                        When multiple devices are attached to one OpenCL context, it is OpenCL runtime's job to ensure data consistency of a buffer associated with this context across multiple devices. In the cases where devices have different physical memories (as in case of discrete GPUs), openCL runtime allocates (this allocation is done on first use) space on different devices for same OpenCL buffer. The run-time maintains which device has updated the OpenCL buffer last, and copies the buffer to other devices based on their need. The run-time algorithms take care that such data copy across devices is minimized, and take place concurrently when devices are busy executing kernels.

                        Hope this helps.

                     

                    AMD Support.

                    1 of 1 people found this helpful
                      • Re: AMD Driver, OpenCL Memory Model
                        stevenovakov

                        Awesome, but before I call it case closed, would you mind just confirming whether all, (or if not all, then which), of the cl::CommandQueue::  methods count as "use", in the "allocate on first use" paradigm, from the POV of the runtime/driver?

                          • Re: AMD Driver, OpenCL Memory Model
                            amd_support

                            Hi Stave,

                                It is not known how OpenCL run-time internally manages dirty and clean buffers on various devices and how and which high level APIs, like methods of cl::CommandQueue use them. However it makes little difference to application programmer. The only point that could be made is that run-time ensures data consistency of an OpenCL buffer across all the devices in a context, and in the interest of speed, takes care to transfer data as and when needed, and if possible concurrent to kernel execution.

                                Please let me know if this answers your query.

                             

                            Thanks,

                            AMD Support