
stevenovakov
Adept I

AMD Driver, OpenCL Memory Model

Hi All,

I'm wondering if someone from the driver team would be able to clear something up for me:

     - From a "where is the data" perspective, what is the sequence of events from instantiating a buffer host-side, to copying data into it, to it being used for computation on the device, to it being read back?

I have read Section 3.3 of the CL 1.2 spec, but I'm still not clear on it. I'll outline what I think is happening, which should show what I mean by "where is the data" (a rough code sketch of the same sequence follows the list):

     1) I have some data in a container host-side.

     2) I create a buffer. This buffer is instantiated host-side (in host memory).

     3) I call enqueueWriteBuffer.
// THIS IS WHAT I'M UNSURE ABOUT

     4) enqueueWriteBuffer literally just does a memcpy HOST SIDE?

     5) I call enqueueNDRangeKernel, after kernel->setArg(that buffer).

     6) THE GRAPHICS DRIVER? handles sending the instruction set, and copying all "cl::Buffer" objects in host memory to available blocks on the device.

     7) Kernel executes.

     8) Kernel finishes executing and the GRAPHICS DRIVER copies the device-side buffers back host-side (is this what is considered "done" for a CommandQueue::finish block?).

//

     9) We call enqueueReadBuffer on all of the buffers and copy back to the original containers.
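
To make this concrete, here is roughly the host-side code I have in mind (just a sketch; context, queue, kernel, and n stand in for my actual setup):

    // Sketch of steps 1-9, assuming context, queue, and kernel are already built
    std::vector<float> host_vec(n, 1.0f);                              // 1) data in a host container
    cl::Buffer buf(context, CL_MEM_READ_WRITE, n * sizeof(float));     // 2) create the buffer
    queue.enqueueWriteBuffer(buf, CL_TRUE, 0, n * sizeof(float),
                             host_vec.data());                         // 3)/4) copy host data... to where?
    kernel.setArg(0, buf);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n)); // 5)/6)/7) run the kernel
    queue.finish();                                                    // 8) wait until "done"
    queue.enqueueReadBuffer(buf, CL_TRUE, 0, n * sizeof(float),
                            host_vec.data());                          // 9) copy results back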


See relevant discussion:
c++ - OpenCL Buffer Instantiation in a Multi Device Environment - Stack Overflow

If it hasn't been already, and I've just missed it, would it maybe be possible to highlight this somewhere in a nice little picture or something on your developer site? Thanks!

Steve

Dithermaster
Adept I

When you allocate the buffer (step 2), it is allocated on the GPU.

When you call clEnqueueWriteBuffer you are asking the OpenCL runtime to copy host memory to device memory. This is a queued action, and you can't free or re-use the host memory until the action completes (via clFinish, an event, a blocking write, or waiting for a later blocking command to finish).

When you call clEnqueueNDRangeKernel you are asking the runtime to run the kernel. It won't start until the WriteBuffer command finishes, because the command queue is in-order.

When you call clEnqueueReadBuffer you are asking the OpenCL runtime to copy device memory to host memory. This is a queued action, and you shouldn't use the host memory until the action completes (via clFinish, an event, a blocking read, or waiting for a later blocking command to finish). The ReadBuffer won't start until the kernel is done, because the command queue is in-order.
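
For example, with a non-blocking write you need an event (or a finish) before you can touch the host memory again. A rough sketch, with buf, queue, bytes, and host_ptr as placeholders:

    cl::Event write_done;
    queue.enqueueWriteBuffer(buf, CL_FALSE, 0, bytes, host_ptr, NULL, &write_done);
    // host_ptr must stay valid and unmodified until the copy completes:
    write_done.wait();   // or queue.finish(), or pass CL_TRUE above for a blocking write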

Does that improve your understanding?


So all of that was my original understanding as well, from reading the OpenCL spec way back when (a year+ ago). And yet, it does not explain the following:

  this->buffer = new cl::Buffer(
    this->ocl_context,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
    mem_size,
    this->vector.data(),
    NULL
  );


No command queue was specified here, and no enqueueWriteBuffer is ever called on this buffer. If it works as you describe, then when I call enqueueNDRangeKernel my kernel would simply operate on garbage data in the allocated space, and yet it does not. Somewhere, this data was copied to all devices. Also, I'm fairly sure the graphics driver "allocates on first use", and never before.
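
Concretely, the only thing that ever touches this buffer afterwards is roughly this (a sketch; queue, kernel, and global_size are placeholders for my actual setup):

    // No enqueueWriteBuffer anywhere: CL_MEM_COPY_HOST_PTR already captured
    // this->vector's contents when the buffer was created.
    kernel.setArg(0, *this->buffer);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(global_size));
    queue.finish();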


Cheers stevenovakov, this helped me loads. I checked a bunch of AMD sites and I was beginning to pull my hair out.


Hi Steve,

    Your understanding that the buffer is "allocated on first use" is correct.

    When multiple devices are attached to one OpenCL context, it is the OpenCL runtime's job to ensure data consistency of a buffer associated with that context across the devices. In cases where the devices have different physical memories (as with discrete GPUs), the runtime allocates space on each device for the same OpenCL buffer (this allocation is done on first use). The runtime keeps track of which device updated the buffer last, and copies the buffer to the other devices as they need it. The runtime's algorithms try to minimize such copies across devices and to perform them concurrently while devices are busy executing kernels.
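
    For illustration only, a minimal sketch of the pattern described above (assuming two discrete GPUs in one context; all names here are placeholders):

    cl::Context ctx(devices);                          // devices holds two discrete GPUs
    cl::CommandQueue q0(ctx, devices[0]);
    cl::CommandQueue q1(ctx, devices[1]);
    cl::Buffer buf(ctx, CL_MEM_READ_WRITE, bytes);     // one logical buffer for the whole context

    q0.enqueueWriteBuffer(buf, CL_TRUE, 0, bytes, host_src.data());
    kernel.setArg(0, buf);
    q0.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n));   // device 0 updates buf
    q0.finish();

    // First use of buf on device 1: the runtime allocates space there and
    // copies the latest contents across before the kernel runs.
    q1.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n));
    q1.finish();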

    Hope this helps.

AMD Support.

Awesome, but before I call it case closed, would you mind confirming whether all (or if not all, then which) of the cl::CommandQueue methods count as "use" in the "allocate on first use" paradigm, from the POV of the runtime/driver?


Hi Steve,

    The details of how the OpenCL runtime internally manages dirty and clean buffers on the various devices, and how high-level APIs such as the cl::CommandQueue methods use them, are not documented. However, it makes little difference to the application programmer. The only point that can be made is that the runtime ensures data consistency of an OpenCL buffer across all the devices in a context and, in the interest of speed, transfers data as and when needed, concurrently with kernel execution where possible.

    Please let me know if this answers your query.

Thanks,

AMD Support
