When you allocate the buffer (step 2), the runtime allocates it on the GPU.
When you call clEnqueueWriteBuffer you are asking the OpenCL runtime to copy host memory to device memory. This is a queued action, so you can't free or re-use the host memory until the action completes (call clFinish, wait on an event, use a blocking write, or wait for a later blocking command to finish).
When you call clEnqueueNDRangeKernel you are asking the runtime to run the kernel. It won't start until the WriteBuffer command finishes, because the command queue is in-order.
When you call clEnqueueReadBuffer you are asking the OpenCL runtime to copy device memory to host memory. This is a queued action, so you shouldn't touch the host memory until the action completes (call clFinish, wait on an event, use a blocking read, or wait for a later blocking command to finish). The ReadBuffer won't start until the kernel is done, because the command queue is in-order.
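The in-order sequence above can be sketched with the C API. This is illustrative only: context, queue, kernel, device_buf, host_src, host_dst, and N are placeholder names for objects assumed to be set up elsewhere.

    /* Non-blocking write: host_src must stay valid until write_done fires. */
    cl_event write_done;
    clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0,
                         N * sizeof(float), host_src, 0, NULL, &write_done);

    /* In-order queue: the kernel will not start before the write finishes,
       even without an explicit event dependency. */
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read (CL_TRUE): safe to touch host_dst once this returns. */
    clEnqueueReadBuffer(queue, device_buf, CL_TRUE, 0,
                        N * sizeof(float), host_dst, 0, NULL, NULL);

    /* host_src may be reused now; wait on write_done if you need it earlier. */
    clWaitForEvents(1, &write_done);
    clReleaseEvent(write_done);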
Does that improve your understanding?
So all of that was my original understanding as well, from reading the OpenCL spec way back when (a year+ ago). And yet, it does not explain the following:
this->buffer = new cl::Buffer(context,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
    size, host_ptr); // size and host pointer as in the original code
No command queue was specified here, and no enqueueWriteBuffer is ever called on this buffer. If it worked the way you describe, then when I call enqueueNDRangeKernel my kernel would simply operate on garbage data in the allocated space, and yet it does not. Somewhere, this data was copied to all devices. Also, I'm fairly sure the graphics driver "allocates on first use", and never before.
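For context, the key behavior in question can be sketched with the equivalent C API call. With CL_MEM_COPY_HOST_PTR the runtime takes its own copy of the host data at buffer-creation time, which is why no explicit write is needed. Names here (context, host_data, N) are placeholders, not the original code:

    cl_int err;
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR |
                                CL_MEM_COPY_HOST_PTR,
                                N * sizeof(float), host_data, &err);
    /* Safe: with CL_MEM_COPY_HOST_PTR the runtime already has its own copy. */
    free(host_data);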
Cheers stevenovakov, this helped me loads. I checked a bunch of AMD sites and i was beginning to pull my hair out.
Your understanding that the buffer is "allocated on first use" is correct.
When multiple devices are attached to one OpenCL context, it is the OpenCL runtime's job to ensure data consistency of a buffer associated with that context across all of those devices. Where devices have distinct physical memories (as with discrete GPUs), the OpenCL runtime allocates space for the same OpenCL buffer on each device, and this allocation is done on first use. The runtime tracks which device last updated the buffer and copies it to the other devices as they need it. The runtime's algorithms minimize such cross-device copies and, where possible, perform them concurrently while the devices are busy executing kernels.
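The cross-device behavior described above can be sketched as follows: one buffer, one context, two devices, and no explicit device-to-device copy issued by the application. All setup names (ctx, queue_dev0, queue_dev1, kernelA, kernelB, global, size) are assumed placeholders:

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* Kernel A writes buf on device 0. */
    cl_event evA;
    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue_dev0, kernelA, 1, NULL, &global, NULL,
                           0, NULL, &evA);

    /* Kernel B reads buf on device 1. The event orders the two kernels;
       the runtime migrates buf to device 1 ("on first use") behind the
       scenes before kernelB runs. */
    clSetKernelArg(kernelB, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue_dev1, kernelB, 1, NULL, &global, NULL,
                           1, &evA, NULL);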
Hope this helps.
Awesome, but before I call it case closed, would you mind just confirming whether all of the cl::CommandQueue:: methods count as "use" in the "allocate on first use" paradigm (and if not all, then which), from the point of view of the runtime/driver?
It is not publicly documented how the OpenCL runtime internally manages dirty and clean copies of a buffer on the various devices, or how high-level APIs such as the methods of cl::CommandQueue interact with them. However, it makes little difference to the application programmer. The one guarantee that can be stated is that the runtime ensures data consistency of an OpenCL buffer across all the devices in a context and, in the interest of speed, transfers data only as and when needed, where possible concurrently with kernel execution.
Please let me know if this answers your query.