The OpenCL runtime uses deferred allocation: it delays allocating a buffer until first use. So, if initial data is passed with CL_MEM_COPY_HOST_PTR, the runtime has to copy that data into a temporary runtime buffer. The memory is allocated on the device when the device first accesses the resource, and any data that must be transferred to the resource is copied at that time.
The buffer contents are not initialized at creation. If any initialization is required, the application needs to do it explicitly (for example, using CL_MEM_COPY_HOST_PTR).
A typical call sequence may be (assuming a dGPU):
- Create a zero-copy host-visible device buffer (with the flag CL_MEM_USE_PERSISTENT_MEM_AMD); there is a size limit, typically a few MB
- clEnqueueFillBuffer (or run a kernel to fill the device buffer with zeros)
- Run the kernel
- clEnqueueReadBuffer
Please note that the actual steps depend on the exact usage and on the underlying hardware (say, APU or dGPU). It is recommended to run some experiments before choosing an approach.
I would suggest reading section 1.3 (OpenCL Memory Objects) and section 1.4 (OpenCL Data Transfer Optimization) of the AMD OpenCL Optimization Guide, which explain how memory is allocated for buffer objects and the various optimized paths for data transfer. They also describe several application scenarios and the corresponding paths in the OpenCL API that are known to work well on AMD platforms.
Thanks.