I am trying to understand the mechanics of OpenCL memory access and transfers (in particular on AMD Ryzen V1000 embedded systems coming with Zen cores and an embedded Vega GPU), with the motivation of wanting to develop a high throughput streaming application that performs calculations on the GPU. The data to be processed reaches several GB/s, so high throughput to and from the GPU is of key importance and the current approach spends more than 50% of the time in memory transfers (hence the wish to optimize this).
Specifically, I wanted to understand better if memory copying between CPU memory and GPU memory is always needed in OpenCL, even on embedded systems where CPU and GPU share the same physical memory (either explicitly via clEnqueueWriteBuffer, clEnqueueReadBuffer, or implicitly via clEnqueueMapBuffer)? Although the ROCm OpenCL optimization guide lists a multitude of approaches to "zero copy" buffers (sections 126.96.36.199 and 188.8.131.52), these all involve copying the data from or to CPU/GPU memory, if I am not mistaken? Should one not be able to skip this step on embedded systems where CPU+GPU access the same physical memory (and there is no physically separate GPU memory)? And if a copy of the data has to be made even on embedded GPUs, why is this the case? Because even those are internally wired up through PCIe and CPU+GPU caches must be kept in sync (just a wild guess)?
Finally, what are the peak data rates of CPU-GPU interconnects to be expected on modern systems like the AMD Ryzen V1000 embedded systems (i.e. with an embedded Vega GPU)?
Any information and further insights on this issue are highly appreciated, because I have found quite a bit of conflicting information regarding zero-copy buffers. Thanks!
On an APU, the system memory is physically shared between the GPU and the CPU. However, a buffer is visible to only one of them, either the CPU or the GPU, at any given time. During map/unmap operations, no physical data transfer happens for zero-copy objects; the operation only moves the buffer logically between the CPU and GPU address spaces. A kernel running on the GPU can access the zero-copy data directly, and the access bandwidth is generally much higher than on dGPUs.
One important point to note: even though the physical memory is shared, the memory access path can differ between the CPU and the GPU, so no CPU-GPU synchronization is guaranteed other than at the memory synchronization points defined by the OpenCL standard. Hence, neither the host nor the device should access the data while the other one is using it.