I am trying to understand the mechanics of OpenCL memory access and transfers, in particular on AMD Ryzen V1000 embedded systems (Zen cores with an embedded Vega GPU). My motivation is a high-throughput streaming application that performs its calculations on the GPU. The data to be processed arrives at several GB/s, so high throughput to and from the GPU is critical, and the current approach spends more than 50% of its time in memory transfers, which is what I would like to optimize.
Specifically, I would like to understand whether copying between CPU memory and GPU memory is always needed in OpenCL (either explicitly via clEnqueueWriteBuffer and clEnqueueReadBuffer, or implicitly via clEnqueueMapBuffer), even on embedded systems where the CPU and GPU share the same physical memory. The ROCm OpenCL optimization guide lists a multitude of approaches to "zero copy" buffers (sections 184.108.40.206 and 220.127.116.11), but if I am not mistaken, these all still involve copying the data to or from CPU/GPU memory. Should one not be able to skip this step on embedded systems where the CPU and GPU access the same physical memory and there is no physically separate GPU memory? And if a copy of the data has to be made even on embedded GPUs, why is this the case? Is it because even those are internally wired up through PCIe and the CPU and GPU caches must be kept in sync (just a wild guess)?
Finally, what peak data rates can be expected from the CPU-GPU interconnect on modern systems such as the AMD Ryzen V1000 embedded systems (i.e. with an embedded Vega GPU)?
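For reference, here is my own back-of-the-envelope upper bound, as a minimal Python sketch. It assumes that with shared memory the DRAM itself is the ceiling, with dual-channel DDR4 at 2400 or 3200 MT/s and 64-bit channels; I have not verified these figures for a specific V1000 board. The function names and the 4 GB/s stream rate are made up for illustration.

```python
def peak_dram_bandwidth_gbs(transfer_rate_mts, channels=2, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


def copy_traffic_gbs(stream_rate_gbs, copies_per_byte=2):
    """DRAM traffic caused by copying a stream.

    Each copy reads and writes every byte once, hence the factor of 2.
    copies_per_byte=2 assumes one copy towards the GPU and one copy back.
    """
    return stream_rate_gbs * copies_per_byte * 2


if __name__ == "__main__":
    # Assumed memory configurations, not measured values.
    for rate in (2400, 3200):
        peak = peak_dram_bandwidth_gbs(rate)
        print(f"DDR4-{rate}, dual channel: {peak:.1f} GB/s theoretical peak")

    # Hypothetical stream rate standing in for "several GB/s".
    print(f"Copy traffic at 4 GB/s: {copy_traffic_gbs(4.0):.1f} GB/s")
```

If these assumptions hold, copying a 4 GB/s stream in and out would already consume a sizeable share of the theoretical peak before the kernel touches any data, which is why I would like to avoid the copies altogether.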
Any information or further insight on this issue would be highly appreciated, as I have found quite a lot of conflicting information regarding zero copy buffers. Thanks!