1 Reply Latest reply on Feb 17, 2015 1:27 AM by dipak

    Zero Copy on Kabini


      Hi there,


      I ported part of my application (an optimized HEVC decoder) from CPU to GPU and found that memory copies are a bottleneck on every GPU I tried, from AMD, NVIDIA, and Intel.

      So I would like to apply the zero copy optimization on Kabini.


      My first question: have the buses (the FCL, or "Onion", and the Radeon memory bus, or "Garlic") in the Unified North Bridge (UNB) changed from Llano to Kabini? I know Kaveri now supports OpenCL 2.0, but currently I only have a Kabini on hand.


      Second question: is it possible to achieve zero copy using the CL_MEM_USE_HOST_PTR flag? I have a legacy memory buffer that the application created with malloc(), so I want to use this flag to create a GPU buffer object from it. However, according to the programming guide (rev 2.7, November 2013, "Pre-pinned Buffers"), buffers of type CL_MEM_USE_HOST_PTR stay on the fast path only "as long as they are used only for data transfer, but not as kernel arguments. If the buffer is used in a kernel, the runtime creates a cached copy on the device, and subsequent copies are not on the fast path". So there will still be a copy.

      In my opinion that makes the implementation pointless: if the GPU never touches the buffer, why create a buffer for GPU kernel execution at all?


      Best regards,


        • Re: Zero Copy on Kabini

          Hi Biao,


          You may check the following articles to compare the UNB for Llano and Kabini.




          Now, coming to your zero-copy question: the flags CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR can be used to create a host-side zero-copy buffer, whereas CL_MEM_USE_PERSISTENT_MEM_AMD can be used for a device-side zero-copy buffer. As you want to reuse previously allocated memory, you may check the section "Application Scenarios and Recommended OpenCL Paths" in the AMD optimization guide for more information.

          The lines you quoted mainly apply to discrete GPUs, not to current APUs, where the CPU and GPU share the same physical memory. In the dGPU case, an initial data copy from system memory to GPU memory can improve performance, especially when the kernel accesses the data multiple times: the GPU can then read the local copy in GPU memory much faster than system memory, avoiding repeated PCIe transfers. On an APU this extra copy is not needed; the GPU can access system memory directly via the different buses (Garlic or Onion). The peak bandwidth depends on the actual data access path.


          Hope this explanation helps.