You may check the following articles to compare the UNB on Llano and Kabini.
Now, coming to your zero-copy question. The flags CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR can be used to create a host-side zero-copy buffer, whereas the flag CL_MEM_USE_PERSISTENT_MEM_AMD can be used for a device-side zero-copy buffer. Since you want to use your previously allocated memory, CL_MEM_USE_HOST_PTR is the flag to look at, because it tells the runtime to use the memory you pass in rather than allocate its own. For more information, you may check the section "220.127.116.11 Application Scenarios and Recommended OpenCL Paths" in the AMD optimization guide.
The reference lines you mentioned are mainly applicable to discrete GPUs, not to current APUs, where the CPU and GPU share the same physical memory. In the case of a dGPU, an initial data copy from system memory to GPU memory may improve performance, especially when the data is accessed multiple times from the kernel. There, the GPU can access the local copy of the data in GPU memory at a much faster rate compared to system memory, because it avoids subsequent PCIe transfers. However, this extra data copy is not needed in the case of an APU, where the GPU can directly access system memory via different buses (Garlic or Onion). The peak bandwidth depends on the actual data access path.
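The dGPU trade-off above can be put into rough numbers with a simple cost model. This is only a back-of-the-envelope sketch: the function names are hypothetical, it ignores latency, caching and overlap, and any bandwidth figures you plug in are your own assumptions, not measured values for a particular card.

```c
/* Time (seconds) when the data is copied once over PCIe and then read
 * `passes` times from GPU memory. Bandwidths are in GB/s. */
double time_with_copy(double bytes, int passes,
                      double pcie_gbps, double vram_gbps)
{
    double copy_time = bytes / (pcie_gbps * 1e9);
    double read_time = passes * bytes / (vram_gbps * 1e9);
    return copy_time + read_time;
}

/* Time (seconds) when the kernel reads the data in place `passes` times
 * over a single link (PCIe on a dGPU, or the Garlic/Onion path on an APU). */
double time_zero_copy(double bytes, int passes, double link_gbps)
{
    return passes * bytes / (link_gbps * 1e9);
}
```

As an illustration with assumed figures of 8 GB/s for PCIe and 100 GB/s for GPU memory, reading 1 GB four times costs about 0.165 s with the up-front copy versus 0.5 s when every pass goes over PCIe. For a single pass the copy does not pay off (about 0.135 s versus 0.125 s). On an APU there is no PCIe hop to avoid, which is why the extra copy buys nothing there.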
Hope this explanation is helpful for you.