I was wondering what happens on Fusion APU devices (Llano in particular) when an OpenCL context contains both the GPU and the CPU as target devices and a buffer is declared. Specifically, consider the following scenario: Kernel A executes on the CPU device and writes to the buffer, and Kernel B then executes on the GPU device and reads from it.
If a discrete card were used, the CPU and GPU would each have the buffer allocated in their own respective memories, and a copy would take place from the CPU's memory to the GPU's memory after Kernel A completes and before Kernel B begins. Is this also true on a Fusion APU if the buffer is declared with the CL_MEM_ALLOC_HOST_PTR flag on Windows 7, or will there be zero copying in this case?
The AMD APP OpenCL Programming Guide describes zero copy in terms of mapping/unmapping to/from the host. Do I therefore need to use the CPU device's command queue to map the buffer to the host after Kernel A completes, then use the GPU device's command queue to unmap the buffer from the host before calling Kernel B, or will zero copy (and the appropriate flushing of the buffer from the CPU's cache) take place without doing this step?
Because memory is truly shared between the CPU and GPU on APU devices, the "copy" occurs as soon as the memory is written by your kernel (excluding any caching). However, as I understand it, you need to call clEnqueueUnmapMemObject() after your CPU kernel has written to the memory and before accessing it on the GPU, in order to hand control of that memory region over to the GPU (theoretically it may also ensure all relevant cached data is written back to memory).
My understanding is that unmapping is used to unmap previously mapped buffers from the CPU host and it is not used to unmap buffers from devices (including the CPU if it happens to be used as a device). I would agree with this approach if the buffer was previously mapped and the CPU host was writing to the buffer instead of the CPU device.
In the case I originally described, the CPU device is executing a kernel that writes to the buffer, the buffer is never mapped to the host, and the buffer is declared using the CL_MEM_ALLOC_HOST_PTR flag on Windows 7.
I believe that I should be able to just use the proper synchronization methods (i.e. setting up dependencies between kernels) to ensure data consistency between the CPU device and the GPU device, but for performance reasons I'd like to understand what is actually taking place in the background. Does the hardware effectively just flush the CPU's cache and disable further caching of that buffer region before the GPU device begins its kernel, or is some form of copying between memory locations taking place somewhere?
If the memory object is created with CL_MEM_ALLOC_HOST_PTR, the cost of the map and unmap operations is very small and no memory copy occurs. Each time, the GPU goes out to system memory to access the data.
Regarding the cache: if the CL_MEM_READ_ONLY flag is set when creating the memory object, the CPU will not keep the data in its cache. However, if CL_MEM_READ_WRITE is set, the CPU will still go through its cache, and the GPU will use the coherent Onion bus (rather than the non-coherent Garlic bus) to snoop the CPU cache and guarantee coherency.
As long as you correctly set up the event dependencies, the runtime will ensure that the memory is coherent. The way the buffer is set up should mean that it won't copy the data at all. Of course, this means that access from either the GPU or the CPU will be somewhat slower than if the buffer were native to that device, because of the limited coherence (and hence caching) support on Llano; so there may be circumstances where you'd be better off not setting up the buffer that way and instead letting the runtime do what it thinks best.
Currently, zero-copy transfers are set up for CPU (host) to device, not vice versa. When making an OpenCL mem object from a host pointer, it must be declared as CL_MEM_READ_ONLY. Basically, the host does the writing and the device (typically the GPU, but in theory it could be the CPU) can only read from the buffer.
Ah, that's good.
I assumed it was read only from the post here: (http://blogs.amd.com/developer/2011/08/01/cpu-to-gpu-data-transfers-exceed-15gbs-using-apu-zero-copy...)
..but have never experimented with it (use a discrete card)
For peak performance that's true. That flag combination lets the APU use the fast read path from the GPU that doesn't snoop the caches, so you can use the non-coherent GPU caches and make the memory write-combined on the CPU side. It's fast in one direction.