I think the 32 MB limit comes from Table 4.2 in the AMD APP Programming Guide. It applies to regular buffers (not pinned, and usually stored on the device), and the guide is describing the behaviour of clEnqueueMap* there.
But if you want to use DMA, you have to pin the buffer. Pinning usually happens when you use CL_MEM_USE_HOST_PTR. The runtime then does one of three things: it pins the host application's pages directly, it copies them to a temporary pinned buffer for a one-shot transfer, or it transfers them chunk by chunk using DMA and double-buffering. The runtime decides when the transfer happens (mostly on first use). Until you map the buffer, the OpenCL runtime owns your host pointer. Once you map it, you own it and can write to it. When you unmap it, control returns to the OpenCL runtime.
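To make the "chunk by chunk with double-buffering" case a bit more concrete, here is a small host-only model in plain C. It is only a sketch of the idea, not what the AMD runtime actually does: `staged_upload` and `CHUNK_BYTES` are names I made up, and `memcpy` stands in for both the CPU staging copy and the DMA engine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size for illustration; a real runtime picks its own. */
#define CHUNK_BYTES 4096

/* Host-only model of a chunked, double-buffered upload.
 * Returns the number of chunks moved from src to device. */
size_t staged_upload(const unsigned char *src, unsigned char *device, size_t n)
{
    /* Two staging slots standing in for the runtime's pinned buffers. */
    static unsigned char staging[2][CHUNK_BYTES];
    size_t chunks = 0;

    for (size_t off = 0; off < n; off += CHUNK_BYTES, ++chunks) {
        size_t len = (n - off < CHUNK_BYTES) ? (n - off) : CHUNK_BYTES;
        unsigned char *slot = staging[chunks % 2]; /* alternate between slots */

        /* CPU copies pageable application data into the staging slot. */
        memcpy(slot, src + off, len);

        /* "DMA" drains the slot to the device. In real hardware this step
         * would run while the CPU fills the other slot; here the two steps
         * simply run one after the other. */
        memcpy(device + off, slot, len);
    }
    return chunks;
}
```

The point of having two slots is that the CPU copy into one and the DMA out of the other can proceed at the same time. This sequential model only shows the ping-pong structure, not the overlap itself.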
When you use CL_MEM_ALLOC_HOST_PTR and zero-copy is supported, pinned memory is allocated. The kernel can read this data directly through a pointer, so data transfer and kernel execution happen together. That is not a great way to overlap the two, though: the GPU is too fast and will often stall waiting for data to arrive from system memory across PCIe.
When you use the CL_MEM_USE_PERSISTENT_MEM_AMD flag, the buffer is allocated inside the GPU and the CPU gets a pointer that reads and writes across the PCIe bus. In this case, memcpy and kernel execution can happen together, but the memcpy is PIO (programmed I/O), not DMA.
The best way to overlap a transfer with kernel execution is to first allocate a pinned buffer (using CL_MEM_ALLOC_HOST_PTR), map it to get a pointer, and write your data into it. Then allocate another, normal buffer (which sits on the GPU). Now do a clEnqueueWrite* from the pinned buffer's mapped pointer to the normal buffer. This is pure DMA.
It is this DMA that I would like to overlap with kernel execution. I am still investigating whether that is possible.
Will post an update next week.