I'm curious about overlapping data transfers and computation. My understanding is that the DMA engines are independent of kernel execution. In 4.4.5, you state that in the future, if you issue a bunch of independent data transfers and kernels to a queue and then flush it, the OpenCL runtime will "keep the GPU busy with kernel execution and DMA transfers." Does this mean AMD's runtime will (in the future) support overlapping computation and transfers from the command queues alone? I'm trying to put together a higher-level wrapper for clUtil that supports overlapping on both AMD and NVIDIA devices, without having to switch on each device's platform and use whatever platform-specific magic that vendor requires.
Basically, NVIDIA requires that you allocate buffers using CL_MEM_ALLOC_HOST_PTR and then use clEnqueueMapBuffer() for all your data transfer needs (to get a pinned buffer). You then create two command queues and issue commands to each one.
Will this same two-queue approach overlap computation and transfers when DMA is fully implemented in APP?
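To make the question concrete, here is a rough sketch of the enqueue order I have in mind for the two-queue scheme. It is a dry run in plain C, so it makes no OpenCL calls; it only records which of the two command queues each step would go to. The names (plan_overlap, struct step, the STEP_* constants) are made up for illustration and are not part of clUtil or any SDK, and the split into chunks is my own assumption about how the wrapper would slice the data.

```c
#include <assert.h>
#include <stddef.h>

/* Dry-run model only: nothing here touches OpenCL. All names are hypothetical. */
enum step_kind {
    STEP_UPLOAD,   /* host-to-device copy through the pinned buffer
                      (CL_MEM_ALLOC_HOST_PTR + clEnqueueMapBuffer) */
    STEP_KERNEL,   /* kernel launch, e.g. clEnqueueNDRangeKernel */
    STEP_DOWNLOAD  /* device-to-host copy through the pinned buffer */
};

struct step {
    enum step_kind kind;
    int chunk;  /* which slice of the data this step works on */
    int queue;  /* which of the two command queues receives it (0 or 1) */
};

/*
 * Writes three steps per chunk into `out` and returns the number written.
 * Chunk i is issued entirely to queue i % 2. An in-order queue keeps
 * upload -> kernel -> download ordered for that chunk, and the other queue
 * is free to carry the neighbouring chunk's transfers at the same time.
 * Stops early (on a chunk boundary) if `out` has no room for a full chunk.
 */
size_t plan_overlap(int num_chunks, struct step *out, size_t capacity)
{
    assert(out != NULL && num_chunks >= 0);

    size_t n = 0;
    for (int c = 0; c < num_chunks; ++c) {
        int q = c % 2;               /* alternate between the two queues */
        if (n + 3 > capacity)
            break;                   /* never emit a partial chunk */
        out[n++] = (struct step){ STEP_UPLOAD,   c, q };
        out[n++] = (struct step){ STEP_KERNEL,   c, q };
        out[n++] = (struct step){ STEP_DOWNLOAD, c, q };
    }
    return n;
}
```

In a real wrapper, each recorded step would become the corresponding enqueue call on that queue, followed by a clFlush() on both queues. Whether the hardware actually overlaps the steps is up to the runtime, which is exactly what I'm asking about.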