I know this is a preview feature, but I can't seem to interleave compute and transfers on my Radeon 7970s. I'm using SDK 2.6 and the 8.921 linux 64 drivers on Ubuntu.
I do the following:
1) Create two command queues
2) Create 6 buffers
3) set to queue 0
4) enqueue transfer transfer execute
5) set to queue 1
6) enqueue transfer transfer execute
7) keep doing 3-6 a few times
😎 reduce outputs from queues 1 and 2
9) transfer reduced output back to host
10) flush both queues
I'm allocating the buffers as CL_READ_WRITE (nothing else) and calling clEnqueueWriteBuffer() with CL_FALSE for blocking to do the transfers. I daisy chain events for safety reasons (long story short, the library does this), but because each queue only touches 3 of the 6 buffers, there's no dependencies between them (other than the reduction).
When I plot the profiling data on a timeline, it's immediately clear the driver isn't overlapping computation and execution; queue 1 does some transfers and executions then stalls while queue 2 does some then back and forth.
Is there anything I'm missing with this feature, like needing to use pinned memory and clEnqueueMap and such? I want this to work in a general purpose way and I really don't want to have to use clEnqueMap and pinned memory unless I really have to, as I'll need to break transfers into pieces, stream them through pinned memory, and watch events to know when to start the next memcpy. Furthermore, I assumed the runtime already did this anyways.