Archives Discussions

rick_weber · ‎04-02-2012

I know this is a preview feature, but I can't seem to interleave compute and transfers on my Radeon 7970s. I'm using SDK 2.6 and the 8.921 linux 64 drivers on Ubuntu.

I do the following:

1) Create two command queues

2) Create 6 buffers

3) set to queue 0

4) enqueue transfer, transfer, execute

5) set to queue 1

6) enqueue transfer, transfer, execute

7) keep doing 3-6 a few times

8) reduce outputs from queues 1 and 2

9) transfer reduced output back to host

10) flush both queues

I'm allocating the buffers as CL_MEM_READ_WRITE (nothing else) and calling clEnqueueWriteBuffer() with CL_FALSE for blocking to do the transfers. I daisy-chain events for safety reasons (long story short, the library does this), but because each queue only touches 3 of the 6 buffers, there are no dependencies between them (other than the reduction).

When I plot the profiling data on a timeline, it's immediately clear the driver isn't overlapping computation and transfers; queue 1 does some transfers and executions, then stalls while queue 2 does some, and so on back and forth.
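For anyone who wants to check this numerically rather than by eye, here is a minimal sketch of how the overlap could be measured from the profiling timestamps. The `interval_t` type and `overlap_ns` function are hypothetical names, not part of OpenCL; the sketch assumes the start/end values were already read with clGetEventProfilingInfo() (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) and that the commands within each queue do not overlap each other.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled command. Times are in nanoseconds, as reported by
   clGetEventProfilingInfo(). (Hypothetical helper type.) */
typedef struct {
    unsigned long long start_ns;
    unsigned long long end_ns;
} interval_t;

/* Total time during which a command from queue a runs concurrently with
   a command from queue b. Assumes the intervals inside each array are
   disjoint, so the pairwise intersections can simply be summed. */
unsigned long long overlap_ns(const interval_t *a, size_t na,
                              const interval_t *b, size_t nb)
{
    unsigned long long total = 0;
    for (size_t i = 0; i < na; ++i) {
        assert(a[i].end_ns >= a[i].start_ns);
        for (size_t j = 0; j < nb; ++j) {
            assert(b[j].end_ns >= b[j].start_ns);
            /* Intersection of two intervals: [max(starts), min(ends)) */
            unsigned long long lo =
                a[i].start_ns > b[j].start_ns ? a[i].start_ns : b[j].start_ns;
            unsigned long long hi =
                a[i].end_ns < b[j].end_ns ? a[i].end_ns : b[j].end_ns;
            if (hi > lo)
                total += hi - lo;
        }
    }
    return total;
}
```

A result of zero (or close to it) for the two queues' intervals would match what the timeline shows: the queues are taking turns rather than running side by side.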

Is there anything I'm missing with this feature, like needing to use pinned memory and clEnqueueMapBuffer() and such? I want this to work in a general-purpose way, and I really don't want to use clEnqueueMapBuffer() and pinned memory unless I have to, as I'd need to break transfers into pieces, stream them through pinned memory, and watch events to know when to start the next memcpy. Furthermore, I assumed the runtime already did this anyway.
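To make the chunked approach concrete, here is a rough host-only sketch of the loop I'd rather avoid writing. It is not OpenCL code: `staged_copy` is a hypothetical name, the two staging buffers are ordinary memory standing in for pinned buffers, and the second memcpy stands in for a non-blocking write from the staging buffer to the device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies n bytes from src to dst in pieces of at most `chunk` bytes,
   alternating between two staging buffers. Returns the number of pieces.
   Host-only stand-in: in a real version the staging buffers would be
   pinned and the second memcpy would be an asynchronous device write. */
size_t staged_copy(char *dst, const char *src, size_t n,
                   char *staging[2], size_t chunk)
{
    assert(chunk > 0);
    size_t pieces = 0;
    for (size_t off = 0; off < n; off += chunk, ++pieces) {
        size_t len = (n - off < chunk) ? (n - off) : chunk;
        char *stage = staging[pieces % 2];
        /* A real version would first wait on the event of the previous
           transfer that used this staging buffer. */
        memcpy(stage, src + off, len);   /* host data -> staging buffer */
        memcpy(dst + off, stage, len);   /* staging buffer -> destination */
    }
    return pieces;
}
```

Alternating between two staging buffers is what would let the host fill one while the other is still being transferred, which is exactly the event bookkeeping I was hoping the runtime handled for me.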


GPU_ASYNC_MEM_COPY=2 doesn't seem to work