My current implementation creates two queues for a single GPU device.
One queue is used only for memory transfer APIs such as
clEnqueueReadBuffer, clEnqueueWriteBuffer, or clEnqueueCopyBuffer.
The other queue is used only for GPU compute APIs such as clEnqueueNDRangeKernel,
and I sync the two queues using shared cl_event objects when necessary.
But in my tests, this does not make transfer and compute run concurrently. Does this mean that clEnqueueCopyBuffer and clEnqueueNDRangeKernel will execute serially even when they are on different queues?
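One way to check whether a copy and a kernel really ran concurrently is to compare their profiling timestamps. The sketch below is only the comparison step, in plain C. It assumes the start/end values were obtained with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) on queues created with CL_QUEUE_PROFILING_ENABLE, and that both events belong to the same device so the timestamps share a clock. The cmd_window struct and overlap_ns function are hypothetical names, not part of OpenCL.

```c
#include <stdint.h>

/* Execution window of one command, in nanoseconds. Assumed to be filled
   from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and
   CL_PROFILING_COMMAND_END. The struct is a hypothetical helper. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_window;

/* Number of nanoseconds during which both commands were executing.
   A result of 0 means the two commands ran one after the other. */
uint64_t overlap_ns(cmd_window a, cmd_window b)
{
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

If overlap_ns returns 0 for every copy/kernel pair across the two queues, the commands were serialized regardless of how the queues were set up.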
Concurrent memory transfer and kernel execution can happen with a single queue as well. For this to happen, DMA must be enabled, which is not the case with the current implementation.
Thanks. Is there any plan to enable DMA in a future version?
Originally posted by: omkaranathan Concurrent memory transfer and kernel execution can happen with a single queue as well. For this to happen, DMA must be enabled, which is not the case with the current implementation.
Yes, we are working on it.
So AMD's OpenCL compiler doesn't allow for async transfer? Odd.
Micah,
From the little OpenCL I've done, I thought that these routines would be async unless you waited on them. I assume this is not correct. Thanks, good to know if I want to try to pipeline my data/execution.
Originally posted by: MicahVillmow ryta, I don't think he is referring to the async_copy functions but to EnqueueRead/WriteBuffer.
Micah, do you mean that a clEnqueueCopyBuffer command can run concurrently with a clEnqueueNDRangeKernel command, while clEnqueueReadBuffer cannot?
Thanks
Originally posted by: MicahVillmow On architectures that support async data copies, they will be asynchronous, otherwise they will not be. 7XX and Evergreen hardware does not support async kernel copies. Micah
Interesting. This essentially means that on all of your "current" hardware there is no way to do an async data transfer, so no "pipelining" can occur?
I thought this was possible in CAL (though I've never tried it) through DMA?
I'm confused, sorry, lol.
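To make the cost of losing "pipelining" concrete, here is a rough back-of-the-envelope model in C. It is an idealized sketch, not a measurement: it assumes the data is split into equal chunks, each chunk needs one copy followed by one kernel, the per-chunk times are constant, and (in the pipelined case) the copy of the next chunk can fully overlap the kernel of the current one. The function names are made up for illustration.

```c
/* Total time when each chunk's copy and kernel run strictly back to back,
   which is the behavior described for hardware without async copies. */
double serial_time_ms(int chunks, double copy_ms, double kernel_ms)
{
    if (chunks <= 0) return 0.0;
    return chunks * (copy_ms + kernel_ms);
}

/* Total time for an idealized two-stage pipeline: the copy of chunk i+1
   overlaps the kernel of chunk i, so after the first chunk the slower
   of the two stages sets the pace. */
double pipelined_time_ms(int chunks, double copy_ms, double kernel_ms)
{
    if (chunks <= 0) return 0.0;
    double slower = copy_ms > kernel_ms ? copy_ms : kernel_ms;
    return copy_ms + kernel_ms + (chunks - 1) * slower;
}
```

For example, with 4 chunks, 2 ms per copy, and 3 ms per kernel, the serialized schedule takes 20 ms while the idealized pipeline takes 14 ms. Without async copies, only the first figure applies.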
Ok, thought so, thanks.