I haven't found much recent information on this subject on the forum. I saw that asynchronous DMA and kernel execution were supported as of v2.7, so I am unsure how relevant the older (i.e., ~1 year old) posts on this subject are.
To perform asynchronous read, write, and execution, do I need three command queues with APP v2.9, or can I do this with one (out-of-order?) command queue?
Are out-of-order command queues supported on AMD GPUs at this point?
Asynchronous DMA and computation can be achieved through separate command queues. The AsyncDataTransfer sample in APP SDK 2.9 demonstrates this.
Out-of-order command queues are not yet supported on AMD GPUs.
AMD's implementation supports concurrent execution. That means if you enqueue three kernels on a single queue and there are no dependencies between them, they can execute concurrently. So it is not strictly in order.
It would seem then that AMD supports out-of-order queues...?
Here is what I am currently doing: I have one queue that I set to be out of order. I then issue several read and write commands and one kernel execution command. I then wait for all of them to complete using clFinish(). There is no data dependency between the reads, writes, and kernel execution. Based on what prao has said and comments in other forum posts, it would appear that these operations happen serially. Is this correct?
1. As of now, AMD GPUs do not support out-of-order queues. To check whether your device supports out-of-order queues, look at the queue properties reported by clinfo.
2. By default, all clEnqueue* commands (read/write/execution) are asynchronous with respect to the host, but they execute serially on the device.
3. To execute commands asynchronously on the device, you need at least 2 command queues. Whether data transfers actually overlap with device computation depends on whether your device has at least 2 hardware command queues.
4. According to section 5.5.6 of the "AMD Accelerated Parallel Processing OpenCL Programming Guide" (rev 2.7), Southern Islands and later devices support at least 2 hardware command queues.
Regarding your last post, you are correct: it seems that these operations would execute serially.
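For point 1, the same check clinfo performs can also be done in code. The value to inspect is the bitfield that clGetDeviceInfo() returns for CL_DEVICE_QUEUE_PROPERTIES. Below is a minimal sketch of decoding it. Note the assumptions: the typedef and the two constants are copied here only so the snippet compiles without an OpenCL SDK (in a real program, include <CL/cl.h> instead), and supports_out_of_order() and print_queue_support() are hypothetical helper names, not part of OpenCL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: these three definitions mirror the Khronos <CL/cl.h> header so
   the sketch builds without an OpenCL SDK. In real code, include <CL/cl.h>
   and remove them. */
typedef uint64_t cl_command_queue_properties;
#define CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE (1 << 0)
#define CL_QUEUE_PROFILING_ENABLE              (1 << 1)

/* Hypothetical helper: returns 1 if the out-of-order bit is set, else 0.
   `props` is the value obtained with
   clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(props), &props, NULL). */
static int supports_out_of_order(cl_command_queue_properties props)
{
    return (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) != 0;
}

/* Hypothetical helper: prints a one-line summary for a device. */
static void print_queue_support(const char *device_name,
                                cl_command_queue_properties props)
{
    assert(device_name != NULL);
    printf("%s: out-of-order queues %s\n", device_name,
           supports_out_of_order(props) ? "supported" : "not supported");
}
```

If the bit is not set, the device only offers in-order queues, and overlapping transfers with kernels requires the multiple-queue approach from point 3.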