There seems to be some conflicting information about this. Some people suggest using multiple command queues, some say that is not supported, and some say a single command queue is enough. So:
1- Can the driver transfer memory objects while a kernel is executing (assuming the memory objects are not related to the kernel currently running)?
2- Do we need to set CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, or does this option only affect overlapping kernel executions?
1. Yes, these two operations can be overlapped on certain hardware. With DMA support, a device can execute a kernel while performing an independent memory transfer. For more details, see AMD's OpenCL optimization guide.
2. AFAIK, on the AMD platform, a host-side queue works in an in-order manner. However, certain devices have hardware support for handling commands from multiple queues simultaneously. Hence, you can use multiple command queues to enqueue many independent tasks to the device at the same time.
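Note: As per the OpenCL spec, supporting out-of-order queues is not mandatory. So, I guess if you pass the out-of-order flag during command queue creation, the implementation may ignore it, or fail queue creation with CL_INVALID_QUEUE_PROPERTIES, if out-of-order execution is not supported by the device. You can check what a device supports by querying CL_DEVICE_QUEUE_PROPERTIES with clGetDeviceInfo.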
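To get a feel for why overlapping transfers with kernel execution (point 1) can pay off, here is a rough back-of-the-envelope model in C. It is only a sketch under stated assumptions, not a measurement: one DMA engine, one compute engine, a fixed cost per chunk, and enough device buffers that a copy never has to wait for a kernel to release memory. The function names are made up for illustration.

```c
#include <stddef.h>

/* Idealized model, not a measurement: one DMA engine, one compute engine,
   fixed per-chunk costs, and enough device buffers that a copy never waits
   for a kernel to release memory. Function names are hypothetical. */

/* No overlap: every copy and every kernel runs back to back. */
double serialized_time(size_t chunks, double t_copy, double t_kernel)
{
    return (double)chunks * (t_copy + t_kernel);
}

/* The copy of chunk i+1 may run while the kernel for chunk i executes. */
double overlapped_time(size_t chunks, double t_copy, double t_kernel)
{
    double copy_done = 0.0;   /* time at which the DMA engine finished the latest copy */
    double kernel_done = 0.0; /* time at which the compute engine finished the latest kernel */

    for (size_t i = 0; i < chunks; ++i) {
        /* Copies are issued back to back on the DMA engine. */
        copy_done += t_copy;
        /* A kernel starts once its data has arrived and the previous kernel is done. */
        double start = copy_done > kernel_done ? copy_done : kernel_done;
        kernel_done = start + t_kernel;
    }
    return kernel_done;
}
```

For example, with 4 chunks, 2 ms per copy and 3 ms per kernel, the model gives 20 ms without overlap and 14 ms with overlap. Real numbers depend on the device, the driver and how the commands are enqueued.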
1- If you check the latest SDK, you can see that CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is used in 3 examples. Are you sure all AMD devices work in-order? (If not, which ones support this flag?)
2- Can you point to a document that says which AMD devices support multiple queues?
3- Are there any known drawbacks of using multiple command queues, such as cache invalidation?
1. Are you referring to the device-side enqueue examples? If so, then as per the OpenCL spec (clCreateCommandQueueWithProperties), a device-side queue must be an out-of-order queue.
2. As you know, OpenCL supports multiple command queues (more specifically, software command queues). However, how commands enqueued to multiple queues are handled depends on the hardware and implementation. To see how it works on the AMD platform, check section "1.3.6 Command Queue" of AMD's OpenCL programming guide.
3. I'm not aware of any such drawbacks. I would guess that using multiple queues can actually improve performance if the underlying device has such hardware support. However, it is the programmer's responsibility to enqueue independent commands on multiple queues to take proper advantage of it.
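The flag rules discussed in this thread can be sketched as two small bitfield checks in C. This is only an illustration: the `Q_*` macros and `queue_props_t` are local stand-ins that I believe mirror the `CL_QUEUE_*` bit values and `cl_command_queue_properties` from cl.h, and the function names are hypothetical. Real code should include CL/cl.h and use the official names. The device value would come from clGetDeviceInfo with CL_DEVICE_QUEUE_PROPERTIES (named CL_DEVICE_QUEUE_ON_HOST_PROPERTIES in OpenCL 2.0).

```c
/* Local stand-ins, assumed to mirror the CL_QUEUE_* bit values in cl.h.
   Real code should include CL/cl.h and use the official names instead. */
typedef unsigned long long queue_props_t; /* stands in for cl_command_queue_properties */

#define Q_OUT_OF_ORDER (1ULL << 0) /* CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE */
#define Q_PROFILING    (1ULL << 1) /* CL_QUEUE_PROFILING_ENABLE */
#define Q_ON_DEVICE    (1ULL << 2) /* CL_QUEUE_ON_DEVICE */

/* device_props: the bitfield a device reports for CL_DEVICE_QUEUE_PROPERTIES.
   Returns 1 if host queues on that device may be created out-of-order. */
int device_supports_out_of_order(queue_props_t device_props)
{
    return (device_props & Q_OUT_OF_ORDER) != 0;
}

/* requested: the properties you intend to pass at queue creation.
   Returns 0 if a device-side queue is requested without the out-of-order
   flag, which the spec does not allow; returns 1 otherwise. */
int device_queue_props_valid(queue_props_t requested)
{
    if ((requested & Q_ON_DEVICE) && !(requested & Q_OUT_OF_ORDER)) {
        return 0;
    }
    return 1;
}
```

Checking the reported properties before creating the queue avoids relying on whatever a particular implementation does with an unsupported flag.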