I am currently working on overlapping my memory transfers (each transfer about 1 GB in size) with computation.
However, even when I use async memory transfers in OpenACC, the profiler shows that the OpenCL command queue runs in order, blocking all other commands until the transfer is done. So I cannot run the computation concurrently (on another set of data previously brought into memory).
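A minimal sketch of the kind of pattern described above, for concreteness. The function name `scale_chunked`, the chunking scheme, and the trivial kernel are hypothetical stand-ins, not the actual code; the sketch assumes the array length divides evenly into chunks and alternates between two async queues so that the transfer of one chunk could, in principle, overlap with the computation on another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale an array in chunks, alternating between
 * two OpenACC async queues. Compiled without OpenACC support, the
 * pragmas are ignored and the loop simply runs serially on the host. */
void scale_chunked(float *a, size_t n, size_t nchunks, float factor)
{
    /* Assumption for this sketch: n is an exact multiple of nchunks. */
    assert(nchunks > 0 && n % nchunks == 0);
    size_t len = n / nchunks;

    /* Allocate device storage once; data is moved chunk by chunk below. */
    #pragma acc data create(a[0:n])
    {
        for (size_t c = 0; c < nchunks; ++c) {
            size_t off = c * len;
            int q = (int)(c % 2) + 1;  /* alternate between queues 1 and 2 */

            /* Host-to-device copy, kernel, and device-to-host copy are
             * ordered within queue q but independent of the other queue. */
            #pragma acc update device(a[off:len]) async(q)
            #pragma acc parallel loop present(a[off:len]) async(q)
            for (size_t i = off; i < off + len; ++i)
                a[i] *= factor;
            #pragma acc update host(a[off:len]) async(q)
        }
        /* Block until all async queues have drained. */
        #pragma acc wait
    }
}
```

Calling `scale_chunked(a, n, 4, 2.0f)` doubles every element of `a`. Whether the work on queue 1 actually overlaps with the transfer on queue 2 depends on how the OpenACC runtime maps its async queues onto the underlying command queues, which is exactly what the profiler output calls into question.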
Is there a way to switch the queue to out-of-order execution, and would that resolve the issue? If not, how else can I resolve it? I cannot fetch the command queue myself, as that is an OpenACC 2.0 feature that is not yet implemented in the latest compiler. Even if I could, I am not sure whether out-of-order execution is supported.
Is there a way to make out-of-order execution the default (preferably via an environment variable or something similar)? Is it supported by the GPU/runtime/SDK?
Is there another way to overlap the compute and memory transfers, if the above is not possible?