Note: My application needs DMA uploads, DMA downloads, and computation to execute simultaneously. I have already shown that this works on Nvidia GPUs, but I now need to find out whether it can be done with open source drivers, because of the licensing problems inherent in Linux (the GPL). If it is feasible, the second part of my question comes into play:
Is this capability supported via the OpenCL 1.2 API? If so, should I create three parallel command queues (one each for upload, compute, and download), similar to how CUDA C handles this with streams, or is some other method of implementation envisioned or recommended?
Thanks in advance to anyone who can help.