I have a piece of OpenCL code in which the data transfers (ReadBuffer / WriteBuffer) take about as long as the computation. I would like to allocate two input and two output buffers, and run the kernel on one pair of buffers while I read/write the other pair. Is this possible with AMD's OpenCL implementation?
I tried using an out-of-order queue, but I did not get any speedup over the synchronous version.