My application runs the same sequence of kernels several times for different chunks of data.
I'm trying to overlap transfers with kernel execution: specifically, I want to transfer the next chunk of data while the current one is being processed.
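To make the goal concrete, here is a small model of the schedule I'm hoping for. It is only a sketch under simplifying assumptions: every chunk takes the same transfer time and the same kernel time, there are two device-side buffers, and the function names and the timing values are made up for illustration.

```c
#include <assert.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* One queue: every transfer and every kernel runs back to back. */
double serial_time(int n_chunks, double t_xfer, double t_kernel)
{
    assert(n_chunks >= 0);
    return n_chunks * (t_xfer + t_kernel);
}

/* Two queues and two device buffers: the transfer of chunk i may run
 * while the kernel of chunk i-1 is still executing. */
double overlapped_time(int n_chunks, double t_xfer, double t_kernel)
{
    assert(n_chunks >= 0);
    double xfer_end = 0.0;           /* finish time of the latest transfer */
    double last_kernel = 0.0;        /* finish time of the latest kernel */
    double kern_end[2] = {0.0, 0.0}; /* last kernel finish per device buffer */

    for (int i = 0; i < n_chunks; ++i) {
        int s = i & 1; /* device buffer used by chunk i */
        /* Transfer i waits for transfer i-1 and for the kernel that
         * last read buffer s (chunk i-2). */
        xfer_end = max2(xfer_end, kern_end[s]) + t_xfer;
        /* Kernel i waits for its own data and for kernel i-1. */
        last_kernel = max2(xfer_end, last_kernel) + t_kernel;
        kern_end[s] = last_kernel;
    }
    return last_kernel;
}
```

For example, with 4 chunks, a hypothetical 2 ms per transfer and 3 ms per kernel, `serial_time` gives 20 ms and `overlapped_time` gives 14 ms. That difference is what I'm trying to recover.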
The input data is stored in 4K-aligned host memory for which clCreateBuffer was called with the CL_MEM_USE_HOST_PTR flag (i.e. in pinned memory). The transfers to the GPU are done with clEnqueueCopyBuffer calls, which are indeed faster than calling clEnqueueWriteBuffer on regular host memory blocks.
In order to overlap transfer with compute I'm trying to use two different command queues (in the same manner as streams are used in CUDA). However, this results in sequential execution. The only sample in the SDK I found on the subject is TransferOverlap, but using CL_MEM_USE_PERSISTENT_MEM_AMD does not seem to be a viable option in my case. There is no way to get the input data into such buffers directly, so I would have to copy it into that memory on the host first. This causes a CPU load spike and doubles memory consumption. The host->GPU transfer rate would also be suboptimal.
On CUDA the same technique works nicely: on devices with compute capability 1.1-3.0 there are two queues, one of which executes transfers to and from the host while the other executes kernels. Commands from the two queues can run in parallel, and in practice they do.
Are AMD GPUs capable of transferring data to the GPU in parallel with kernel execution when the input data is not stored in CL_MEM_USE_PERSISTENT_MEM_AMD buffers?