Overlapping DMA with kernel execution is entirely possible on AMD platforms. See the URL I posted in the reply above.
The transferOverlap sample only covers PIO (CPU programmed I/O) overlapped with OpenCL kernel execution.
The APP SDK has no DMA overlap sample, but the URL above has sources that show how DMA and kernel execution can be overlapped.
To evaluate your approach, you may want to consider the following:
1. memset() a huge array in Pinned memory
2. memset() a huge array in Persistent memory on GPU
Compare the speed of the two. Since stores are fire-and-forget, (2) may run faster than you expect.
So, as long as you are only writing, persistent memory should be good. But it's always good to experiment and find out.
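A minimal sketch of how that comparison could be timed, in plain C. The helper name `time_fill` is hypothetical, and it assumes you already have a host-visible pointer for each case, for example one returned by clEnqueueMapBuffer for a buffer created with CL_MEM_ALLOC_HOST_PTR (pinned) and one for a buffer created with CL_MEM_USE_PERSISTENT_MEM_AMD (persistent). It only measures the CPU-side writes, not the map/unmap calls around them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: run `reps` memset() passes over `bytes` bytes at `dst`
 * and return the average wall-clock seconds per pass.
 * Assumption: `dst` is a host-visible pointer, e.g. a mapped pinned or
 * persistent OpenCL buffer; any writable memory works for a dry run. */
double time_fill(void *dst, size_t bytes, int reps)
{
    struct timespec t0, t1;

    assert(dst != NULL && bytes > 0 && reps > 0);

    timespec_get(&t0, TIME_UTC);              /* C11 wall-clock timestamp */
    for (int i = 0; i < reps; ++i) {
        memset(dst, i & 0xFF, bytes);         /* vary the fill value per pass */
    }
    timespec_get(&t1, TIME_UTC);

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return elapsed / reps;
}
```

Call it once with the mapped pinned pointer and once with the mapped persistent pointer, using the same `bytes` and `reps`, and compare the two results. A small `bytes` with a large `reps` shows the per-pass cost rather than the peak bandwidth.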
Sorry, Himanshu, but it looks like you completely missed the point of my post.
It's not any kind of bug report this time. I don't question the DMA ability of the AMD driver. When I said "DMA not possible for this part of app", I meant just that: the data will be used in the subsequent kernel call, so without an algorithm change, overlapping DMA is not possible here (here, not in general). Also, the memory transfer is not huge; as I said, the data size is less than 128k.
What I expected to hear in return is some feedback from more experienced AMD OpenCL users who may have had the same situation and can give some insight. Your testing methodology will definitely give distorted results, because with a large memory array the overhead of preparing the data transfer becomes negligible. In my case that overhead can be crucial.
Well, sure, I can try both ways and decide which is better, but such experimentation costs time... and forum boards are for sharing experience, no?
Sorry, I read in haste. A re-read helped.
Yeah, just analyzing the problem (with the overlap goodies removed):
1. PERSISTENT MEM --> Will work at the speed of PCIe at most (8 GB/s)
2. ALLOC_HOST_PTR --> Write to memory and then read by GPU.
This involves WRITE to RAM followed by READ through PCIe.
Assuming the write to RAM is faster, this is still limited by PCIe and involves 2 transfers. Hmm...
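As a rough back-of-the-envelope sketch of the two paths above (not a measurement), the arithmetic can be written out in C. The function names are hypothetical, the RAM bandwidth is whatever you assume for your machine, and the model deliberately ignores any fixed per-transfer overhead and any overlap between the two steps of path 2.

```c
#include <assert.h>

/* Hypothetical helper: seconds needed to move `bytes` at `gb_per_s`
 * (1 GB taken as 1e9 bytes). */
double transfer_seconds(double bytes, double gb_per_s)
{
    assert(bytes >= 0.0 && gb_per_s > 0.0);
    return bytes / (gb_per_s * 1e9);
}

/* Path 1 (persistent memory): the CPU writes straight across PCIe,
 * so there is a single pass limited by PCIe bandwidth. */
double persistent_seconds(double bytes, double pcie_gb_per_s)
{
    return transfer_seconds(bytes, pcie_gb_per_s);
}

/* Path 2 (ALLOC_HOST_PTR): the CPU writes to host RAM, then the GPU
 * reads the data through PCIe. Assumption: the two steps do not overlap. */
double host_ptr_seconds(double bytes, double ram_gb_per_s, double pcie_gb_per_s)
{
    return transfer_seconds(bytes, ram_gb_per_s)
         + transfer_seconds(bytes, pcie_gb_per_s);
}
```

For a 128 KB buffer at 8 GB/s, `persistent_seconds(131072.0, 8.0)` comes to roughly 16 microseconds, and path 2 adds the RAM write on top of that. At this size any fixed setup cost per transfer could easily dominate both numbers, which is why measuring is still needed.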
Apart from that, I don't know whether DMA uses PCIe bus bandwidth differently (say, DMA uses a burst/streaming mode) than the CPU doing a streamed write (this may depend on how the PCI bridge is configured). This factor can affect timings.
But for practical results, one needs to experiment. I don't have this data at the moment.
But if you find anything, please post. It will be useful to many. Thanks!