I need to pass a few arrays, each under roughly 128 kB, to the GPU.
Because the CPU only writes these arrays and never reads them, I see two possible ways:
1) Create the buffers on the GPU with the AMD_PERSISTENT flag. Map them to the host, write directly into them, unmap, and use them on the GPU until the next cycle.
2) Create a buffer in GPU memory and a second buffer in pinned host memory (ALLOC_HOST_PTR flag). Map the pinned buffer, write to it, then use WriteBuffer to transfer the data from the pinned buffer to GPU memory. For now it's almost impossible to overlap kernel execution with the memory transfer at that particular place in my app, so this advantage of DMA will most probably be lost anyway in the second case.
So the question is: if overlap with kernel execution is not possible either way, which approach gives the fastest data transfer with the smallest overall overhead?
DMA overlap with kernel execution is entirely possible on AMD platforms. See the URL I posted in the reply above.
The transferOverlap sample only covers PIO (CPU programmed I/O) overlapped with OpenCL kernel execution.
The APP SDK does not include a DMA overlap sample, but the URL above has sources that show how DMA and a kernel can be overlapped.
To evaluate your approach, you may want to consider the following:
1. memset() a huge array in Pinned memory
2. memset() a huge array in Persistent memory on GPU
Evaluate the speed. Since stores are fire-and-forget, (2) may run faster than you expect.
So, as long as you are only writing, persistent memory should be good. But it's always good to experiment and find out.
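The fill test above could be timed with something like the sketch below. It is only a host-side harness: `time_fill`, `now_seconds` and `fill_sink` are names made up for illustration, and `dst` is assumed to be the pointer you get back from mapping the pinned or the persistent buffer. Map/unmap calls are not included, so their cost is not measured unless you move them inside the timed region.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Keeps the compiler from discarding the fills as dead stores. */
static volatile unsigned char fill_sink;

/* Wall-clock time in seconds, using C11 timespec_get. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Fill `dst` with `bytes` bytes, `reps` times, and return the average
 * seconds per fill. Assumption: in the real test `dst` is a mapped
 * pinned or persistent buffer; here it is just a writable pointer. */
double time_fill(void *dst, size_t bytes, int reps)
{
    assert(dst != NULL && bytes > 0 && reps > 0);

    double start = now_seconds();
    for (int i = 0; i < reps; ++i) {
        memset(dst, i & 0xff, bytes);
        /* One-byte read-back so each fill is observable. */
        fill_sink = ((volatile unsigned char *)dst)[bytes - 1];
    }
    return (now_seconds() - start) / reps;
}
```

Calling `time_fill(ptr, bytes, reps)` once per mapped pointer gives two numbers to compare. Note that the one-byte read-back touches the buffer from the CPU, which may itself be slow on persistent memory, so treat the result as a rough indication only.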
Sorry, Himanshu, but it looks like you completely missed the point of my post.
This is not a bug report this time, and I'm not questioning the DMA ability of the AMD driver. When I said that overlap is not possible for this part of the app, I meant just that: the data is used by the very next kernel call, so without an algorithm change overlapping DMA is not possible here (here, not in general). Also, the memory transfer is not huge; as I said, the data size is less than 128k.
What I hoped to hear in return was feedback from more experienced AMD OpenCL users who may have been in the same situation and can give some insight. Your testing methodology will definitely give distorted results, because with a large array the overhead of preparing the data transfer becomes negligible. In my case that overhead can be crucial.
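To put rough numbers on that concern, here is a back-of-envelope sketch assuming a simple linear cost model: a fixed per-transfer overhead plus size divided by peak bandwidth. The function names are hypothetical, the 20 µs overhead used in the note after the code is a placeholder and not a measured value, and the 8 GB/s figure is the PCIe peak quoted elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Time for one transfer under an assumed linear model:
 * fixed overhead + bytes / peak bandwidth. */
double transfer_seconds(size_t bytes, double peak_bytes_per_s,
                        double overhead_s)
{
    assert(peak_bytes_per_s > 0.0 && overhead_s >= 0.0);
    return overhead_s + (double)bytes / peak_bytes_per_s;
}

/* Fraction of the total transfer time spent in the fixed overhead. */
double overhead_fraction(size_t bytes, double peak_bytes_per_s,
                         double overhead_s)
{
    double total = transfer_seconds(bytes, peak_bytes_per_s, overhead_s);
    return total > 0.0 ? overhead_s / total : 0.0;
}
```

With these assumed inputs, `overhead_fraction(128 * 1024, 8e9, 20e-6)` comes out a bit above one half, while the same call for a 256 MB array gives well under 0.1%. That is why a benchmark on a huge array would hide exactly the cost that matters at 128k.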
Well, sure, I can try both ways and decide which is better, but such experimentation costs time... and forum boards are for sharing experience, aren't they?
Sorry, I read it in haste. A re-read helped.
Yeah. Just analyzing the problem (with the overlap goodies removed):
1. PERSISTENT MEM --> Works at PCIe speed at best, 8 GB/s at most.
2. ALLOC_HOST_PTR --> The CPU writes to host memory, which the GPU then reads.
This involves a WRITE to RAM followed by a READ over PCIe.
Assuming the write to RAM is faster, this is still limited by PCIe and involves 2 transfers. Hmm.....
Apart from that, I don't know whether DMA uses PCIe bus bandwidth differently (say, DMA uses a burst/streaming mode) than a CPU doing streamed writes (this may depend on how the PCI bridge is configured). This factor can affect timings.
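The analysis above can be written down as a tiny model. This is only a sketch under stated assumptions: the two steps of the pinned path run one after the other with no overlap, fixed driver overheads are ignored, and the function names and any bandwidth values you plug in (other than the 8 GB/s PCIe peak) are hypothetical. The CPU-write and DMA bandwidths over PCIe are kept as separate parameters because they may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Path 1: CPU stores go straight over PCIe into persistent memory. */
double persistent_path_seconds(size_t bytes, double cpu_pcie_bytes_per_s)
{
    assert(cpu_pcie_bytes_per_s > 0.0);
    return (double)bytes / cpu_pcie_bytes_per_s;
}

/* Path 2: CPU writes to pinned RAM, then the data crosses PCIe by DMA.
 * Assumes the two steps are fully serialized. */
double pinned_path_seconds(size_t bytes, double ram_bytes_per_s,
                           double dma_pcie_bytes_per_s)
{
    assert(ram_bytes_per_s > 0.0 && dma_pcie_bytes_per_s > 0.0);
    return (double)bytes / ram_bytes_per_s
         + (double)bytes / dma_pcie_bytes_per_s;
}
```

If both PCIe bandwidths are equal, the pinned path is slower by exactly the RAM write time. It can only win in this model if DMA moves data over PCIe noticeably faster than CPU streamed writes do, which is the unknown mentioned above.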
But for any practical result, one needs to experiment. I don't have this data at the moment.
But if you find anything, please post. It will be useful to many. Thanks!