Archives Discussions

Raistmer · ‎05-09-2013

I need to pass a few arrays, each under roughly 128 kB, to the GPU.

Because the CPU only writes these arrays and never reads them, I think there are two possible ways:

1) Create buffers on the GPU with the AMD persistent-memory flag (CL_MEM_USE_PERSISTENT_MEM_AMD). Map them to the host, write directly to them, unmap, and use them on the GPU until the next cycle.

2) Create a buffer in GPU memory and a buffer in pinned host memory (CL_MEM_ALLOC_HOST_PTR flag). Map the pinned buffer, write to it, then use clEnqueueWriteBuffer to transfer the data from the pinned buffer to GPU memory. For now it's almost impossible to overlap kernel execution with the memory transfer in that particular place of my app, so this advantage of DMA will most probably be lost anyway in the second case.

So the question is: if overlap with kernel execution is not possible in either case, which way will provide the fastest data transfer with the smallest overall overhead?

himanshu_gautam · ‎05-09-2013

Check Table 4.2 in the OpenCL Programming Guide.

Also check out: http://devgurus.amd.com/message/1296694#1296694

himanshu_gautam · ‎05-10-2013

DMA overlap with kernel execution is entirely possible on AMD platforms. See the URL I posted in the reply above.

The transferOverlap sample only covers overlapping PIO (CPU programmed I/O) with OpenCL kernel execution.

There is no DMA overlap sample in the APP SDK, but the URL above has sources that show how DMA and kernel execution can be overlapped.

To evaluate your approach, you may want to consider the following:

1. memset() a huge array in Pinned memory

2. memset() a huge array in Persistent memory on GPU

Compare the speeds. Since stores are fire-and-forget, (2) may run faster than you expect.

So, as long as you are only writing, persistent memory should be good. But it's always good to experiment and find out.

Raistmer · ‎05-10-2013

Sorry, Himanshu, but it looks like you completely missed the point of my post.

This is not a bug report this time, and I don't question the DMA ability of the AMD driver. When I said overlap is not possible in this part of the app, I meant just that: the data is used by the very next kernel call, so without an algorithm change DMA cannot be overlapped here (here, not in general). Also, the transfer is not huge; as I said, the data size is less than 128 kB.

What I expected to hear in return is feedback from more experienced AMD OpenCL users who have perhaps been in the same situation and can give some insight. Your testing methodology will definitely give distorted results, because with a large array the overhead of preparing the data transfer becomes negligible. In my case that overhead can be crucial.

Well, sure, I can try both ways and decide which is better, but such experimentation costs time... and forum boards are for sharing experience, aren't they?
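The point about a huge test array hiding the setup cost can be illustrated with a back-of-the-envelope model. This is only a sketch: it assumes the cost of one transfer is a fixed overhead plus bytes divided by bandwidth, and both function names are hypothetical helpers, not part of any OpenCL API. Any overhead value passed in is a placeholder to be replaced by a real measurement.

```c
#include <assert.h>

/* Assumed cost model (not a driver measurement):
 * time = fixed per-transfer overhead + bytes / bandwidth.
 * Returns microseconds. */
static double transfer_time_us(double bytes, double bandwidth_bytes_per_s,
                               double overhead_us)
{
    assert(bytes > 0.0 && bandwidth_bytes_per_s > 0.0 && overhead_us >= 0.0);
    return overhead_us + bytes / bandwidth_bytes_per_s * 1e6;
}

/* Fraction of the total transfer time that is spent on the fixed overhead. */
static double overhead_fraction(double bytes, double bandwidth_bytes_per_s,
                                double overhead_us)
{
    return overhead_us /
           transfer_time_us(bytes, bandwidth_bytes_per_s, overhead_us);
}
```

With an assumed 8 GB/s link, moving 128 KiB takes about 16.4 microseconds of pure transfer time, so a hypothetical fixed overhead of 20 microseconds would account for roughly 55% of the total. For a 256 MiB array the same overhead is well under 0.1% of the total, which is why a huge-array benchmark says little about the small-buffer case.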

himanshu_gautam · ‎05-10-2013

Sorry, I read it in haste. A re-read helped.

Yeah, just analyzing the problem (with the overlap goodies removed):

1. PERSISTENT MEM --> Works at PCIe speed at most (8 GB/s peak).

2. ALLOC_HOST_PTR --> The CPU writes to host memory, which is then read by the GPU.

This involves a WRITE to RAM followed by a READ over PCIe.

Assuming the write to RAM is faster, this is still limited by PCIe and involves 2 transfers. Hmm.....

Apart from that, I don't know whether DMA uses PCIe bus bandwidth differently (say, DMA uses a burst/streaming mode) than a CPU doing a streamed write (this may depend on how the PCI bridge is configured). This factor can affect timings.

But for any practical result, one needs to experiment. I don't have this data at the moment.
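The analysis above can be turned into a rough comparison. The following is a minimal sketch under stated assumptions, not measured behaviour: path 1 is modelled as a single CPU write across PCIe at some fraction of the peak rate, and path 2 as a write to RAM followed serially by a DMA transfer at the full PCIe rate. The function names, the cpu_write_efficiency factor and any RAM bandwidth figure are hypothetical, and fixed per-call overheads (map, unmap, enqueue) are deliberately left out.

```c
#include <assert.h>

/* Path 1: CPU stores go straight into the persistent GPU buffer over PCIe.
 * cpu_write_efficiency in (0, 1] is a hypothetical factor for how close
 * CPU streamed writes get to the PCIe peak. Returns microseconds. */
static double persistent_write_us(double bytes, double pcie_bw,
                                  double cpu_write_efficiency)
{
    assert(bytes > 0.0 && pcie_bw > 0.0);
    assert(cpu_write_efficiency > 0.0 && cpu_write_efficiency <= 1.0);
    return bytes / (pcie_bw * cpu_write_efficiency) * 1e6;
}

/* Path 2: write to pinned host RAM, then DMA the buffer over PCIe.
 * The two steps are summed because no overlap is assumed. */
static double pinned_copy_us(double bytes, double ram_bw, double pcie_bw)
{
    assert(bytes > 0.0 && ram_bw > 0.0 && pcie_bw > 0.0);
    return (bytes / ram_bw + bytes / pcie_bw) * 1e6;
}

/* CPU write efficiency at which both paths take the same time in this
 * model: bytes/(e*pcie) = bytes/ram + bytes/pcie  =>  e = 1/(1 + pcie/ram). */
static double break_even_efficiency(double ram_bw, double pcie_bw)
{
    assert(ram_bw > 0.0 && pcie_bw > 0.0);
    return 1.0 / (1.0 + pcie_bw / ram_bw);
}
```

For instance, with 8 GB/s for PCIe and an assumed 20 GB/s for RAM writes, the break-even efficiency is about 0.71: in this model the persistent path wins only if CPU writes reach more than roughly 71% of the rate DMA achieves. Since fixed overheads are ignored, the result is a starting point for an experiment rather than an answer.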

But if you find anything, please post. It will be useful to many. Thanks!

Write to GPU persistent memory vs copy from pinned memory: which is better?