gkhanna · 05-15-2011

Request for information on best practice

I just got a Linux box with an AMD E-350 Accelerated Processor. Since this APU doesn't have the PCIe bottleneck, I'm trying to understand the best way to transfer data between the CPU and the GPU using OpenCL.

I have experimented a little with the PCIeBandwidth sample in APP SDK 2.4. I'm getting about 2.5 GB/s, which looks good, especially compared with another system that has a Radeon on the PCIe bus. Is there a better way to perform such a test? Is there sample code somewhere?
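For comparing numbers across machines, the core of any such test is "bytes moved divided by elapsed time". The sketch below shows that arithmetic with a generic timing loop; it uses a plain `memcpy` as a stand-in for the transfer (it does not call OpenCL), and the helper names `gb_per_s` and `time_transfer` are hypothetical. In a real OpenCL test the loop body would be the transfer call (for example `clEnqueueWriteBuffer`) and the timer would stop only after `clFinish`, so queued work is counted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Bandwidth in decimal GB/s (1 GB = 1e9 bytes) for `reps` transfers of `bytes` each. */
static double gb_per_s(size_t bytes, int reps, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * reps / seconds / 1e9;
}

/* Wall-clock time in seconds using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Times `reps` copies of `bytes` bytes from src to dst.
   memcpy is only a stand-in: an OpenCL test would enqueue the transfer here
   and call clFinish(queue) before reading the clock again. */
static double time_transfer(void *dst, const void *src, size_t bytes, int reps)
{
    memcpy(dst, src, bytes); /* untimed warm-up so page faults are not measured */
    double t0 = now_seconds();
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, bytes);
    return now_seconds() - t0;
}
```

For example, `gb_per_s(bytes, reps, time_transfer(dst, src, bytes, reps))` on buffers of tens of megabytes gives a host-memory baseline to put next to the device-host figure. Large buffers and several repetitions keep timer resolution from dominating the result.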

Also, I understand that there is no zero-copy support on Linux yet. Will that affect device-host bandwidth on the APU? If so, is there an estimate of how much?

Thanks!

himanshu_gautam · 05-16-2011

I would suggest referring to the AMD APP OpenCL Programming Guide.

I think PCIeBandwidth is an apt test for measuring improvements in data transfer.

MicahVillmow · 05-16-2011

gkhanna,
Although zero-copy support doesn't exist on Linux yet, I would program as if it did. That way, when it gets enabled, you should see a performance boost with no code changes.

gkhanna · 05-17-2011

Thank you. Section 4.4 in that guide was very helpful.

holger · 05-22-2011

I'd like to dig deeper here. I would think that programming under the assumption of zero copy is very different from programming without it.

If I understand the concept correctly, mapping device memory, changing one bit, and unmapping it causes two PCIe transfers of the buffer if zero copy isn't used, and hardly any traffic if it is. Is this correct? If so, I would rather create (and fill) a new buffer than modify one that is already on the device.
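The arithmetic behind this conclusion can be written as a toy cost model. It encodes only the assumptions stated in the post (a full copy in each direction when zero copy is unavailable); the replies further down explain that the map flags can avoid one or both copies. All names are hypothetical, and the numbers are bytes crossing the bus, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model: counts bytes that cross the CPU-GPU bus under
   the assumptions stated in the post, not measured runtime behavior. */
typedef enum { MODE_COPY, MODE_ZERO_COPY } transfer_mode;

/* Map the whole buffer, modify `modified` bytes on the host, unmap. */
static size_t map_modify_unmap_bytes(size_t buffer_size, size_t modified,
                                     transfer_mode mode)
{
    assert(modified <= buffer_size);
    if (mode == MODE_ZERO_COPY)
        return modified;        /* host writes land directly in device-visible memory */
    return 2 * buffer_size;     /* device->host on map, host->device on unmap */
}

/* Create a new buffer and fill it from host memory: one full upload. */
static size_t recreate_and_fill_bytes(size_t buffer_size)
{
    return buffer_size;
}
```

Under these assumptions, changing one byte of a 64 MB buffer costs 128 MB of traffic via map/unmap versus 64 MB for recreating the buffer, which is the trade-off the post describes.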

Can we hope for a zero-copy implementation on Linux any time soon? Is there an unsupported way of activating it? I wouldn't want to write code that performs well only if and when features are activated in the future.

nou · 05-23-2011

Well, mapping a buffer doesn't mean two transfers over PCIe. When you map a buffer, you specify an access flag. If you specify CL_MAP_WRITE, the implementation will just allocate some memory (and not even always that; see CL_MEM_ALLOC_HOST_PTR) where you can write your data. This memory will contain random data, as OpenCL assumes that you will only write into this memory region.

I recommend thoroughly reading the whole memory optimization chapter of the AMD OpenCL guide.

holger · 05-23-2011

Hi nou,

Thank you for your reply. However, you did not actually answer my question. I feel that you have in fact proven my point: it does make a difference whether you have zero copy or not, even if the difference is only the importance of a flag. Also, paragraph 4 of Section 4.4.3.1 of the Programming Guide explicitly recommends zero-copy device-resident memory for sparse updates of a buffer. This might have implications for the overall design of an application (I could probably construct a case if need be).

So should I hold my breath for zero-copy memory objects on Linux?

Still, there is one thing I don't understand about your answer: does it mean that, when I map a buffer with CL_MAP_WRITE, the runtime allocates (virtual) memory and, when unmapping, transfers only the parts that were modified (i.e., backed by physical memory)? Otherwise I'd end up with the "random data" in my buffer. I assume the granularity is that of a memory (TLB) page, so usually 4 KB. Is that correct? This would then require the 4 KB pages to be filled with the correct data from the device on (actually before) modification. If I'm not mistaken, this would mean that, on a page fault, the OS kernel would have to transfer the respective page from the device.
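The page-granularity scheme asked about here can be simulated to see what it would save. The sketch below only does the bookkeeping: it assumes 4 KB pages and a hypothetical runtime that writes back whole dirty pages on unmap. It does not show how writes would be detected (for example by write-protecting the mapping and handling the resulting faults), nor the device-to-host fill of a page before its first write. Whether AMD's runtime works this way is exactly the open question, so the scheme and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page granularity */
#define TOY_MAX_PAGES 1024u  /* toy limit: buffers up to 4 MiB */

/* Hypothetical per-buffer dirty map: one flag per page of the mapped region. */
typedef struct {
    size_t size;                  /* buffer size in bytes */
    bool dirty[TOY_MAX_PAGES];    /* dirty[i]: page i was written while mapped */
} dirty_map;

static void dirty_map_init(dirty_map *m, size_t size)
{
    assert(size <= (size_t)TOY_PAGE_SIZE * TOY_MAX_PAGES);
    m->size = size;
    memset(m->dirty, 0, sizeof m->dirty);
}

/* Record a host write of `len` bytes at `offset`; marks every page it touches. */
static void record_write(dirty_map *m, size_t offset, size_t len)
{
    assert(len > 0 && offset + len <= m->size);
    for (size_t p = offset / TOY_PAGE_SIZE; p <= (offset + len - 1) / TOY_PAGE_SIZE; p++)
        m->dirty[p] = true;
}

/* Bytes written back to the device on unmap: whole dirty pages,
   with the last page clipped to the buffer size. */
static size_t writeback_bytes(const dirty_map *m)
{
    size_t total = 0;
    size_t npages = (m->size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
    for (size_t p = 0; p < npages; p++) {
        if (!m->dirty[p])
            continue;
        size_t start = p * TOY_PAGE_SIZE;
        size_t end = start + TOY_PAGE_SIZE < m->size ? start + TOY_PAGE_SIZE : m->size;
        total += end - start;
    }
    return total;
}
```

With this model, a single one-byte write into a mapped 10,000-byte buffer costs one 4,096-byte page on unmap instead of the whole buffer; a write that straddles a page boundary costs two pages.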

Is zero copy (memory-mapping a device) so hard to implement on Linux that it justifies this complexity?

nou · 05-23-2011

OK, I forgot that this is true only for specific buffers. If you create a buffer with CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, the runtime tracks whether the buffer was modified with clEnqueueWrite/Copy/NDRange. If not, it doesn't transfer the data from the device to the host.

With CL_MAP_READ, there is never a host-to-device transfer.
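These two rules, together with the earlier point about CL_MAP_WRITE, can be summarized as a small state machine. It models the behavior described in this thread, not any verified runtime; `TOY_MAP_READ` and `TOY_MAP_WRITE` are local stand-ins for `CL_MAP_READ` and `CL_MAP_WRITE` (not the real constant values), and the other names are hypothetical too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for CL_MAP_READ / CL_MAP_WRITE; not the real CL values. */
enum { TOY_MAP_READ = 1, TOY_MAP_WRITE = 2 };

/* Hypothetical model of a CL_MEM_USE_HOST_PTR / CL_MEM_ALLOC_HOST_PTR buffer:
   the runtime remembers whether the device copy is newer than the host copy. */
typedef struct {
    size_t size;
    bool device_newer;   /* set by a device-side write (write/copy/kernel) */
    int mapped_flags;    /* flags of the current mapping, 0 if unmapped */
} host_ptr_buffer;

static void device_side_write(host_ptr_buffer *b) { b->device_newer = true; }

/* Returns bytes copied device->host on map: only for a readable mapping,
   and only if the device copy was modified since the last download. */
static size_t map_buffer(host_ptr_buffer *b, int flags)
{
    assert(b->mapped_flags == 0 && flags != 0);
    b->mapped_flags = flags;
    if ((flags & TOY_MAP_READ) && b->device_newer) {
        b->device_newer = false;
        return b->size;
    }
    return 0;
}

/* Returns bytes copied host->device on unmap: never for a read-only mapping. */
static size_t unmap_buffer(host_ptr_buffer *b)
{
    int flags = b->mapped_flags;
    assert(flags != 0);
    b->mapped_flags = 0;
    if (flags & TOY_MAP_WRITE) {
        b->device_newer = false;   /* device now holds the host's data */
        return b->size;
    }
    return 0;
}
```

In this model a read-only map of an unmodified buffer moves nothing, a read-only map after a kernel wrote the buffer moves it once, and only writable mappings trigger an upload on unmap.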

device-host transfer on Fusion