I'm working on a platform with an A8 APU and a 7970 GPU.
I'm wondering whether the following two buffer allocation strategies differ in access performance, from both the host and the device, on an APU's GPU and on a discrete GPU:
1) Allocate standard host memory (malloc), initialize it through the host pointer (memset), create a buffer with CL_MEM_USE_HOST_PTR.
2) Create a buffer with CL_MEM_ALLOC_HOST_PTR, map the buffer, initialize it through the mapped pointer (memset), unmap the buffer.
Apart from the fact that the second case creates a prepinned buffer, while in the first case the host pointer is pinned when createBuffer is called, are there any differences in creation, initialization, or eventual transfer performance?
Do both of them allow zero-copy on both integrated and discrete GPUs?
Thank you very much!
If your OS and devices support zero copy, the second case gives you a zero-copy buffer in prepinned memory. The host can access it at peak bandwidth, while the device accesses it at interconnect bandwidth. For small data transfers, the latency of zero copy is lower than the latency of a DMA transfer. As for whether both strategies allow zero copy on integrated and discrete GPUs: I think they do, but I may be wrong!