As discussed in another thread, "Optimization Guide Memory Allocation": according to the Optimization Guide, when the fglrx display driver supports VM, transferring data from the application to the GPU kernel should be zero-copy, provided the buffer is created with the appropriate flags in CreateBuffer and MapBuffer is used for the transfer. Since this is written in the guide, I assume it is meant to work in other SDKs as well.
In my case:
clinfo | grep Driver
Driver version: 1445.5 (VM)
I'm using CL_MEM_ALLOC_HOST_PTR in CreateBuffer and MapBuffer for the data transfer. For exactly the same amount of data, CodeXL reports:
A) Read/Write Buffers
WriteBuffer: 173 ms for 6241 calls, 0.02773 ms each
ReadBuffer: 122 ms for 390 calls, 0.312 ms each
B) Map/Unmap Buffers
MapBuffer: 193 ms for 6630 calls, 0.02907 ms each
UnmapBuffer: 120 ms for 6630 calls, about 0.018 ms each
Note that the total for the Read/WriteBuffer calls (295 ms) is actually slightly less than the total for the Map/UnmapBuffer calls (313 ms), a far cry from the zero-copy it should be.
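For anyone who wants to check the comparison, here is a minimal sketch of the arithmetic in Python. It only re-adds the CodeXL figures quoted above; the dictionary and function names are made up for illustration and are not part of CodeXL or OpenCL.

```python
# Timings copied from the CodeXL figures quoted above: (total_ms, calls).
# The dictionary and function names are illustrative only.
read_write = {"WriteBuffer": (173, 6241), "ReadBuffer": (122, 390)}
map_unmap = {"MapBuffer": (193, 6630), "UnmapBuffer": (120, 6630)}


def total_ms(timings):
    """Sum the total time in ms over all API entries of one approach."""
    return sum(ms for ms, _calls in timings.values())


def per_call_ms(timings, name):
    """Average time per call in ms for one API entry."""
    ms, calls = timings[name]
    return ms / calls


rw_total = total_ms(read_write)
map_total = total_ms(map_unmap)

print(f"Read/Write total: {rw_total} ms")
print(f"Map/Unmap total:  {map_total} ms")
print(f"Difference:       {map_total - rw_total} ms")
print(f"UnmapBuffer per call: {per_call_ms(map_unmap, 'UnmapBuffer'):.4f} ms")
```

The totals come out at 295 ms for Read/Write and 313 ms for Map/Unmap, so the mapped path is not cheaper here. A true zero-copy path would be expected to show a much lower Map/Unmap total.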