To summarize, I am getting very poor map/unmap performance. The device is a Radeon 7970 running in a Bulldozer-based machine (so PCIe 2.1 x16, NOT PCIe 3.0).
I tried both CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR, and ran map/unmap at least twice to make sure that the issue is not the pinning cost. For 2.7GB of data the transfer takes about 1.8 seconds, which works out to roughly 1.5GB/sec (which is ridiculously slow).
With a Tesla M2050, map/unmap of 2.5GB of data takes about 0.5 seconds, which is about 5GB/sec, the speed I would expect. Also, on the Tesla the write map copies the data from the card to host memory and back, while the read map correctly copies data only from card to host.
I am not sure whether AMD's implementation uses some trick and skips the device-to-host copy because no kernel had run on the device in my test case (which would be smart); maybe that is why it does not transfer the data in? Either way, AMD loses when map/unmap is used. So what is the reason? A bug in the SDK? The sample program is attached. Can anybody have a look?
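For anyone who wants to check the arithmetic behind the quoted transfer rates, here is a minimal sketch. It is not part of the attached sample program; the helper name `throughput_gb_per_s` is made up for illustration, and it assumes 1 GB = 1e9 bytes and uses the rounded sizes and times from the text above.

```python
def throughput_gb_per_s(num_bytes, seconds):
    """Effective transfer rate in GB/s, taking 1 GB = 1e9 bytes (assumption)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / 1e9


# Rounded figures quoted in the post, not the exact log values.
radeon = throughput_gb_per_s(2_700_000_000, 1.8)
tesla = throughput_gb_per_s(2_500_000_000, 0.5)

print(f"Radeon 7970: {radeon:.2f} GB/s")
print(f"Tesla M2050: {tesla:.2f} GB/s")
```

Plugging in the exact wall times from the logs instead of the rounded ones shifts the numbers slightly but not the overall picture.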
So, can you tell
On Radeon 7970
Size: 225000000 x 4 = 900000000 bytes * 3 = 2700000000
Tahiti
WALL time for CL_MEM_ALLOC_HOST_PTR = 0.00 seconds
CL_MAP_WRITE 0
WALL time for Map #0 = 0.01 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 2.39 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 1.76 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area = 1.31 seconds
WALL time for Unmap = 1.79 seconds
Unmap #events: 3 time: 0.0000 seconds.
Allocating memory using memalign 4096
WALL time for memalign = 0.00 seconds
WALL time for CL_MEM_USE_HOST_PTR = 0.18 seconds
CL_MAP_WRITE 0
WALL time for Map #0 = 0.01 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 2.18 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 1.74 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area = 1.33 seconds
WALL time for Unmap = 1.78 seconds
Unmap #events: 3 time: 0.0000 seconds.
With Tesla M2050
Size: 210000000 x 4 = 840000000 bytes * 3 = 2520000000
Tesla M2050
WALL time for CL_MEM_ALLOC_HOST_PTR = 0.00 seconds
CL_MAP_WRITE 0
WALL time for Map #0 = 0.53 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.53 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 = 0.87 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.53 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 = 0.83 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.10 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 = 0.82 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.10 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 = 0.82 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area = 0.54 seconds
WALL time for Unmap = 0.57 seconds
Unmap #events: 3 time: 0.0000 seconds.
Allocating memory using memalign 4096
WALL time for memalign = 0.00 seconds
WALL time for CL_MEM_USE_HOST_PTR = 0.59 seconds
CL_MAP_WRITE 0
WALL time for Map #0 = 0.51 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.48 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_WRITE 1
WALL time for Map #1 = 0.56 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.48 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 0
WALL time for Map #0 = 0.56 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.01 seconds
Unmap #events: 3 time: 0.0000 seconds.
CL_MAP_READ 1
WALL time for Map #1 = 0.56 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 0.00 seconds
Unmap #events: 3 time: 0.0000 seconds.
Mapping with CL_MAP_WRITE and writing to mapped area
WALL time for Map #1 = 0.56 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for write mapped area = 0.32 seconds
WALL time for Unmap = 0.48 seconds
Unmap #events: 3 time: 0.0000 seconds.
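To compare the two logs above without reading every line by hand, the "WALL time" entries can be pulled out with a small script. This is only a sketch: the function name `parse_wall_times` and the regular expression are my own assumptions based on the line format shown in the output, and the sample text is copied from the second CL_MAP_WRITE run on the Radeon 7970.

```python
import re

# Matches lines such as "WALL time for Unmap = 1.76 seconds".
# The pattern is inferred from the log format above (assumption).
WALL_RE = re.compile(
    r"^WALL time for (?P<label>.+?) = (?P<seconds>\d+(?:\.\d+)?) seconds$"
)


def parse_wall_times(log_text):
    """Return a list of (label, seconds) for every WALL time line."""
    results = []
    for line in log_text.splitlines():
        match = WALL_RE.match(line.strip())
        if match:
            results.append((match.group("label"), float(match.group("seconds"))))
    return results


sample = """CL_MAP_WRITE 1
WALL time for Map #1 = 0.00 seconds
Map #events: 3 time: 0.0000 seconds.
WALL time for Unmap = 1.76 seconds
Unmap #events: 3 time: 0.0000 seconds."""

total_bytes = 2_700_000_000  # size reported in the Radeon 7970 log

for label, secs in parse_wall_times(sample):
    if secs > 0:
        print(f"{label}: {secs:.2f} s, {total_bytes / secs / 1e9:.2f} GB/s")
    else:
        print(f"{label}: {secs:.2f} s (too short to derive a rate)")
```

Entries reported as 0.00 seconds are skipped for the rate calculation, since at that resolution no meaningful throughput can be derived from them.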