According to the AMD OpenCL Programming Guide, CL_MEM_USE_HOST_PTR should give you pre-pinned memory, and this is supposed to be efficient. I am testing this on a Tahiti card (on a motherboard with PCIe 2.x), but I am getting strange results.
I have two buffers created with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR and one buffer created with CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR. Each holds a 15000x15000 dense float matrix, so the two input buffers alone are ~1.8 GB. I run clAmdBlasSgemm on these, then load the result back to host memory. I am using enqueue write/read buffer commands, with blocking calls, and also tried non-blocking.
The result from event profiling, with blocking calls, is: writing matrix1 0.1638 seconds, writing matrix2 0.1631 seconds, clAmdBlasSgemm 4.137 seconds, reading the result 0.1448 seconds. According to my host code, it takes about 5.8 seconds to accomplish all of this (wall time). If I change blocking to non-blocking but put clFinish() after each operation, I get writes of 1.735 and 1.598 seconds, sgemm 4.138 seconds, and a read of 0.1452 seconds. In either case the event times do not add up to the 5.8 seconds of wall time.
If I get rid of CL_MEM_USE_HOST_PTR on the buffers, the results are: writing (total) 0.3458 seconds, sgemm 4.137 seconds, reading 0.2411 seconds, and the wall time is 4.9 seconds. I tried non-blocking read/write with clFinish() as well and got exactly the same event times.
So, without CL_MEM_USE_HOST_PTR, things go much quicker? Is there a mistake in the manual?
I also tried the SDK's BufferBandwidth example, and it does not use CL_MEM_USE_HOST_PTR either:
$ ./BufferBandwidth -t 4
Device 0 Tahiti
GPU work items: 32768
Buffer size: 33554432
CPU workers: 1
Timing loops: 20
Kernel loops: 1
AVERAGES (over loops 2 - 19, use -l for complete log)
PCIe B/W device->host: 0.005904 s 5.68 GB/s
PCIe B/W host->device: 0.005211 s 6.44 GB/s