According to the AMD OpenCL Programming Guide, CL_MEM_USE_HOST_PTR should give you pre-pinned memory, and this is supposed to be efficient. I am testing this on Tahiti (on a motherboard with PCIe 2.x), but I am getting strange results.
I create 2 buffers with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR and 1 buffer with CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR; each holds a 15000x15000 dense float matrix, totaling ~1.8 GB for the two input buffers. I run clAmdBlasSgemm on these, then load the result back to host memory. I use enqueue write/read buffer commands, both blocking and non-blocking.
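For scale, the arithmetic behind those sizes (plain Python, using the 15000x15000 float dimensions stated above):

```python
# Each matrix is 15000 x 15000 single-precision floats (4 bytes each).
rows = cols = 15000
bytes_per_float = 4

matrix_bytes = rows * cols * bytes_per_float   # 900,000,000 bytes, ~0.9 GB per matrix
two_inputs = 2 * matrix_bytes                  # ~1.8 GB for the two input buffers

print(matrix_bytes, two_inputs)
```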
The results from event timing are: writing matrix1 0.1638 s, matrix2 0.1631 s, clAmdBlasSgemm 4.137 s, reading the result 0.1448 s; yet according to my host code the whole sequence takes about 5.8 seconds (wall time). If I change blocking to non-blocking but put clFinish() after each operation, I get writes of 1.735 s and 1.598 s, sgemm 4.138 s, and a read of 0.1452 s. In neither case do these total up to 5.8 seconds.
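As a sanity check, the effective transfer rates implied by the blocking event times above work out to roughly 5.5–6.2 GB/s, which is plausible for PCIe 2.0 (a quick back-of-the-envelope in Python; the times are the ones quoted above):

```python
matrix_bytes = 15000 * 15000 * 4   # ~0.9 GB per matrix, as in the setup above

# Blocking transfer times reported by the events
for name, seconds in [("write matrix1", 0.1638),
                      ("write matrix2", 0.1631),
                      ("read result",   0.1448)]:
    gb_per_s = matrix_bytes / seconds / 1e9
    print(f"{name}: {gb_per_s:.2f} GB/s")
```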
If I get rid of CL_MEM_USE_HOST_PTR on the buffers, the results are: writing (total) 0.3458 s, sgemm 4.137 s, and reading 0.2411 s, with a wall time of 4.9 seconds. I tried non-blocking read/write with clFinish() as well and got exactly the same event timings.
So, without CL_MEM_USE_HOST_PTR, things go much quicker??? Is there a mistake in the manual???
I also tried the SDK BufferBandwidth example, and it does not use HOST_PTR either...
$ ./BufferBandwidth -t 4
Device 0            Tahiti
Build:              DEBUG
GPU work items:     32768
Buffer size:        33554432
CPU workers:        1
Timing loops:       20
Repeats:            1
Kernel loops:       1
inputBuffer:        CL_MEM_READ_ONLY
outputBuffer:       CL_MEM_WRITE_ONLY

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

PCIe B/W device->host: 0.005904 s    5.68 GB/s
PCIe B/W host->device: 0.005211 s    6.44 GB/s

Passed!
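The GB/s figures in that output are just the buffer size divided by the averaged time; the reported 33554432-byte buffer and the two times reproduce the 5.68 and 6.44 GB/s numbers (plain Python):

```python
buffer_bytes = 33554432  # buffer size from the BufferBandwidth output above

d2h = buffer_bytes / 0.005904 / 1e9   # device->host
h2d = buffer_bytes / 0.005211 / 1e9   # host->device

print(f"device->host: {d2h:.2f} GB/s")
print(f"host->device: {h2d:.2f} GB/s")
```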
Thanks,
Evren