
CL_MEM_USE_HOST_PTR slower than not using it...

Question asked by yurtesen on May 27, 2012
Latest reply on Jun 24, 2012 by yurtesen

According to the AMD OpenCL Programming Guide, CL_MEM_USE_HOST_PTR should cause pre-pinned memory to be used, and this is supposed to be efficient. I am testing the following on Tahiti (on a motherboard with PCIe 2.x), but I am getting strange results.


I have 2 buffers created with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, and 1 buffer with CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR; each holds a 15000x15000 float dense matrix, totaling ~1.8GB. I am running clAmdBlasSgemm on these, then loading the result back to host memory. I am using enqueue write/read buffer commands, with blocking, and I also tried non-blocking.
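As a quick sanity check on the sizes (assuming 4-byte floats, decimal GB):

```python
# Size of one 15000x15000 dense float (4-byte) matrix
bytes_per_matrix = 15000 * 15000 * 4
print(bytes_per_matrix / 1e9)      # 0.9 GB per matrix
print(2 * bytes_per_matrix / 1e9)  # 1.8 GB for the two input matrices
```

So the ~1.8GB figure matches the two input buffers; with the output buffer included, the three matrices together are ~2.7GB.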


The event timings are: writing matrix1 0.1638 seconds, matrix2 0.1631 seconds, clAmdBlasSgemm 4.137 seconds, reading 0.1448 seconds, yet according to my host code it takes about 5.8 seconds (wall time) to accomplish all of this. If I change blocking to non-blocking but put a clFinish() after each operation, I get writes of 1.735 and 1.598 seconds, sgemm 4.138 seconds and read 0.1452 seconds. In either case the event times do not total up to 5.8 seconds.



If I get rid of CL_MEM_USE_HOST_PTR on the buffers, the results are: writing (total) 0.3458 seconds, sgemm 4.137 seconds and reading 0.2411 seconds, and wall time shows 4.9 seconds. I tried non-blocking read/write with clFinish() as well and got exactly the same results from the events.
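Summing the event times against the reported wall times makes the discrepancy easy to see (numbers copied from the blocking runs above):

```python
# Event times with CL_MEM_USE_HOST_PTR (blocking writes/reads)
with_host_ptr = 0.1638 + 0.1631 + 4.137 + 0.1448      # ~4.61 s
# Event times without CL_MEM_USE_HOST_PTR
without_host_ptr = 0.3458 + 4.137 + 0.2411            # ~4.72 s

print(round(5.8 - with_host_ptr, 4))     # ~1.19 s unaccounted for
print(round(4.9 - without_host_ptr, 4))  # ~0.18 s unaccounted for
```

So with CL_MEM_USE_HOST_PTR roughly a second more of wall time falls outside the events, which suggests the extra cost is in buffer setup/pinning rather than in the enqueued operations themselves.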


So, without CL_MEM_USE_HOST_PTR, things go much quicker? Is there a mistake in the manual?



I also tried the SDK BufferBandwidth example, which does not use HOST_PTR either...


$ ./BufferBandwidth  -t 4



Device  0            Tahiti

Build:               DEBUG

GPU work items:      32768

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        1

inputBuffer:         CL_MEM_READ_ONLY

outputBuffer:        CL_MEM_WRITE_ONLY



AVERAGES (over loops 2 - 19, use -l for complete log)



          PCIe B/W device->host:  0.005904 s       5.68 GB/s

          PCIe B/W host->device:  0.005211 s       6.44 GB/s
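The reported bandwidths are just the buffer size divided by the transfer time (in decimal GB), which is easy to verify:

```python
buffer_size = 33554432  # bytes, from the BufferBandwidth run above

print(round(buffer_size / 0.005904 / 1e9, 2))  # device->host: 5.68 GB/s
print(round(buffer_size / 0.005211 / 1e9, 2))  # host->device: 6.44 GB/s
```

Both figures are in the expected range for pinned transfers over PCIe 2.x.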