According to the AMD OpenCL Programming Guide, CL_MEM_USE_HOST_PTR should give you pre-pinned memory, and this is supposed to be efficient. I am testing this on Tahiti (on a motherboard with PCIe 2.x), but I am getting strange results.
I create 2 buffers with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR and 1 buffer with CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR; each holds a 15000x15000 dense float matrix, totaling ~1.8 GB for the two input buffers. I run clAmdBlasSgemm on these, then load the result back to host memory. I use enqueue write/read buffer commands, both blocking and non-blocking.
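For scale, the arithmetic behind those sizes (plain Python, using the 15000x15000 float dimensions stated above):

```python
# Each matrix is 15000 x 15000 single-precision floats (4 bytes each).
rows = cols = 15000
bytes_per_float = 4

matrix_bytes = rows * cols * bytes_per_float   # 900,000,000 bytes, ~0.9 GB per matrix
two_inputs = 2 * matrix_bytes                  # ~1.8 GB for the two input buffers

print(matrix_bytes, two_inputs)
```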
The results from event timing are: writing matrix1 0.1638 s, matrix2 0.1631 s, clAmdBlasSgemm 4.137 s, reading the result 0.1448 s; yet according to my host code the whole sequence takes about 5.8 seconds (wall time). If I change blocking to non-blocking but put clFinish() after each operation, I get writes of 1.735 s and 1.598 s, sgemm 4.138 s, and a read of 0.1452 s. In neither case do these total up to 5.8 seconds.
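As a sanity check, the effective transfer rates implied by the blocking event times above work out to roughly 5.5–6.2 GB/s, which is plausible for PCIe 2.0 (a quick back-of-the-envelope in Python; the times are the ones quoted above):

```python
matrix_bytes = 15000 * 15000 * 4   # ~0.9 GB per matrix, as in the setup above

# Blocking transfer times reported by the events
for name, seconds in [("write matrix1", 0.1638),
                      ("write matrix2", 0.1631),
                      ("read result",   0.1448)]:
    gb_per_s = matrix_bytes / seconds / 1e9
    print(f"{name}: {gb_per_s:.2f} GB/s")
```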
If I get rid of CL_MEM_USE_HOST_PTR on the buffers, the results are: writing (total) 0.3458 s, sgemm 4.137 s, and reading 0.2411 s, with a wall time of 4.9 seconds. I tried non-blocking read/write with clFinish() as well and got exactly the same event timings.
So, without CL_MEM_USE_HOST_PTR, things go much quicker??? Is there a mistake in the manual???
I also tried the SDK BufferBandwidth example, and it does not use HOST_PTR either...
$ ./BufferBandwidth -t 4
Device 0            Tahiti
Build:              DEBUG
GPU work items:     32768
Buffer size:        33554432
CPU workers:        1
Timing loops:       20
Repeats:            1
Kernel loops:       1
inputBuffer:        CL_MEM_READ_ONLY
outputBuffer:       CL_MEM_WRITE_ONLY

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

PCIe B/W device->host: 0.005904 s    5.68 GB/s
PCIe B/W host->device: 0.005211 s    6.44 GB/s

Passed!
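The GB/s figures in that output are just the buffer size divided by the averaged time; the reported 33554432-byte buffer and the two times reproduce the 5.68 and 6.44 GB/s numbers (plain Python):

```python
buffer_bytes = 33554432  # buffer size from the BufferBandwidth output above

d2h = buffer_bytes / 0.005904 / 1e9   # device->host
h2d = buffer_bytes / 0.005211 / 1e9   # host->device

print(f"device->host: {d2h:.2f} GB/s")
print(f"host->device: {h2d:.2f} GB/s")
```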
Thanks,
Evren