I am writing an application that calls the kernel many times in a row. Profiling showed that deleting and recreating the buffers was a performance bottleneck, so I now retain the buffers and reuse them, recreating a buffer only when the retained one is not big enough (like the std::vector class).
I do not use CL_MEM_USE_HOST_PTR; instead I use clEnqueueWriteBuffer and clEnqueueReadBuffer to load the input data and read back the output.
Originally posted by: Raistmer Thanks! What about mapping buffer to host address space and write element by element directly to GPU memory when update from CPU required? Is it possible/faster than having additional buffer in host memory?
It is possible to map a buffer into the host address space to read from or write to it. Mapping by itself does not say anything about performance.
Originally posted by: genaganna
It is possible to map a buffer into the host address space to read from or write to it. Mapping by itself does not say anything about performance.
What about mapping buffer to host address space and write element by element directly to GPU memory when update from CPU required? Is it possible/faster than having additional buffer in host memory?
Direct mapping/unmapping is usually slower than using writeBuffer and copyBuffer.
I think the best way for you would be to create two CL buffers: one in GPU memory and another in the host address space (using CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR). Then map/unmap the host buffer and use clEnqueueCopyBuffer to copy the data from host to GPU.
This approach gives you the fastest data transfer (through pinned memory) and avoids the overhead of creating and destroying CL buffers again and again.
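A minimal sketch of the two-buffer scheme described above, assuming a valid context, queue, and destination device buffer already exist (error handling trimmed for brevity; in real code the pinned staging buffer would be retained and reused rather than released each call):

```c
#include <CL/cl.h>
#include <string.h>

/* Upload host data to a device buffer through a pinned staging buffer. */
void upload_via_pinned(cl_command_queue queue, cl_context ctx,
                       cl_mem dev_buf, const void *src, size_t bytes)
{
    cl_int err;

    /* 1. Host-side staging buffer backed by pinned (page-locked) memory. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                   bytes, NULL, &err);

    /* 2. Map it, copy the application data in, unmap. */
    void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
    memcpy(p, src, bytes);
    clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);

    /* 3. DMA from the pinned staging buffer to the device buffer. */
    clEnqueueCopyBuffer(queue, pinned, dev_buf, 0, 0, bytes, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(pinned);  /* in practice, retain and reuse instead */
}
```

Running this requires an OpenCL runtime and device, so it is shown as a sketch of the call sequence rather than a complete program.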
i tried oclBandwidthTest on ATI
--access=mapped
Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3325.4
Device to Host Bandwidth, 1 Device(s), Paged memory, mapped access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3226.4
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 43258.2
--access=direct
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2793.0
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 757.9
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 43329.9
So mapped-buffer bandwidth is comparable with the PCIeSpeedTest numbers, and clEnqueueReadBuffer is significantly slower than the other approaches.
Just reviewed the oclBandwidthTest code; the timing for mapped access doesn't seem correct. The unmap call is asynchronous and there is no wait on it before the timer stops. I am not sure whether the implementation makes this API asynchronous, but if it does, the timing is wrong.
I usually take these benchmarks with a pinch of salt until I see the source code. This benchmark is released by Nvidia and might not be best for AMD's platform. For example, to benchmark pinned data transfer I would never do it the way this benchmark does. The right way would be to use clEnqueueCopyBuffer directly rather than first mapping the pinned buffer on the host and then copying through the pointer via clEnqueueWriteBuffer.
Wow, why are read and write so asymmetric? Write is slower too, but the read speed is just terrible... it looks more like AGP than PCI-E.
As OpenCL is implemented on top of CAL, I can guess what the reason might be. In CAL there is no way to copy data directly from a host pointer to GPU local memory, so we have to copy it in two steps: first from the host pointer to a CAL remote resource, then from the remote to the local resource (and vice versa for device-to-host transfers). Usually the PCIe speed is the same in both directions; the performance difference comes from copying data between the remote resource and the host pointer.
But these numbers correspond quite well with my own test, where I map and then unmap the buffer:
GPU->CPU: 3021.56 MiB/s
CPU->GPU: 2162.36 MiB/s
float *ptr = (float *)clEnqueueMapBuffer(queue, buff[0], CL_FALSE, CL_MAP_WRITE,
                                         0, 16 * 1024 * 1024, 0, NULL,
                                         &e_write, &err_code);
clWaitForEvents(1, &e_write);
clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);