9 Replies Latest reply on Feb 12, 2010 1:03 PM by nou

    Memory buffer retaining or re-creating

    Raistmer
      What is more effective?

      My app uses a buffer on the GPU, about 4 MB in size.
      Its size is always the same, but from time to time new data from host memory has to be uploaded; the buffer is then used a few times in kernels before the next update from host memory.
      The question is:
      which course of action is better:

      1) delete and re-create the buffer each time an update from the host is needed, via clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, ...)

      or

      2) allocate the buffer for the lifetime of the app and do updates by mapping it into host address space when needed

      ?

      And an additional question about case 2.
      Can I update the GPU buffer directly in this case, that is, write the results of CPU computations into the buffer one by one? Or do I still need to write the results into some additional host-memory buffer and, only after that buffer is fully updated, upload it to the GPU buffer in one go? Which approach would give better performance?
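
      To make case 2 concrete, here is roughly what I have in mind (just a sketch; the context, queue and buffer names are placeholders, error checks omitted):

      /* one long-lived ~4 MB buffer, created once */
      cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, BUF_SIZE, NULL, &err);

      /* each time new host data is ready: */
      void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, BUF_SIZE, 0, NULL, NULL, &err);
      /* ... write CPU results into p, element by element or via memcpy ... */
      clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
      /* kernels then use buf several times until the next update */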
        • Memory buffer retaining or re-creating
          ibird

          In an application that calls the kernel many consecutive times, I found with a profiler that, in my situation, deleting and re-creating the buffers was a performance bottleneck. So I retain the buffers and reuse them, and I re-create a buffer only if the retained one is not big enough (like the std::vector class).


          I do not use CL_MEM_USE_HOST_PTR;

          instead, I use clEnqueueWriteBuffer and clEnqueueReadBuffer to load the input data and read the output.
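
          Roughly what I do (a sketch, not my exact code; ctx, queue and the host pointers are placeholders, error checks omitted):

          /* keep the cl_mem and its capacity around between kernel launches */
          if (needed_size > buf_capacity) {
              if (buf) clReleaseMemObject(buf);
              buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, needed_size, NULL, &err);
              buf_capacity = needed_size;
          }
          /* upload the new input, run the kernel, read the result back */
          clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, needed_size, host_in, 0, NULL, NULL);
          /* ... clEnqueueNDRangeKernel(...) ... */
          clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, needed_size, host_out, 0, NULL, NULL);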

          • Memory buffer retaining or re-creating
            Raistmer
            Thanks!
            What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?
              • Memory buffer retaining or re-creating
                genaganna


                Originally posted by: Raistmer Thanks! What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?


                It is possible to map a buffer into host address space to read from or write to it. Mapping by itself does not say anything about performance.
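
                For example, reading results back through a mapping would look roughly like this (just a sketch; queue, buf and size are assumed to exist, error handling omitted):

                float *p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                                      0, size, 0, NULL, NULL, &err);
                /* ... read results directly from p ... */
                clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);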

                  • Memory buffer retaining or re-creating
                    Raistmer
                    Originally posted by: genaganna

                    It is possible to map a buffer into host address space to read from or write to it. Mapping by itself does not say anything about performance.



                    LoL, sure, mapping by itself doesn't say anything about performance, but I hope there are some people who have tried this variant and can give some info on its performance versus the other possible methods.
                  • Memory buffer retaining or re-creating
                    gaurav.garg


                    What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?


                    Direct mapping/unmapping is usually slower than using writeBuffer and copyBuffer.

                    I think the best way for you would be to create two CL buffers: one in GPU memory and another in host address space (use CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR). Then do mapping/unmapping on the host buffer and use clEnqueueCopyBuffer to copy the data from host to GPU.

                    This approach gives you the fastest data transfer (the copy goes through pinned memory) and avoids the overhead of creating and destroying CL buffers again and again.
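
                    A rough sketch of that setup (assuming ctx and queue already exist and size is your ~4 MB buffer size; error checks omitted):

                    /* pinned staging buffer in host-accessible memory */
                    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                                   size, NULL, &err);
                    /* working buffer in GPU memory */
                    cl_mem device_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

                    /* per update: fill the pinned buffer through a mapping ... */
                    void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                                 0, size, 0, NULL, NULL, &err);
                    /* ... write the CPU results into p ... */
                    clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);
                    /* ... then copy pinned -> device over PCIe */
                    clEnqueueCopyBuffer(queue, pinned, device_buf, 0, 0, size, 0, NULL, NULL);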

                      • Memory buffer retaining or re-creating
                        nou

                        I tried oclBandwidthTest on ATI

                        --access=mapped

                        Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            3325.4

                         Device to Host Bandwidth, 1 Device(s), Paged memory, mapped access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            3226.4

                         Device to Device Bandwidth, 1 Device(s)
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            43258.2

                        --access=direct

                         Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            2793.0

                         Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            757.9

                         Device to Device Bandwidth, 1 Device(s)
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            43329.9

                        So the mapped-buffer numbers are comparable with the PCIeSpeedTest numbers, and clEnqueueRead is significantly slower than the other approaches.

                          • Memory buffer retaining or re-creating
                            gaurav.garg


                            I tried oclBandwidthTest on ATI

                            --access=mapped

                            Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            3325.4

                             Device to Host Bandwidth, 1 Device(s), Paged memory, mapped access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            3226.4

                             Device to Device Bandwidth, 1 Device(s)
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            43258.2

                            --access=direct

                             Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            2793.0

                             Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            757.9

                             Device to Device Bandwidth, 1 Device(s)
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            43329.9

                            So the mapped-buffer numbers are comparable with the PCIeSpeedTest numbers, and clEnqueueRead is significantly slower than the other approaches.



                            I just reviewed the oclBandwidthTest code, and the timing for mapped access doesn't seem correct. The unmap call is asynchronous and there is no wait for it before the timer is stopped. I am not sure whether the implementation actually makes this API asynchronous, but if it does, the timing is wrong.

                            I usually take these benchmarks with a pinch of salt until I see the source code. This benchmark is released by Nvidia and might not be the best fit for AMD's platform. For example, if I had to benchmark pinned data transfer, I would never do it the way it is done in this benchmark. The right way would be to use clEnqueueCopyBuffer directly, rather than first mapping the pinned buffer on the host and then copying from the mapped pointer via clEnqueueWriteBuffer.
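
                            For example, to time the mapped path correctly, something along these lines would be needed (just a sketch; queue, buf and size are assumed to exist, error checks omitted):

                            cl_event e;
                            void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                                         0, size, 0, NULL, NULL, &err);
                            /* ... fill p with the data being transferred ... */
                            clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, &e);
                            clWaitForEvents(1, &e);  /* unmap is asynchronous: wait before stopping the timer */
                            /* stop the timer here */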


                            Wow, why are read and write so asymmetric? Write is slower too, but the read speed is just terrible... Looks more like AGP than PCI-E...


                            As OpenCL is implemented on top of CAL, I can guess what the reason might be. In CAL there is no way to copy data directly from a host pointer to GPU local memory, so the copy has to happen in two steps: first from the host pointer to a CAL remote resource, and then from the remote resource to the local resource (and vice versa for device-to-host transfers). PCIe speed is usually the same in both directions; the performance difference comes from the step that copies data from the remote resource back to the host pointer.

                              • Memory buffer retaining or re-creating
                                nou

                                But these numbers correspond quite well with my own test, where I map and then unmap the buffer.

                                GPU->CPU: 3021.56 MiB/s
                                CPU->GPU: 2162.36 MiB/s

                                /* map asynchronously, wait for the map to complete, then read its profiling timestamps */
                                float *ptr = (float*)clEnqueueMapBuffer(queue, buff[0], CL_FALSE, CL_MAP_WRITE,
                                                                        0, 16*1024*1024, 0, NULL, &e_write, &err_code);
                                clWaitForEvents(1, &e_write);
                                clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_START, sizeof(long long), &start, NULL);
                                clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_END, sizeof(long long), &end, NULL);

                        • Memory buffer retaining or re-creating
                          Raistmer
                          Thanks for replies!

                          That is via clEnqueueRead, yes?:
                          Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                          Transfer Size (Bytes) Bandwidth(MB/s)
                          33554432 757.9

                          Wow, why are read and write so asymmetric? Write is slower too, but the read speed is just terrible... Looks more like AGP than PCI-E...