Archives Discussions

vanja_z · 08-05-2012

I would like to allocate a region of host memory that I can then use to transfer data to and from GPU devices at full speed. In CUDA this is trivially easy with either of these pairs of functions:

cuMemHostRegister()/cudaHostRegister()

cuMemAllocHost()/cudaHostAlloc()

I understand that there is no straightforward way to do this using OpenCL, but that both Nvidia and AMD suggest the same workaround: an OpenCL buffer that the runtime is supposed to allocate as pinned host memory and that is then accessed through mapping.

The description provided by AMD is in Section 4.6 of the July 2012 'AMD Accelerated Parallel Processing OpenCL Programming Guide'. My understanding of the process is that you create an OpenCL buffer using clCreateBuffer() with either the CL_MEM_ALLOC_HOST_PTR flag, or the CL_MEM_USE_HOST_PTR flag combined with a pointer to previously allocated memory aligned to 256 bytes (using, for example, posix_memalign()). You can then transfer data between this buffer and a device buffer using either clEnqueueCopyBuffer() or clEnqueueWriteBuffer()/clEnqueueReadBuffer(). Of course, you also have to use clEnqueueMapBuffer()/clEnqueueUnmapMemObject() to make the host pointer available. In summary, the patterns are:

Host to device transfer.

map -> use -> unmap -> copy

map -> use -> write -> unmap

Device to host transfer.

copy -> map -> use -> unmap

map -> read -> use -> unmap
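For the CL_MEM_USE_HOST_PTR route described above, the 256-byte-aligned host allocation could be sketched roughly as follows. This is only a minimal illustration: the helper names alloc_aligned_host_buffer() and is_host_buffer_aligned() and the HOST_BUFFER_ALIGNMENT macro are made up for this example, and the 256-byte figure is simply the alignment quoted from the programming guide.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment quoted in the post for CL_MEM_USE_HOST_PTR buffers. */
#define HOST_BUFFER_ALIGNMENT 256

/* Hypothetical helper: returns `size` bytes aligned to 256 bytes,
 * or NULL on failure. Release the memory with free(). */
void *alloc_aligned_host_buffer(size_t size)
{
    void *ptr = NULL;
    /* posix_memalign() returns 0 on success, an error number otherwise. */
    if (posix_memalign(&ptr, HOST_BUFFER_ALIGNMENT, size) != 0)
        return NULL;
    return ptr;
}

/* Hypothetical helper: returns 1 if `ptr` is 256-byte aligned, else 0. */
int is_host_buffer_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % HOST_BUFFER_ALIGNMENT) == 0;
}
```

The returned pointer would then be passed as the host_ptr argument of clCreateBuffer() together with CL_MEM_USE_HOST_PTR, and must stay allocated for as long as the buffer object exists.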

Unfortunately, this does not work quite as intended on my system. Although the bandwidth using these patterns is as high as expected, the 'pre-pinned' buffer consumes device memory on whatever device is associated with the command queue passed to either clEnqueueMapBuffer() or clEnqueueCopyBuffer(), as soon as these functions are called. I really hope this is a bug that will be fixed and not a 'feature'. I think it is self-explanatory why you do not want to consume scarce device memory for buffers that are not used by the device.

Testing system:

HD6950

Arch Linux 64 bit

Catalyst 12.6

Question:

Is there any way to allocate a pinned host pointer that does not also consume device memory?

Regards,

Vanja

Wenju · 08-05-2012

Hi Vanja,

http://devgurus.amd.com/message/1282336#1282336

http://blogs.amd.com/developer/2011/08/01/cpu-to-gpu-data-transfers-exceed-15gbs-using-apu-zero-copy...

These may be useful.

vanja_z · 08-09-2012

Hi Wenju, no, those links don't help. Can anyone from AMD chime in on this?

Wenju · 08-09-2012

As far as I know, only these functions support prepinned memory:

clEnqueueRead/WriteBuffer,

clEnqueueRead/WriteImage,

clEnqueueRead/WriteBufferRect (Windows only).

And for data transfers, a buffer created with the CL_MEM_USE_HOST_PTR flag will always reside in prepinned memory.

vanja_z · 08-12-2012

Wenju, you do not appear to understand the question. I understand how to use pre-pinned memory according to the description in the programming guide, and I am getting the expected speeds in my benchmark program.

My problem is that I want to allocate pre-pinned host memory only, and each of the methods outlined in the programming guide allocates pre-pinned host memory plus a potentially unused dummy buffer on the device.

Wenju · 08-12-2012

Sorry, maybe I didn't explain it clearly.

1. Create the buffer with CL_MEM_USE_HOST_PTR.

2. Use only the following enqueue commands, which do not allocate a device buffer: clEnqueueRead/WriteBuffer, clEnqueueRead/WriteImage, clEnqueueRead/WriteBufferRect.

3. Never use the buffer as a kernel argument.

If you meet these three conditions, I think you can get what you want. But if you want to use this kind of buffer in a kernel, I think it's impossible. If a buffer is never going to be used in a kernel, why create it at all? In short, no way. Sorry again.

yurtesen · 08-17-2012

If you use CL_MEM_USE_HOST_PTR, the data is copied to device memory when you map the buffer (according to Table 4.2 in the AMD OpenCL guide), unless the device is a CPU, in which case the map is a zero-copy operation.

I think you should use CL_MEM_ALLOC_HOST_PTR. However, I previously ran into some undocumented problems with this on Linux. The buffer should be smaller than ~200 MB. (There is a setting that raises this limit, but only up to 1/8 of the total memory in your machine.) Larger buffer objects cause strange things to happen. For example, the map/unmap zero-copy behavior stops working (only on Linux; it appears to work fine on Windows). I don't know what the AMD OpenCL runtime does instead of zero-copy, but whatever it does is slow...

I imagine it could also be possible to create another context with only a CPU device, so that a memory object you create in that context won't be copied to a GPU device. *I think* it should still be possible to copy data from one memory object to another in different contexts. Did you try that? (I might be wrong, of course)...


Pre-pinned buffer consuming device memory