I would like to allocate a region of host memory that I can then use to transfer data to and from GPU devices at full speed. With CUDA this is trivially easy using either of these functions:
I understand that there is no straightforward way to do this using OpenCL, but both Nvidia and AMD suggest the same workaround: an OpenCL buffer that the runtime is supposed to allocate as pinned host memory, which is then accessed through mapping.
The description provided by AMD is in Section 4.6 of the July 2012 'AMD Accelerated Parallel Processing OpenCL Programming Guide'. My understanding of the process is that you create an OpenCL buffer using clCreateBuffer() with either the CL_MEM_ALLOC_HOST_PTR flag, or the CL_MEM_USE_HOST_PTR flag combined with a pointer to previously allocated memory aligned to 256 bytes (using, for example, posix_memalign()). You can then transfer data between this buffer and a device buffer using either clEnqueueCopyBuffer() or clEnqueueWriteBuffer()/clEnqueueReadBuffer(). Of course you also have to use clEnqueueMapBuffer()/clEnqueueUnmapMemObject() in order to make the host pointer available. In summary the patterns are:
Host to device transfer:
map -> use -> unmap -> copy
map -> use -> write -> unmap
Device to host transfer:
copy -> map -> use -> unmap
map -> read -> use -> unmap
Unfortunately, this does not work quite as intended on my system. Although the bandwidth using these patterns is as high as expected, the 'pre-pinned' buffer consumes device memory on whatever device is associated with the command queue passed to clEnqueueMapBuffer() or clEnqueueCopyBuffer(), as soon as either function is called. I really hope this is a bug that will be fixed and not a 'feature'. It should be self-explanatory why you do not want to consume scarce device memory for buffers that are never used by the device.
My system: Arch Linux, 64 bit.
Is there any way to allocate a pinned host pointer that does not also consume device memory?