Hi everybody.
I'm reading the latest version of the AMD APP programming guide (June 2012).
I have a problem in fully understanding the OpenCL Memory Objects part (sec. 4.5).
Unless otherwise specified, assume the kernel is executed on a discrete GPU.
1) In sec. 4.5.1.2 the guide says:
Currently, the runtime recognizes only data that is in pinned host memory for operation arguments that are memory objects it has allocated in pinned host memory. For example, the buffer argument of clEnqueueReadBuffer/clEnqueueWriteBuffer and image argument of clEnqueueReadImage/clEnqueueWriteImage. It does not detect that the ptr arguments of these operations addresses pinned host memory, even if they are the result of clEnqueueMapBuffer/clEnqueueMapImage on a memory object that is in pinned host memory.
Now, suppose I create a buffer using CL_MEM_ALLOC_HOST_PTR and get a pointer to it using clEnqueueMapBuffer in order to initialize the contents of the buffer directly. Does the pinning happen when I create the buffer (pre-pinning) or when I map it?
Is the pre-pinning mechanism the same regardless of the size of the memory area to be pinned?
In addition, suppose that I use the mapped pointer as the src of a write into a "normal" buffer (no flags). Since the src is not recognized as pinned, what happens? Is the src copied to another pinned memory area?
Is the content of the buffer cached on the CPU when the CPU accesses it, regardless of the kernel access mode (READ_ONLY, READ_WRITE, WRITE_ONLY)?
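To make the scenario concrete, here is a sketch of the pattern in question 1 (error checking omitted; `ctx`, `queue`, and `SIZE` are assumed to be set up elsewhere, and this needs an OpenCL device to actually run):

```c
cl_int err;

/* Buffer allocated in pinned host memory -- the question is whether
 * pinning happens here (at creation) or at the map below. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, SIZE, NULL, &err);

/* Map it so the host can initialize the contents directly. */
void *ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                               0, SIZE, 0, NULL, NULL, &err);
memset(ptr, 0xAB, SIZE);              /* host-side initialization */

/* A "normal" buffer with no special flags. */
cl_mem normal = clCreateBuffer(ctx, 0, SIZE, NULL, &err);

/* Per the guide, the runtime does NOT recognize ptr as pinned here,
 * so this write may stage the data through another pinned area. */
clEnqueueWriteBuffer(queue, normal, CL_TRUE, 0, SIZE, ptr, 0, NULL, NULL);

clEnqueueUnmapMemObject(queue, pinned, ptr, 0, NULL, NULL);
```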
2) In sec. 4.5.2 they say:
To avoid over-allocating device memory for memory objects that are never used on that device, space is not allocated until first used on a device-by-device basis.
This is quite difficult to understand. Suppose I create a buffer and call clEnqueueWriteBuffer to initialize it. Since the guide says that allocation happens at first kernel access, where is the data stored before the kernel executes (or if I never execute any kernel)?
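The sequence being asked about is just this (a sketch; `ctx`, `queue`, `host_data`, and `SIZE` are assumed to exist, and an OpenCL device is required):

```c
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, SIZE, NULL, &err);

/* The buffer is written before any kernel has been enqueued.
 * If device-side allocation is deferred until "first kernel access",
 * where does this data live in the meantime? */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, SIZE, host_data, 0, NULL, NULL);

/* ... no kernel has run yet ... */
```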
3) In table 4.2 it is said that CL_MEM_USE_HOST_PTR causes a copy when mapped. Nevertheless, in sec. 4.5.4.1 the guide says CL_MEM_USE_HOST_PTR supports zero copy. Is this an error, or is there something I do not understand?
Thank you very much!
1) It means that the runtime will copy the data from your pointer to a separate pinned memory location and then transfer it to device memory; it doesn't detect that it would be possible to copy it directly. The CL_MEM_READ* flags are mainly there to determine which buffers need to be synchronized between multiple devices.
2) They are stored in host memory.
3) You can see that zero copy is supported when the device is the CPU.
Hi nou,
well, regarding 2), I'm OK with storing data in host memory, but only for pre-pinned host buffers. What about host-visible device memory buffers?
In the first case the data is stored in host memory, and this is fine since the buffer is created with CL_MEM_ALLOC_HOST_PTR and the "temporary" host memory area can be used as the buffer itself once it has been pinned. No copy is required at first kernel access.
In the second case, suppose I use CL_MEM_USE_PERSISTENT_MEM_AMD (host-visible device memory). I create the buffer and map it to initialize it. Allocation of the buffer on the device is deferred until first access, and you say that before the kernel executes the data is stored in host memory. Therefore I conclude that the data used to initialize the buffer with clEnqueueWrite or memcpy is stored in host memory, and only at first kernel access is it moved to the host-visible device memory area. So where is the zero copy? A copy is always made between the temporary host area and the host-visible device memory buffer allocated at first kernel access.
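For reference, this is the pattern under discussion, using the AMD extension flag (a sketch; `ctx`, `queue`, `host_data`, and `SIZE` are assumed, error checking omitted, and it requires an AMD OpenCL runtime):

```c
cl_int err;

/* Host-visible device memory (cl_amd_device_memory_flags extension). */
cl_mem persist = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                SIZE, NULL, &err);

/* Initialize the buffer from the host through a mapping. The open
 * question: does a temporary host copy exist before the device-side
 * allocation happens at first device access? */
void *p = clEnqueueMapBuffer(queue, persist, CL_TRUE, CL_MAP_WRITE,
                             0, SIZE, 0, NULL, NULL, &err);
memcpy(p, host_data, SIZE);
clEnqueueUnmapMemObject(queue, persist, p, 0, NULL, NULL);
```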
Am I wrong?
IMHO these temporary buffers are used only when the buffer is supposed to be located on the device; otherwise it uses the pre-pinned/device-visible memory directly.
Well, I'm asking because I'm trying to understand how many copies are actually done in zero-copy mode. If you are right, at least one copy due to deferred allocation has to be made.
I am not 100% sure, but maybe the buffer is allocated on the device when you just create it without CL_MEM_COPY_HOST_PTR and then use a Write/Map operation to initialize it, since that operation does have a device queue.
Hi,
I am not 100% sure either. I created a buffer using CL_MEM_COPY_HOST_PTR and CL_MEM_USE_HOST_PTR (not at the same time), then read the buffer, and I could get the data from it. But according to the guide ("space is not allocated until first used on a device-by-device basis"), I think the buffer is not allocated before the first command is enqueued.
I'm reading the guide too, and I found the memory objects part the most confusing so far. I wish they cleaned it up a bit.
Thanks for letting us know that the programming guide needs clarification on this point. I'll send your feedback to the right person at AMD.
Cheers!
Kristen
We need an AMD guy here.
I would agree if the guide said "allocation is deferred until needed". I do a clEnqueueWrite or clEnqueueMap + memcpy, so allocation is needed, and the runtime does it.
But reading that it is deferred until first kernel access confuses me, especially regarding the implementation of zero copy. The problem is simple: if I do a clEnqueueWrite before executing a kernel, and the data is placed in a temporary memory area while waiting for the buffer to be allocated, and that area doesn't match the buffer's flag requirements (e.g. PERSISTENT_MEM), I guess at least one copy is needed even in zero-copy mode.
EnqueueWriteBuffer by definition does a copy; how else could that API possibly be implemented?
Hi guys,
I'm just a little confused. I think zero copy means that we create memory on the GPU and the CPU can access it directly. Besides, I think that when we create a buffer, it is not allocated, whether it is zero copy or not, until we invoke clEnqueueWriteBuffer, clEnqueueReadBuffer, or some other enqueue command. This is my thought; maybe it's wrong. Here is something about zero copy: http://blogs.amd.com/developer/2011/08/01/cpu-to-gpu-data-transfers-exceed-15gbs-using-apu-zero-copy...
The documentation was not very clear. "first kernel access" should be "first device access", indicating that any API call that requires the device to access a resource would cause the runtime to allocate the resource.
I hope that's more clear.
Thanks,
Jeff
Hi Jeff,
What if the host is the first to access the buffer (i.e. clEnqueueWrite or clEnqueueMap + memcpy)?
The runtime would still know for which device to allocate the memory based on the command queue, so it would allocate the device memory at that time.