
dkissick
Journeyman III

cl_mem objects in multidevice contexts

Can anyone help me understand how clCreateBuffer behaves in multi-device contexts, specifically with respect to the different available flags (USE_, COPY_, and ALLOC_HOST_PTR)?

If I have a multidevice context with 3 discrete GPUs and use the COPY_HOST_PTR flag, to which device is the data physically copied? If it's copied to one device, will it be automatically recopied to another device if I try to use it in another device's command queue?
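
To make that concrete, here is roughly what I mean (a fragment with hypothetical names, no error checking, and the usual platform/device query boilerplate omitted):

#include <CL/cl.h>

/* Three discrete GPUs in one context; buffer created with COPY_HOST_PTR. */
cl_device_id gpus[3];   /* assume these were already obtained via clGetDeviceIDs() */
cl_int err;
cl_context ctx = clCreateContext(NULL, 3, gpus, NULL, NULL, &err);

float host_data[1024] = {0};
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            sizeof(host_data), host_data, &err);
/* At this point: which of the three devices, if any, physically holds the data? */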

If I have a multi-device context with an APU (CPU and GPU) and a discrete GPU, what do these flags do? I think (someone correct me if I'm wrong) COPY and ALLOC result in zero-copy for CPU and APU devices, but does the extra discrete GPU change things, especially in light of the previous question about which device the data is copied to?

Thanks in advance!

nou
Exemplar

Yes, it will be copied to wherever it needs to be. Refer to the AMD OpenCL Programming Guide for how the flags affect buffer placement.


Thanks for the quick reply. Table 4.2 was very helpful, but it brings up a question: what does VM stand for here, and how do I tell if it is enabled (or enable it)?


It stands for Virtual Memory. It should be enabled for 7xxx-series cards, and you can check it in the device's OpenCL version string: run clinfo and there should be "VM".

VM indicates the ability of the GPU card to serve memory accesses of OpenCL kernels from pinned system memory in the host.

i.e., if you clCreateBuffer() with ALLOC_HOST_PTR (AHP) and pass that buffer as an argument to your kernel, the kernel will read directly from pinned system memory (provided the GPU card and driver support VM, as seen in clinfo).
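
A rough sketch of that check in code (hypothetical names; assumes <string.h>/<stdio.h> and a cl_device_id named device). It simply looks for the "VM" marker in the same version string that clinfo prints; this is AMD-specific behavior, not part of the OpenCL spec:

char version[256];
clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
if (strstr(version, "VM") != NULL)
    printf("VM appears to be enabled: %s\n", version);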

You can check out the buffer-bandwidth sample: the kernel's write bandwidth with AHP is only around 5 GB/s, whereas if the buffer is allocated on the GPU you can see ~100 GB/s (assuming VM is enabled).

With this knowledge of VM, Table 4.2 should be a cakewalk to understand.

All the best!

And please post here if you have any doubts.

Thanks,

himanshu_gautam
Grandmaster

dkissick wrote:

Can anyone help me understand how clCreateBuffer behaves in multi-device contexts, specifically with respect to the different available flags (USE_, COPY_, and ALLOC_HOST_PTR)?

USE_HOST_PTR (UHP) - After clCreateBuffer(), the host pointer is owned by the OpenCL runtime.

If you need the latest data, do a Map() and examine it. While the buffer is mapped, you have control of it.

When you are done, Unmap() to relinquish control back to the OpenCL runtime.
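
A rough sketch of that handshake (hypothetical names; assumes a context ctx, a command queue named queue, and cl_int err already set up):

float host_data[1024];
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            sizeof(host_data), host_data, &err);

/* ... enqueue kernels that use buf ... */

/* Take control back: blocking map for reading. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                       0, sizeof(host_data),
                                       0, NULL, NULL, &err);
/* ... examine p[...]; the data is up to date while mapped ... */

/* Done: hand control back to the runtime. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);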

The OpenCL run-time decides when to copy and to which device to copy; this depends purely on what you do with the buffer.

And the run-time is quite smart.

If you don't use the buffer, the run-time won't even allocate anything.

You may want to check the clEnqueueMigrateMemObjects() API in OpenCL 1.2 (note that this API was introduced only in 1.2, so before using it, give some thought to which machines you want your code to run on -- note that NVIDIA has not released 1.2 support yet).
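
A minimal sketch of that call (OpenCL 1.2+, hypothetical names): it explicitly moves buf to the device that owns queue_gpu1 ahead of time, instead of letting the runtime migrate it lazily at first use:

clEnqueueMigrateMemObjects(queue_gpu1, 1, &buf,
                           0,          /* default flags: migrate to queue_gpu1's device */
                           0, NULL, NULL);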

COPY_HOST_PTR - When you want to retain control of the host pointer after buffer creation, go for this flag. It is up to the run-time to decide where to copy.
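
A sketch of what that means in practice (hypothetical names; assumes ctx, err, and a size n): the runtime takes a snapshot of the host data at creation time, so the host pointer remains yours and can be reused or freed straight away:

float *staging = (float *)malloc(n * sizeof(float));
/* ... fill staging ... */
cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(float), staging, &err);
free(staging);   /* safe: the runtime already made its own copy */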

ALLOC_HOST_PTR -- This is a hint to the OpenCL run-time to allocate the buffer from "pinned" system memory (which is probably also physically contiguous). This is helpful for DMAing data back and forth. Too much pinning will hit your swap subsystem and hence hurt virtual-memory performance. (Pinning means the OS cannot swap the page out to disk, so the page stays physically resident in RAM and the GPU card can DMA at any time without wondering whether the OS has swapped the page out.)

CPU reads from this buffer are terribly slow (it is uncached). However, writes are combined and hence run at system-bus bandwidth.

With a VM-enabled AMD card/driver (this is an AMD-specific concept), OpenCL kernels can directly access this pinned host memory through pointers. This overlaps kernel execution with PCIe transactions. But GPUs are too fast for PCIe, and most of the time the kernel will stall on memory accesses, so this is not always such a wonderful thing.

Accessing memory like this is called "zero-copy" because the buffer physically stays in host RAM and devices access it directly (without making copies of their own).
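
A rough sketch of the AHP pattern (hypothetical names; assumes ctx, queue, kernel, err, and a size n): map the pinned buffer, write through the pointer (writes are combined; avoid CPU reads from it), unmap, and pass it to the kernel. With VM, the kernel then reads it straight over PCIe:

cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                               n * sizeof(float), NULL, &err);
float *wp = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                        0, n * sizeof(float),
                                        0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    wp[i] = (float)i;                       /* write-only from the CPU side */
clEnqueueUnmapMemObject(queue, pinned, wp, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &pinned);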

Note that even if you don't use AHP, AMD's run-time will still use DMA for clEnqueueWriteBuffer()/clEnqueueReadBuffer() calls by pinning the buffers temporarily for the duration of the transfer. FYI.

HTH

himanshu_gautam
Grandmaster

dkissick wrote:

If I have a multidevice context with 3 discrete GPUs and use the COPY_HOST_PTR flag, to which device is the data physically copied? If it's copied to one device, will it be automatically recopied to another device if I try to use it in another device's command queue?

Yes. An OpenCL buffer belongs to a context, not to a device. I'd request you to look at Appendix A.1 (Shared OpenCL Objects) in the OpenCL specification.

If two kernels running on two devices are interested in the same buffer, the application should take care of serializing them (see Appendix A.1).

Once this synchronization is taken care of, the run-time will take care of shuttling the buffers around transparently.
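
A rough sketch of that serialization with events (hypothetical names; assumes ctx, dev0/dev1, kernel_a/kernel_b, buf, and err): two queues in the same context, one per device, sharing one buffer. The event makes kernel_b wait for kernel_a, and the flush ensures the event actually propagates between queues (as Appendix A.1 advises); the runtime then moves buf between the devices as needed:

cl_command_queue q0 = clCreateCommandQueue(ctx, dev0, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx, dev1, 0, &err);

size_t gsize = 1024;
cl_event done;

clSetKernelArg(kernel_a, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(q0, kernel_a, 1, NULL, &gsize, NULL, 0, NULL, &done);
clFlush(q0);   /* flush the queue that produced the event before waiting on it elsewhere */

clSetKernelArg(kernel_b, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(q1, kernel_b, 1, NULL, &gsize, NULL, 1, &done, NULL);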

himanshu_gautam
Grandmaster

dkissick wrote:

If I have a multi-device context with an APU (CPU and GPU) and a discrete GPU, what do these flags do? I think (someone correct me if I'm wrong) COPY and ALLOC result in zero-copy for CPU and APU devices, but does the extra discrete GPU change things, especially in light of the previous question about which device the data is copied to?

Table 4.2 of the AMD APP Programming Guide will help (as nou had indicated).
