
sajis997
Adept I

more on memory flags

Hi forum,

I have some confusion about the use of memory flags that I want to clear up.

1. CL_MEM_USE_HOST_PTR - it is only valid if the host pointer is not null. If specified, it indicates that the application wants the OpenCL implementation to use the memory referenced by the host pointer as the storage bits for the memory object. Does it mean that the memory object created on the device basically refers to the memory initialized on the host, that the memory resides on the host, and that no data is copied from the host to the device?

2. CL_MEM_ALLOC_HOST_PTR - specifies that the buffer should be allocated from host-accessible memory. Does it mean that the buffer will be allocated on the device using host memory, but no data will be copied to the device?

3. CL_MEM_COPY_HOST_PTR - if specified, it indicates that the application wants the OpenCL implementation to allocate memory for the memory object and copy data from the memory referenced by the host pointer. It means that the memory object is allocated on the device and data is copied from the host to the device during the creation of the buffer.

Did I understand it right?

Regards

Sajjadul

himanshu_gautam
Grandmaster

>> 1. CL_MEM_USE_HOST_PTR - it is only valid if the host pointer is not null. If specified, it indicates that the application wants the OpenCL implementation to use the memory referenced by the host pointer as the storage bits for the memory object. Does it mean that the memory object created on the device basically refers to the memory initialized on the host, that the memory resides on the host, and that no data is copied from the host to the device?

USE_HOST_PTR is pageable memory, i.e. you can even pass memory allocated by malloc() as USE_HOST_PTR. The GPU cannot DMA to or access this memory directly, because the OS may swap these pages out of RAM at any time. So, ultimately, the data gets silently copied by the OpenCL runtime on a need basis. However, recent developments include pre-pinning support for such pointers, i.e. the runtime will pin the data in memory and use it as a DMA source/destination to optimize the copies. Pinning is a costly operation and can add delay.

Always MAP() to get the latest contents.
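
A minimal sketch of the UHP case (ctx and queue are placeholders for an already created context and queue; just an illustration, not AMD-specific):

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Wrap malloc()'d (pageable) host memory in a UHP buffer, then map it later
       so the host is guaranteed to see the latest contents. */
    void uhp_example(cl_context ctx, cl_command_queue queue, size_t bytes)
    {
        cl_int err;
        void *hostData = malloc(bytes);            /* pageable host allocation */
        /* ... fill hostData ... */

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    bytes, hostData, &err);

        /* ... enqueue kernels that use buf ... */

        /* Map (blocking) to get the up-to-date contents back on the host. */
        void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                     0, bytes, 0, NULL, NULL, &err);
        /* ... read through p ... */
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

        clReleaseMemObject(buf);
        free(hostData);
    }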

>> 2. CL_MEM_ALLOC_HOST_PTR - specifies that the buffer should be allocated from host-accessible memory. Does it mean that the buffer will be allocated on the device using host memory, but no data will be copied to the device?

ALLOC_HOST_PTR leaves the memory allocation to the OpenCL runtime, and the runtime will allocate PINNED memory - memory that the OS will not move to swap. GPUs can DMA to this location. Depending on whether VM is enabled in your driver/hardware (check the driver version string in clinfo and see if it contains the substring (VM)), GPU kernel pointers can be mapped to access this RAM area directly (instead of the GPU's global memory). If your kernel accesses this again and again, it will slow down your kernel by a lot.

A better way to use AHP (ALLOC_HOST_PTR) is to fill in some data and pass the buffer as the source argument to clEnqueueCopyBuffer. This makes sure that the GPU DMAs the data into its global memory. Since the memory is pinned, and possibly physically contiguous too, the DMA will be much faster. Write once, copy many times is the ideal use case. Alternatively, you can overlap PCIe DMA with CPU writes using a double-buffering technique (two AHP buffers).

>> 3. CL_MEM_COPY_HOST_PTR - if specified, it indicates that the application wants the OpenCL implementation to allocate memory for the memory object and copy data from the memory referenced by the host pointer. It means that the memory object is allocated on the device and data is copied from the host to the device during the creation of the buffer.

Yes, this is correct. Depending on the size of the buffer, the runtime might use an intermediate pinned buffer (and a double-buffering technique) to copy the data to the GPU efficiently.
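
For completeness, a minimal CHP sketch (ctx, bytes and hostData are assumed to exist already):

    /* The runtime allocates device memory and copies hostData into it at
       creation time; hostData can be reused or freed right afterwards. */
    cl_int err;
    cl_mem devBuf = clCreateBuffer(ctx,
                                   CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, hostData, &err);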

See section 4.2 of the AMD APP Programming Guide and the subsequent memory optimization sections to learn more about how AMD's runtime handles the different scenarios.

- Bruhaspati

Hi Bruhaspati,

Is it true that pinned (page-locked) memory transfers attain the highest bandwidth between the host and the device?

I think OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag, and such objects are likely to be allocated in pinned memory by the driver for best performance. At the same time, I have learned that pinned memory should not be over-used: excessive use can reduce overall system performance. I believe you also mentioned something similar about this issue in your last post.

You mentioned a better use of pinned memory, though, which is not yet clear to me. Let me explain what I took from your explanation.

1. Fill in some data - should I use clEnqueueMapBuffer() / clEnqueueMapImage() and then use the pinned pointer to fill in the data with memset() or memcpy()?

Any more thoughts?


>> 1. Fill in some data - should I use clEnqueueMapBuffer() / clEnqueueMapImage() and then use the pinned pointer to fill in the data with memset() or memcpy()?

First, create a buffer with AHP, MAP() it, and you will get a pointer to the pinned location. MEMCPY() into it (write into the buffer), then UNMAP() it. Now you have an AHP buffer in pinned memory with valid data contents.

You can now use this buffer to copy into normal device buffers (e.g. with clEnqueueCopyBuffer) - this will follow the DMA path and will be very fast.

Writing once to the AHP buffer and using it many times to copy into the device is a good use case.
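
In code, the sequence looks roughly like this (a sketch only; ctx, queue, devBuf, appData and bytes are assumed to be set up elsewhere):

    #include <CL/cl.h>
    #include <string.h>

    /* Stage data through a pinned AHP buffer and DMA it into a device buffer. */
    void upload_via_ahp(cl_context ctx, cl_command_queue queue, cl_mem devBuf,
                        const void *appData, size_t bytes)
    {
        cl_int err;

        /* 1. AHP buffer: the runtime allocates pinned host memory. */
        cl_mem ahpBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                       bytes, NULL, &err);

        /* 2. MAP to get a pointer into the pinned region, write into it, UNMAP. */
        void *pinned = clEnqueueMapBuffer(queue, ahpBuf, CL_TRUE, CL_MAP_WRITE,
                                          0, bytes, 0, NULL, NULL, &err);
        memcpy(pinned, appData, bytes);
        clEnqueueUnmapMemObject(queue, ahpBuf, pinned, 0, NULL, NULL);

        /* 3. DMA the pinned contents into an ordinary device buffer.
           The same ahpBuf can be refilled and reused many times. */
        clEnqueueCopyBuffer(queue, ahpBuf, devBuf, 0, 0, bytes, 0, NULL, NULL);
    }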

Don't use an AHP buffer as an argument to kernels, because on hardware supporting VM the kernel will access this buffer via PCIe, and that is dead slow compared to buffers residing inside the GPU (roughly 8 GB/s versus 200 GB/s).

You can consider creating many such AHP buffers and using them as double buffers -- so that you can MEMCPY() into one AHP buffer while the DMA to the device is happening from another AHP buffer. Double buffering still contends on RAM (one access from the CPU side for the MEMCPY and another from the device side for the DMA). So how effective double buffering is -- is a question that only experiments can answer.
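
A rough skeleton of that double-buffering idea (only a sketch, under assumptions: two in-order command queues, one per staging buffer, so that the blocking map of one buffer does not wait behind the in-flight copy of the other; all names here are hypothetical):

    #include <CL/cl.h>
    #include <string.h>

    /* Double-buffered upload: memcpy into one pinned AHP staging buffer while
       the other one may still be DMA-ing its previous chunk to the device. */
    void upload_double_buffered(cl_context ctx, cl_command_queue q[2], cl_mem devBuf,
                                const char *src, size_t total, size_t chunk)
    {
        cl_int err;
        cl_mem stage[2];
        int i, cur = 0;
        size_t off;

        for (i = 0; i < 2; ++i)
            stage[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                      chunk, NULL, &err);

        for (off = 0; off < total; off += chunk, cur ^= 1) {
            size_t n = (total - off < chunk) ? (total - off) : chunk;

            /* The blocking map waits only for the last copy issued on q[cur],
               i.e. the previous DMA out of stage[cur]; the other buffer's DMA
               (on q[cur ^ 1]) can still be in flight while we memcpy here. */
            void *p = clEnqueueMapBuffer(q[cur], stage[cur], CL_TRUE, CL_MAP_WRITE,
                                         0, n, 0, NULL, NULL, &err);
            memcpy(p, src + off, n);
            clEnqueueUnmapMemObject(q[cur], stage[cur], p, 0, NULL, NULL);

            /* DMA this chunk into its slice of the device buffer. */
            clEnqueueCopyBuffer(q[cur], stage[cur], devBuf, 0, off, n, 0, NULL, NULL);
            clFlush(q[cur]);
        }

        clFinish(q[0]);
        clFinish(q[1]);
        clReleaseMemObject(stage[0]);
        clReleaseMemObject(stage[1]);
    }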

You can also consider pre-pinned transfers of UHP (USE_HOST_PTR) buffers as a reasonable alternative. Just make sure your UHP host buffer is nicely aligned (4096 is a good number, though 64/128/256 should also work). The runtime takes care of the pre-pinning... so you don't need to do anything on your part. When you use UHP pointers as kernel arguments, the runtime will pre-pin the data and copy it to the device transparently... Always MAP() to get the most recent contents of the buffer (because the buffer may actually be residing on the device). Section 4.2 of the programming guide explains a lot of this. Please read it.
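
A sketch of that pre-pinned UHP path (posix_memalign is just one way to get the 4096-byte alignment; whether the transfer actually goes pre-pinned is up to the runtime):

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Page-aligned application memory wrapped as a UHP buffer, then copied to
       the device with clEnqueueCopyBuffer (the runtime may pre-pin it for DMA). */
    void upload_via_uhp(cl_context ctx, cl_command_queue queue,
                        cl_mem devBuf, size_t bytes)
    {
        cl_int err;
        void *host = NULL;

        posix_memalign(&host, 4096, bytes);   /* 4096-byte alignment as suggested */
        /* ... fill host with application data ... */

        cl_mem uhpBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                       bytes, host, &err);

        clEnqueueCopyBuffer(queue, uhpBuf, devBuf, 0, 0, bytes, 0, NULL, NULL);
        clFinish(queue);

        clReleaseMemObject(uhpBuf);
        free(host);
    }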

- Bruhaspati

UHP pre-pinning is very specific to AMD's OpenCL runtime. It is an implementation detail.

It can be totally different on a different platform.

AHP is usually interpreted as pinned memory on the host side - for PCIe DMA - in other implementations as well.

If you are looking for portability -- you may want to consider these facts. FYI

- Bruha...


So it means that for every buffer I want to create, it is a good idea to allocate a pinned buffer along with the main buffer for faster data transfer. Let's try the following scenario:

1. I have allocated a normal device buffer and a pinned device buffer:

    cl_mem buffer;       // the normal device buffer

    cl_mem pinnedBuffer; // with the CL_MEM_ALLOC_HOST_PTR

2. I want to copy data from the host to the device. I could have used clEnqueueWriteBuffer() directly, but for a faster transfer I shall follow this procedure instead:

2.1 I shall map the pinned buffer.

2.2 Copy the host content into the pinned buffer using memcpy().

2.3 Unmap the pinned buffer. Do I need to call clFinish() after unmapping?

2.4 Copy from the pinned buffer to the normal buffer using clEnqueueCopyBuffer().

I think I have just re-iterated what you mentioned in your last post. Did I understand your previous post properly?

Thanks

Sajjadul


Yes, this is right, and it will work fine on any platform. clFinish() is not required; the unmap() suffices. Make sure the MAP is blocking (clEnqueueMapBuffer takes a blocking_map argument; unmap has no such flag, but with the default in-order queue, later commands will not start before the unmap has completed).

Alternatively, on the AMD platform, you can use a UHP (USE_HOST_PTR) buffer, which is pinned at buffer creation time. Check the pre-pinned support coverage in the AMD APP Programming Guide and read the relevant sections. You can use such a buffer to copy into a device buffer using clEnqueueCopyBuffer(). This is the ideal case, where there is only DMA. The UHP pointer must be aligned to 256 bytes (4096 would be ideal). However, this is AMD runtime specific.


Using UHP/AHP as kernel arguments will result in disastrous performance. They are better used to copy application buffers onto device buffers to achieve peak PCIe interconnect bandwidth. They will also help overlap DMA with kernel execution if used properly.

Depending on the size of the buffer, the RT will internally either pin and copy via DMA, or use a double-buffering technique as described above...

Check the programming guide. They have documented what they do for sizes below 32 MB, above that, etc.

NOTE: Pinning can be costly, depending on the size of the buffer.

What I would recommend is to try both methods for a variety of sizes, say 1 MB to 128 MB, and then plot a graph.

Try two different platforms say AMD and NVIDIA.

Then you will know what works well on the AMD platform, what works well on both platforms, etc.


Thanks for the hint.

I believe that the rest of the scenarios will also follow the same path. If I want to read data back from the device to the host, then I do the following (sketched in code after the list):

1. Copy the contents from the normal device buffer to the pinned device buffer.

2. map the pinned device buffer with the reading flag.

3. copy the mapped host pointer to the normal host pointer.

4. unmap the pinned device buffer.
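
In code, I imagine it would look roughly like this (queue, bytes and a destination host pointer hostDst are placeholders; buffer and pinnedBuffer are the ones declared above):

    /* 1. Device buffer -> pinned buffer. */
    clEnqueueCopyBuffer(queue, buffer, pinnedBuffer, 0, 0, bytes, 0, NULL, NULL);

    /* 2. Map the pinned buffer for reading (blocking, so on an in-order queue
          the copy above has already finished). */
    cl_int err;
    void *p = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE, CL_MAP_READ,
                                 0, bytes, 0, NULL, NULL, &err);

    /* 3. Copy into the normal host pointer. */
    memcpy(hostDst, p, bytes);

    /* 4. Unmap the pinned buffer. */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, p, 0, NULL, NULL);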


I guess step 3 is not required. Once you have the data in pinned memory, your CPU can read it at maximum speed anyhow. Although I believe map/unmap will still be faster than clEnqueueReadBuffer (as that would need another pointer to copy the data to).


>> 1. Copy the contents from the normal device buffer to the pinned device buffer.

Using clEnqueueCopyBuffer(). Fine.

>> 2. map the pinned device buffer with the reading flag.

Fine

>> 3. copy the mapped host pointer to the normal host pointer.

Fine

>> 4. unmap the pinned device buffer.

Fine

There are things to note here:

1) As Himanshu said, you can read directly from pinned memory. But this is UNCACHED and SLOW. If you have to read the data repeatedly, you are better off doing the "copy", i.e. step 3.

2) If your kernel writes into the output buffer only once per work-item (in a coalesced fashion) -- then you can directly use the AHP buffer as a kernel argument. You must have allocated the AHP buffer with the WRITE_ONLY flag.

When all of this is done on GPUs supporting VM (most recent GPUs do), the kernel will write directly onto system memory via the PCIe bus. This way you can avoid STEP 1, wherein you copy from the device to pinned memory. The write happens as the kernel runs.

NOTE: My assumption about coalesced writes to the output buffer is based on "common sense" reasoning. You really don't want multiple trips across a bus (be it the GPU bus, the system bus, or the PCIe bus...).
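
A sketch of that direct-output pattern (the kernel, its argument index, and the sizes are hypothetical; it assumes the single-write, coalesced pattern described above):

    /* Output buffer in pinned host memory; each work-item writes its slot once. */
    cl_int err;
    cl_mem outBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                   bytes, NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &outBuf);

    size_t globalSize = nItems;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

    /* Map afterwards to read the results on the host - no separate
       device-to-host copy step is needed. */
    void *out = clEnqueueMapBuffer(queue, outBuf, CL_TRUE, CL_MAP_READ,
                                   0, bytes, 0, NULL, NULL, &err);
    /* ... read out ... */
    clEnqueueUnmapMemObject(queue, outBuf, out, 0, NULL, NULL);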

3) Create a cl_mem object using UHP (on your application memory) without any other special flags. Just do a clEnqueueCopyBuffer() from the device buffer to this buffer. You are done; you don't need the 4 steps. And this will go via DMA (if my understanding of the APP Programming Guide is correct).
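
That variant would look something like this (appData is the application's own allocation of the right size; ctx, queue, devBuf and bytes are assumed; just a sketch):

    /* Wrap the application's own memory in a UHP buffer (default READ_WRITE). */
    cl_int err;
    cl_mem appBuf = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, bytes, appData, &err);

    /* One clEnqueueCopyBuffer from the device buffer straight into it. */
    clEnqueueCopyBuffer(queue, devBuf, appBuf, 0, 0, bytes, 0, NULL, NULL);

    /* Map/unmap so that appData is guaranteed to hold the latest contents. */
    void *p = clEnqueueMapBuffer(queue, appBuf, CL_TRUE, CL_MAP_READ,
                                 0, bytes, 0, NULL, NULL, &err);
    clEnqueueUnmapMemObject(queue, appBuf, p, 0, NULL, NULL);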

4) A BAD Way to do this:

Create the output buffer with the WRITE_ONLY | AMD_PERSISTENT flags. This buffer will reside inside the GPU. MAP() it and you will get a pointer that points right into the GPU device (going through the PCIe bus directly - PCI memory-mapped IO). After the kernel is over, just MEMCPY() the results over to the application buffer and then UNMAP it.

This is against the general recommendation for the AMD_PERSISTENT flag, because READs from PERSISTENT memory are SLOW; it is usually used for creating INPUTs to kernels (through streaming writes over PCIe).

Experiment and find out -- and please let us know what worked for you!

Thanks,

- Bruhaspati
