
kholdstare
Journeyman III

Memory considerations when enqueueing a long sequence of kernels and reads

Hi! I have posted this question on Stack Overflow, and I thought I would post it here too, since I am using AMD's SDK for OpenCL development and the answer could be implementation-defined.

http://stackoverflow.com/questions/14351261/memory-considerations-when-enqueing-a-long-sequence-of-k...

You can read the full question above, but I will summarize here. Given a pipeline of kernel operations like:

data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc. 

I need all the intermediate results to be copied back to the host as well. I want to make everything as asynchronous as possible by specifying only the minimal event dependencies (so reads depend only on the previous kernel execution, and kernels don't wait for the reads).
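Roughly, the host-side enqueue pattern I have in mind looks like the sketch below (OpenCL 1.x C API; the kernel and buffer names are placeholders, kernel1's input argument is assumed to be set already, and an out-of-order queue or multiple queues are assumed so that the wait lists are the only ordering constraints):

#include <CL/cl.h>

void enqueue_chain(cl_context ctx, cl_command_queue queue,
                   cl_kernel kernel1, cl_kernel kernel2,
                   float* host1, float* host2, size_t n)
{
    cl_int err = CL_SUCCESS;
    const size_t bytes = n * sizeof(float);

    cl_mem data1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    cl_mem data2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    cl_event k1_done, k2_done, read1_done, read2_done;

    /* kernel1: data -> data1 */
    clSetKernelArg(kernel1, 1, sizeof(cl_mem), &data1);
    clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &n, NULL, 0, NULL, &k1_done);

    /* non-blocking read of data1 depends only on kernel1... */
    clEnqueueReadBuffer(queue, data1, CL_FALSE, 0, bytes, host1, 1, &k1_done, &read1_done);

    /* ...and kernel2 (data1 -> data2) also depends only on kernel1, not on the read */
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &data1);
    clSetKernelArg(kernel2, 1, sizeof(cl_mem), &data2);
    clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &n, NULL, 1, &k1_done, &k2_done);

    clEnqueueReadBuffer(queue, data2, CL_FALSE, 0, bytes, host2, 1, &k2_done, &read2_done);
    /* kernel3 -> data3 etc. follow the same pattern; events and cl_mem objects
       are released once the whole chain has completed */
}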

I have a few questions about managing the memory objects:

  • Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
  • Importantly, how does OpenCL handle the case where the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which will fit in memory), but what if 4 or 5 of these buffers together exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes, and how does this affect the reads/DMAs?

So the general question is, how do large task trees interact with large memory objects?

I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.

Thank you.

himanshu_gautam
Grandmaster

If your sum total is greater than the GPU memory, you should probably allocate, use, and release (clReleaseMemObject) buffers as you go.
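Something like this, one stage at a time (a rough sketch only, with a blocking read for clarity; the context, queue, and kernels are assumed to exist already):

#include <CL/cl.h>

/* Runs one pipeline stage: creates this stage's output buffer, launches the
   kernel, reads the result back, and releases the previous stage's buffer so
   that at most two stages' buffers are alive on the device at any time. */
cl_mem run_stage(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                 cl_mem input, float* host_out, size_t n)
{
    cl_int err = CL_SUCCESS;
    const size_t bytes = n * sizeof(float);

    cl_mem output = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* blocking read so the intermediate result is on the host before we move on */
    clEnqueueReadBuffer(queue, output, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);

    /* no later stage needs the input any more, so release it now */
    clReleaseMemObject(input);
    return output;
}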

The AMD APP Programming Guide says that buffers are "physically" allocated only when a kernel that references the cl_mem object is launched, and not before (lazy allocation). However, the implementation could merely "reserve" rather than allocate, in which case you really cannot have these phantom buffers that exceed the device limit.

Moreover, every OpenCL device has a maximum size for a single allocation; check CL_DEVICE_MAX_MEM_ALLOC_SIZE.

For GPUs, the minimum value of this property is typically 1/4 of the total global memory size.
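For example (a sketch; the device handle is assumed to be valid already):

#include <CL/cl.h>
#include <stdio.h>

void print_alloc_limits(cl_device_id device)
{
    cl_ulong max_alloc = 0, global_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    printf("max single allocation: %llu bytes of %llu bytes global memory\n",
           (unsigned long long)max_alloc, (unsigned long long)global_mem);
}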

One more thing to note is that a buffer is owned by the "context", not by a single device. So when you allocate a buffer, we can hope that the OpenCL runtime checks the max allocation size of each of the context's constituent devices and fails any allocation that cannot be met in the future.

Thanks for your reply!

I am aware of the CL_DEVICE_MAX_MEM_ALLOC_SIZE limitation, and I make sure not to exceed it on individual buffers.

Having to make sure that all memory objects together don't exceed a total size throws a wrench in the works for asynchronous task trees. It basically makes them unusable, since you always have to keep track of what computations are happening and allocate accordingly, making everything essentially synchronous. This creates a lot of unnecessary overhead and synchronization.

Any other suggestions?


You can check out execution-transfer overlap. You can execute a kernel and, at the same time, prepare the buffers for the next kernel call. This will add some asynchronous behavior to your case if you use the events intelligently.
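A rough sketch of the idea with two in-order queues on the same device (the names are placeholders and the kernel arguments are assumed to be set elsewhere):

#include <CL/cl.h>

void overlap_stage(cl_command_queue exec_queue, cl_command_queue copy_queue,
                   cl_kernel kernelN, cl_kernel kernelN1,
                   cl_mem inputN1, const float* hostN1, size_t n)
{
    const size_t bytes = n * sizeof(float);
    cl_event kernelN_done, write_done, deps[2];

    /* kernel N runs on the execution queue... */
    clEnqueueNDRangeKernel(exec_queue, kernelN, 1, NULL, &n, NULL, 0, NULL, &kernelN_done);

    /* ...while the input for kernel N+1 is uploaded on the copy queue */
    clEnqueueWriteBuffer(copy_queue, inputN1, CL_FALSE, 0, bytes, hostN1, 0, NULL, &write_done);

    /* kernel N+1 waits for both the previous kernel and the transfer */
    deps[0] = kernelN_done;
    deps[1] = write_done;
    clEnqueueNDRangeKernel(exec_queue, kernelN1, 1, NULL, &n, NULL, 2, deps, NULL);

    clFlush(copy_queue);  /* make sure the transfer is actually submitted */
    clFlush(exec_queue);
}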


You can look at smart pointers - the way "Bolt" handles cl_mem objects.

The moment the reference is lost or goes out of scope, the pointer class makes sure the cl_mem object is released.

This way, you can just concentrate on the code without worrying about when to release the memory object.

In fact, since Bolt is available as a template library, you can check out their smart pointer class source (if available).

You may have to scour through some files to figure it out.

This can probably help.
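As a rough illustration of the idea (this is not Bolt's actual class, just the same pattern expressed with std::shared_ptr and clReleaseMemObject as the deleter):

#include <CL/cl.h>
#include <memory>
#include <type_traits>

/* reference-counted handle to a cl_mem; releases the buffer automatically */
using mem_ptr = std::shared_ptr<std::remove_pointer<cl_mem>::type>;

inline mem_ptr make_buffer(cl_context ctx, cl_mem_flags flags, size_t bytes)
{
    cl_int err = CL_SUCCESS;
    cl_mem raw = clCreateBuffer(ctx, flags, bytes, NULL, &err);
    /* clReleaseMemObject runs when the last copy of the handle goes away */
    return mem_ptr(raw, clReleaseMemObject);
}

/* usage:
     mem_ptr data1 = make_buffer(ctx, CL_MEM_READ_WRITE, bytes);
     cl_mem raw = data1.get();
     clSetKernelArg(kernel, 1, sizeof(cl_mem), &raw);                        */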

Alternatively, you can multi-thread your application (if your logic allows) and let each thread carry out an independent pipeline. You can write a small OpenCL memory manager for your application, with the ability to put a thread to sleep when memory is not available (and wake it up when it becomes available). This way multiple threads can go easy on the memory pressure. These are just some random thoughts. Hope this makes sense. Please use your judgement.
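A minimal sketch of such a memory manager, assuming the application tracks the byte sizes it passes to clCreateBuffer and each pipeline thread acquires from the budget before allocating:

#include <condition_variable>
#include <cstddef>
#include <mutex>

/* Tracks a byte budget for device allocations. Threads block in acquire()
   until enough budget is free, and release() wakes them up again. */
class DeviceMemBudget {
public:
    explicit DeviceMemBudget(std::size_t total_bytes) : free_(total_bytes) {}

    void acquire(std::size_t bytes) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return free_ >= bytes; });
        free_ -= bytes;
    }

    void release(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(m_);
            free_ += bytes;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t free_;
};

/* per pipeline thread: budget.acquire(bytes); create/use/release the cl_mem
   buffers for that stage; then budget.release(bytes) so others can proceed */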

nou
Exemplar

If you exceed the device memory size with your buffers you will get CL_OUT_OF_RESOURCES or a similar error, at least on the AMD implementation. You can consider migrating memory objects off the device to free device memory. Also, enqueuing a kernel with a buffer should increase its reference count, so you should be able to release the mem object right after the enqueue. I am not absolutely sure; you should check it.
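A sketch of that idea (assuming the runtime does retain the buffer for pending commands, which, as noted, is worth verifying):

#include <CL/cl.h>

void enqueue_and_release(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernel, size_t n)
{
    cl_int err = CL_SUCCESS;
    cl_mem tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &tmp);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* drop the host's reference immediately; the buffer is actually destroyed
       only after the enqueued work that uses it has finished */
    clReleaseMemObject(tmp);
}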

himanshu_gautam
Grandmaster

Hi Kholdstare,

Are your doubts clear now? If needed, we can discuss further to resolve your issues.

Thanks,
