Hi! I have posted this question on Stack Overflow, and I thought I would post it here too, since I am using AMD's SDK for OpenCL development, and the solution could be implementation-defined.
You can read the full question above, but I will summarize here. Given a pipeline of kernel operations like:
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to be copied back to the host as well. I want to make everything as asynchronous as possible by specifying the minimal event dependencies (so each read depends only on the kernel execution that produced its buffer, and kernels don't wait on reads).
I have a few questions about managing the memory objects:
So the general question is, how do large task trees interact with large memory objects?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
If your total allocation is greater than GPU memory, you should probably allocate, use, and release (clReleaseMemObject) as you go.
The APP Programming Guide (from AMD) says that buffers are "physically" allocated only when a kernel that references the cl_mem object is launched, and not before (lazy allocation). But the implementation could just "reserve" rather than allocate, in which case you really cannot have these phantom buffers that exceed the device limitation.
Moreover, every OpenCL device has a maximum size for a single allocation; check CL_DEVICE_MAX_MEM_ALLOC_SIZE.
Usually for GPUs, the minimum required value of this property is 1/4 of the total global memory size.
One more thing to note is that a buffer is owned by the "context". So when you allocate a buffer, we can hope that the OpenCL runtime will check the max_alloc_size of each of the context's constituent devices and fail any allocation that cannot be satisfied.
Thanks for your reply!
I am aware of the CL_DEVICE_MAX_MEM_ALLOC_SIZE limitation, and I make sure not to exceed it on individual buffers.
Having to make sure that all memory objects together stay under a total size throws a wrench in the works when it comes to asynchronous task trees. It basically makes them unusable, since you always have to keep track of which computations are in flight and allocate accordingly, making everything essentially synchronous. This creates a lot of unnecessary overhead and synchronization.
Any other suggestions?
You can check out execution-transfer overlap: you can execute a kernel and, at the same time, prepare the buffers for the next kernel call. This will add some asynchronous behavior to your case if you use events intelligently.
You can look at smart pointers - the way how "Bolt" handles cl_mem objects.
The moment the reference is lost or goes out of scope, the pointer class will make sure the cl_mem object is released.
This way, you can just concentrate on the code without worrying when to release the memory object.
In fact, since Bolt is available as a template library, you can check out their smart pointer class source (if available).
You may have to scour through some files to figure it out.
This can probably help.
Alternatively, you can multi-thread your application (if your logic allows) and let each thread carry out some independent pipeline. You could write a small OpenCL memory manager for your application, with the ability to put a thread to sleep when memory is not available (and wake it up when it becomes available). That way, multiple threads can go easy on the memory pressure.

These are just some random thoughts. Hope this makes sense. Please use your judgement.
If you exceed the device memory size with your buffers, you will get CL_OUT_OF_RESOURCES or a similar error, at least on the AMD implementation. You can consider migrating memory objects off the device to free device memory. Also, if you enqueue a kernel with a buffer, it should increase the buffer's reference count, so you should be able to release the mem object just after the enqueue. I am not absolutely sure - you should check it.
Hi Kholdstare,
Are your doubts clear now? If needed, we can discuss further to resolve your issues.
Thanks,