
Journeyman III

How to release Physical Memory?

Executing the kernel in a loop causes physical memory usage to grow steadily.

After calling clSetKernelArg(), I need to execute the same kernel in a loop, but physical memory usage keeps growing. I think it is a memory leak; can anyone tell me how to resolve it? Thanks!

for (int i = 0; i < 10000000; i++) {
    cl_int errs = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_size, NULL, 0, NULL, NULL);
}

6 Replies
Journeyman III

You're queueing up 10 million items in the queue to be performed. Each item enqueued has to record the command type, the kernel used, the kernel arguments at the time the kernel is enqueued, the work dimensions, global_work_offset, global_work_size, local_work_size, and all the events in the wait list plus a pointer to the returned event, so this is going to take up space.

At a guess, each item enqueued will take at least (sizeof(kernel_id) + sizeof(kernel_args) + sizeof(work_dim) + 3*work_dim*sizeof(size_t) + sizeof(num_events) + num_events*sizeof(event_id) + sizeof(event_id)) bytes.

This is likely to be >25 bytes per item enqueued for 1D kernels (though it could be less if default values aren't stored in full), and much more for 3D kernels that also have events to be waited on and lots of arguments. If none of the enqueued items get performed before you end the loop, you will likely be using >250 MB of memory, perhaps a lot more.

What kind of algorithm are you using that applies the same kernel 10 million times? Workarounds could be to enqueue fewer items, e.g. by waiting for commands to finish before queueing up more. Or perhaps move the loop inside the kernel?


Thank you very much for the reply. The reason I want to apply the same kernel so many times is back-propagation training of a neural network. I want to use the GPU because it is faster than the CPU, and the training algorithm needs a great many iterations to finish.

In your opinion, what can I do to avoid wasting space? Would it work to take the kernel out of the command queue after each iteration?


Is there any reason you can't just wait for the kernel to finish after each execution, rather than potentially having 10 million items in the queue?


Calling clFinish() after each kernel invocation can be inefficient. A better option is to call clFinish() after each batch of kernel invocations.

Journeyman III

Or keep track of cl_events for every x kernel executions; then, once you've added another x items, do a "soft finish" wait (i.e., wait for all the events up to a given point to finish, but don't empty the queue entirely). That way the pipeline never has to be completely empty.


Try adding a clFlush() after a certain number of kernels have been enqueued.