We encounter memory leak problems and the app profiler can't help. (we simply use windows resource monitor.)
A test code sample is given and the code structure is simple:
cl_kernel memset;
cl_command_queue Q1, Q2;
for(;;) {
cl_event Q1_evt[51];
cl_event Q2_evt[50];
clEnqueueNDRangeKernel(Q1, memset, ..., 0, NULL, &Q1_evt[0]);
for (int iter = 0; iter < 50; iter++) {
clEnqueueNDRangeKernel(Q2, memset, ..., 1, &Q1_evt[iter], &Q2_evt[iter]);
clEnqueueNDRangeKernel(Q1, memset, ..., 1, &Q2_evt[iter], &Q1_evt[iter+1]);
}
for (int i=0; i<50; i++) {
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
}
clReleaseEvent(Q1_evt[50]);
clFinish(Q1);
clFinish(Q2);
}
it is a twisted waiting structure that Q1 waits for Q2 and vise versa.
i guess clReleaseEvent() may not function right, so i create a thread here: http://devgurus.amd.com/message/1283971
my environment:
HD 7970, APP SDK 2.7, Catalyst 12.6 Beta (8.98-120522a-139735E-ATI), win7 64bit
Hi levyfan,
You used for(;;){}, why? And I think the inner loop is ok.
You can change it to something like " for(int k=0; k<1000; k++){} ".
In win7 resource monitor, we find that the program memory usage grows quickly and never goes down.
for(;;) {
cl_event Q1_evt[51];
cl_event Q2_evt[50];
clEnqueueNDRangeKernel(queue, memset, 1, NULL, &n, NULL, 0, NULL, &Q1_evt[0]);
error = clFlush(queue);
if (error != CL_SUCCESS)
{
printf("clFlush failed err = %d\n", error);
return(error);
}
cl_int eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
error = clGetEventInfo(
Q1_evt[0],
CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int),
&eventStatus,
NULL);
if (error != CL_SUCCESS)
{
printf("clGetEventInfo failed err = %d\n", error);
return(error);
}
}
for (int iter = 0; iter < 50; iter++) {
clEnqueueNDRangeKernel(secondaryQueue, memset, 1, NULL, &n, NULL, NULL, NULL, &Q2_evt[iter]);
eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
error = clGetEventInfo(
Q2_evt[iter],
CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int),
&eventStatus,
NULL);
if (error != CL_SUCCESS)
{
printf("clGetEventInfo failed err = %d\n", error);
return(error);
}
}
clEnqueueNDRangeKernel(queue, memset, 1, NULL, &n, NULL, 0, NULL, &Q1_evt[iter+1]);
eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
error = clGetEventInfo(
Q1_evt[iter+1],
CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int),
&eventStatus,
NULL);
if (error != CL_SUCCESS)
{
printf("clGetEventInfo failed err = %d\n", error);
return(error);
}
}
}
for (int i=0; i<50; i++) {
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
}
clReleaseEvent(Q1_evt[50]);
clFinish(queue);
clFinish(secondaryQueue);
}
The memory usage is steady.
i see...
the idea of the original code is to flush 51 commands to Q1 and 50 commands to Q2, and then wait for them to be completed.
and in your code, the host execute a kernel and wait for it, and then execute another and wait again.
i guess there could be some leaks when different Queues are synced by opencl events.
so i simplify my code and the memory still leaks:
for(;;) {
cl_event Q1_evt, Q1_evt_leak;
cl_event Q2_evt;
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
}
and it's quite weird that the memory is stable if the 3rd clEnqueueNDRangeKernel with Q1_evt_leak is not called as follows:
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
anyone please?
I'm not sure about this: Maybe after you executed
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt); // checkpoint1
and then, you begin run
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
but maybe the checkpoint1 has not finished yet. Just speculating, you can also erase the loop to get the profile, whether the release operation is executed after the kernel finished.
it has nothing to do with the checkpoint execution. even if i call clFinish before clReleaseEvent, the memory still leaks:
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
for(;;) {
cl_event Q1_evt, Q1_evt_leak;
cl_event Q2_evt;
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clFinish(Q1);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clFinish(Q2);
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);
clFinish(Q1);
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
}
Your code is ok, but that's not what we want.
We provide a test case to demonstrate a possible memory leak bug, and the real code is far more complicated. The main idea is that the host batches lots of gpu kernels to different queues and those kernels are well synced by events. At the same time, the host cpus do lots of other heavy works.
In my test code, the reference count of cl_event Q1_evt is 1 at the end of the loop. That's why there are memory leaks.
So, if we call clReleaseEvent(Q1_evt) twice, the leak is prevented. But this is unacceptable if we modify the real code like that.
Confused! Last time, I told you use while(eventStatus != CL_COMPLETE), and you said it would failed when using different Queues. But I think it would wok if they have the same context. I'm not sure about this. You can have a try.
hi wenju
you do not get what i meant.
in your code, the host execute a kernel and wait for it, and then execute another and wait again. but in our project, we do not want the host to wait again and again.
so, what you suggest would not help us. your solution just hide the memory leak bug rather than reveal it and fix it.
BTW, what i said "i guess there could be some leaks when different Queues are synced by opencl events.", i meant that the memory leak problem may be related to multiple queues operation in one context.
I see,
cl_event Q1_evt, Q1_evt_leak;
cl_event Q2_evt;
cl_int eventStatus;
for(;;) {
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);
eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
error = clGetEventInfo(
Q1_evt_leak,
CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int),
&eventStatus,
NULL);
if (error != CL_SUCCESS)
{
printf("clGetEventInfo failed err = %d\n", error);
return(error);
}
}
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
}
no cpu wait, no memory leak, kernels synced by event.
Hi Wenju
There is a dead-lock/loop in your code. You can detect it by adding printf like this:
for(int i; ; i++) {
printf("%d\n", i);
...
...
...
printf("%d\n", i);
}
Because of this dead-lock/loop, you will not see the memory grows up. But in fact, i suppose that the cl_event is never set to CL_COMPLETE and the host is just looping in your while(eventStatus != CL_COMPLETE).
So, your code above reveals another possible bug.
Nice Thanx to share this.
Anyone?
I think the problem is that you've called clReleaseEvent().
You don't need to call it explicitly in that case, it just messes up the reference counter in the event.
When the work items finish, the reference counter in the event gets decremented automatically.
Your code starts the next iteration before the tasks got finished in the previous one.