Archives Discussions

levyfan · ‎09-17-2012

We encounter memory leak problems and the app profiler can't help. (we simply use windows resource monitor.)

A test code sample is given and the code structure is simple:

cl_kernel memset;

cl_command_queue Q1, Q2;

for(;;) {

cl_event Q1_evt[51];

cl_event Q2_evt[50];

clEnqueueNDRangeKernel(Q1, memset, ..., 0, NULL, &Q1_evt[0]);

for (int iter = 0; iter < 50; iter++) {

clEnqueueNDRangeKernel(Q2, memset, ..., 1, &Q1_evt[iter], &Q2_evt[iter]);

clEnqueueNDRangeKernel(Q1, memset, ..., 1, &Q2_evt[iter], &Q1_evt[iter+1]);

}

for (int i=0; i<50; i++) {

clReleaseEvent(Q1_evt);

clReleaseEvent(Q2_evt);

}

clReleaseEvent(Q1_evt[50]);

clFinish(Q1);

clFinish(Q2);

}

it is a twisted waiting structure that Q1 waits for Q2 and vise versa.

i guess clReleaseEvent() may not function right, so i create a thread here: http://devgurus.amd.com/message/1283971

my environment:

HD 7970, APP SDK 2.7, Catalyst 12.6 Beta (8.98-120522a-139735E-ATI), win7 64bit

Wenju · ‎09-18-2012

Hi levyfan,

You used for(;;){}, why? And I think the inner loop is ok.

levyfan · ‎09-18-2012

You can change it to something like " for(int k=0; k<1000; k++){} ".

In win7 resource monitor, we find that the program memory usage grows quickly and never goes down.

Wenju · ‎09-18-2012

for(;;) {
cl_event Q1_evt[51];
cl_event Q2_evt[50];

clEnqueueNDRangeKernel(queue, memset, 1, NULL, &n, NULL, 0, NULL, &Q1_evt[0]);
error = clFlush(queue);
   if (error != CL_SUCCESS)
{
   printf("clFlush failed err = %d\n", error);
   return(error);
}

cl_int eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
   error = clGetEventInfo(
                        Q1_evt[0],
                        CL_EVENT_COMMAND_EXECUTION_STATUS,
                        sizeof(cl_int),
                        &eventStatus,
                        NULL);
   if (error != CL_SUCCESS)
   {
    printf("clGetEventInfo failed err = %d\n", error);
    return(error);
   }
}

for (int iter = 0; iter < 50; iter++) {

   clEnqueueNDRangeKernel(secondaryQueue, memset, 1, NULL, &n, NULL, NULL, NULL, &Q2_evt[iter]);
   eventStatus = CL_QUEUED;
   while(eventStatus != CL_COMPLETE)
   {
    error = clGetEventInfo(
                        Q2_evt[iter],
                        CL_EVENT_COMMAND_EXECUTION_STATUS,
                        sizeof(cl_int),
                        &eventStatus,
                        NULL);
    if (error != CL_SUCCESS)
    {
     printf("clGetEventInfo failed err = %d\n", error);
     return(error);
    }
   }

   clEnqueueNDRangeKernel(queue, memset, 1, NULL, &n, NULL, 0, NULL, &Q1_evt[iter+1]);
   eventStatus = CL_QUEUED;
   while(eventStatus != CL_COMPLETE)
   {
    error = clGetEventInfo(
                        Q1_evt[iter+1],
                        CL_EVENT_COMMAND_EXECUTION_STATUS,
                        sizeof(cl_int),
                        &eventStatus,
                        NULL);
    if (error != CL_SUCCESS)
    {
     printf("clGetEventInfo failed err = %d\n", error);
     return(error);
    }
   }

}
for (int i=0; i<50; i++) {
clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
}
clReleaseEvent(Q1_evt[50]);

clFinish(queue);
clFinish(secondaryQueue);
}

The memory usage is steady.

levyfan · ‎09-18-2012

i see...

the idea of the original code is to flush 51 commands to Q1 and 50 commands to Q2, and then wait for them to be completed.

and in your code, the host execute a kernel and wait for it, and then execute another and wait again.

i guess there could be some leaks when different Queues are synced by opencl events.

levyfan · ‎09-19-2012

so i simplify my code and the memory still leaks:

for(;;) {

cl_event Q1_evt, Q1_evt_leak;

cl_event Q2_evt;

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);

clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);

clReleaseEvent(Q1_evt);

clReleaseEvent(Q2_evt);

clReleaseEvent(Q1_evt_leak);

clFinish(Q1);

clFinish(Q2);

}

and it's quite weird that the memory is stable if the 3rd clEnqueueNDRangeKernel with Q1_evt_leak is not called as follows:

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);

clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);

clReleaseEvent(Q1_evt);

clReleaseEvent(Q2_evt);

anyone please?

Wenju · ‎09-20-2012

I'm not sure about this: Maybe after you executed

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);

clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt); // checkpoint1

and then, you begin run

clReleaseEvent(Q1_evt);

clReleaseEvent(Q2_evt);

but maybe the checkpoint1 has not finished yet. Just speculating, you can also erase the loop to get the profile, whether the release operation is executed after the kernel finished.

levyfan · ‎09-20-2012

it has nothing to do with the checkpoint execution. even if i call clFinish before clReleaseEvent, the memory still leaks:

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);

clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);

clFinish(Q1);

clFinish(Q2);

clReleaseEvent(Q1_evt);

clReleaseEvent(Q2_evt);

clReleaseEvent(Q1_evt_leak);

clFinish(Q1);

clFinish(Q2);

Wenju · ‎09-20-2012

for(;;) {
cl_event Q1_evt, Q1_evt_leak;
cl_event Q2_evt;

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);
clFinish(Q1);
clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);
clFinish(Q2);
clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);
clFinish(Q1);

clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);

clFinish(Q1);
clFinish(Q2);
}

levyfan · ‎09-21-2012

Your code is ok, but that's not what we want.

We provide a test case to demonstrate a possible memory leak bug, and the real code is far more complicated. The main idea is that the host batches lots of gpu kernels to different queues and those kernels are well synced by events. At the same time, the host cpus do lots of other heavy works.

In my test code, the reference count of cl_event Q1_evt is 1 at the end of the loop. That's why there are memory leaks.

So, if we call clReleaseEvent(Q1_evt) twice, the leak is prevented. But this is unacceptable if we modify the real code like that.

Wenju · ‎09-21-2012

Confused! Last time, I told you use while(eventStatus != CL_COMPLETE), and you said it would failed when using different Queues. But I think it would wok if they have the same context. I'm not sure about this. You can have a try.

levyfan · ‎09-21-2012

hi wenju

you do not get what i meant.

in your code, the host execute a kernel and wait for it, and then execute another and wait again. but in our project, we do not want the host to wait again and again.

so, what you suggest would not help us. your solution just hide the memory leak bug rather than reveal it and fix it.

BTW, what i said "i guess there could be some leaks when different Queues are synced by opencl events.", i meant that the memory leak problem may be related to multiple queues operation in one context.

Wenju · ‎09-21-2012

I see,

cl_event Q1_evt, Q1_evt_leak;
cl_event Q2_evt;
cl_int eventStatus;
for(;;) {

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 0, NULL, &Q1_evt);

clEnqueueNDRangeKernel(Q2, memset, 1, NULL, &N, NULL, 1, &Q1_evt, &Q2_evt);

clEnqueueNDRangeKernel(Q1, memset, 1, NULL, &N, NULL, 1, &Q2_evt, &Q1_evt_leak);

   eventStatus = CL_QUEUED;
while(eventStatus != CL_COMPLETE)
{
   error = clGetEventInfo(
                        Q1_evt_leak,
                        CL_EVENT_COMMAND_EXECUTION_STATUS,
                        sizeof(cl_int),
                        &eventStatus,
                        NULL);
   if (error != CL_SUCCESS)
   {
    printf("clGetEventInfo failed err = %d\n", error);
    return(error);
   }
}

clReleaseEvent(Q1_evt);
clReleaseEvent(Q2_evt);
clReleaseEvent(Q1_evt_leak);
clFinish(Q1);
clFinish(Q2);
}

no cpu wait, no memory leak, kernels synced by event.

levyfan · ‎09-21-2012

Hi Wenju

There is a dead-lock/loop in your code. You can detect it by adding printf like this:

for(int i; ; i++) {

printf("%d\n", i);

...

printf("%d\n", i);

}

Because of this dead-lock/loop, you will not see the memory grows up. But in fact, i suppose that the cl_event is never set to CL_COMPLETE and the host is just looping in your while(eventStatus != CL_COMPLETE).

So, your code above reveals another possible bug.

peakitde · ‎09-22-2012

Nice Thanx to share this.

levyfan · ‎09-25-2012

Anyone?

bahamut · ‎03-10-2013

I think the problem is that you've called clReleaseEvent().

You don't need to call it explicitly in that case, it just messes up the reference counter in the event.

When the work items finish, the reference counter in the event gets decremented automatically.

Your code starts the next iteration before the tasks got finished in the previous one.

Archives Discussions

Possible memory leak