8 Replies Latest reply on Aug 3, 2010 10:28 PM by jeff_golds

    Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer

    psath

      I'm developing an application that relies heavily on non-blocking buffer reads and writes. I'm running it multiple times without exiting to shell for benchmarking purposes and I've noticed a slight issue.

      Whenever a read/write is enqueued in non-blocking mode, it creates an addtional reference to the memory object (as checked by clGetMemObjectInfo). This is to be expected, and a good thing. It keeps you from deallocating the memory while a read is still pending.

      However, clWaitForEvents doesn't decrement the reference counter for that object as would be expected (and as performed by the nVidia OpenCL driver). You can manually get around this by calling clReleaseMemObject on the object everytime you use clWaitForEvents, but it's more than mildly inconvenient and rather non-intuitive.

      If you're using nonblocking transers and getting apparent memory leaks, this is probably your issue, try calling clGetMemObjectInfo before your final deallocations and checking that the count is 1. If not, there's a rather good chance it'll be 1 + the number of non-blocking reads/writes to that object.

      If you've already caught this AMD, then kudos to you! If not, any chance of getting a patch into the next version of Stream?

      Cheers!

      System Info: (1) Fedora Core 12 64bit, nVidia 256.35 Development Drivers + OpenCL 1.0 CUDA, AMD Catalyst 10.5 Drivers + OpenCL 1.0 ATI-Stream-v2.1 (145), GeForce GTX 480 Radeon 5870, 4x Opteron 2216

      (2) Fedora Core 12 64bit, nVidia 256.35 Development Drivers + OpenCL 1.0 CUDA, OpenCL 1.0 ATI-Stream-v2.1 (145), 2x Quadro FX 5600, 4x Opteron 2216

       

        • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
          omkaranathan

          psath,

          Do you have a testcase which reproduces the issue?

            • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
              Illusio

              I don't think you really need a testcase. As far as I can tell it always happens. I'm on Windows 7 btw(64 bit windows, 32 bit test application), so it's likely platform independent.

              However, I've attached some code that reproduces the problem and writes out some debug info. On my system, the refCountBefore and refCountAfter variables just keep incrementing when looping around the asynch buffer read at the end.

               

              // OpenCL C tests.cpp : Defines the entry point for the console application. // #include "stdafx.h" const std::string hw("Hello World\n"); inline void checkErr(cl_int err, const char * name) { if (err != CL_SUCCESS) { std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl; exit(EXIT_FAILURE); } } int _tmain(int argc, _TCHAR* argv[]) { cl_int err; cl::vector< cl::Platform > platformList; cl::Platform::get(&platformList); checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get"); std::cerr << "Platform number is: " << platformList.size() << std::endl; std::string platformVendor; platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor); std::cerr << "Platform is by: " << platformVendor << "\n"; cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0}; cl::Context context( CL_DEVICE_TYPE_CPU, cprops, NULL, NULL, &err); checkErr(err, "Conext::Context()"); char * outH = new char[hw.length()+1]; cl::Buffer outCL( context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, hw.length()+1, outH, &err); checkErr(err, "Buffer::Buffer()"); cl::vector<cl::Device> devices; devices = context.getInfo<CL_CONTEXT_DEVICES>(); checkErr( devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0"); std::string prog = "#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable\n __constant char hw[] = \"Hello World\\n\";__kernel void hello(__global char * out){size_t tid = get_global_id(0);out[tid] = hw[tid];}"; cl::Program::Sources source( 1, std::make_pair(prog.c_str(), prog.length()+1)); cl::Program program(context, source); err = program.build(devices,""); cl::Kernel kernel(program, "hello", &err); checkErr(err, "Kernel::Kernel()"); err = kernel.setArg(0, outCL); checkErr(err, "Kernel::setArg()"); cl::CommandQueue queue(context, devices[0], 0, &err); checkErr(err, "CommandQueue::CommandQueue()"); cl::Event event; err = queue.enqueueNDRangeKernel( kernel, cl::NullRange, cl::NDRange(hw.length()+1), cl::NDRange(1, 1), NULL, &event); checkErr(err, "ComamndQueue::enqueueNDRangeKernel()"); event.wait(); for( int i=0; i<50; i++ ) { cl_uint refCountBefore; outCL.getInfo( CL_MEM_REFERENCE_COUNT, &refCountBefore ); err = queue.enqueueReadBuffer( outCL, CL_FALSE, 0, hw.length()+1, outH, NULL, &event); checkErr(err, "ComamndQueue::enqueueReadBuffer()"); event.wait(); std::cout << outH; cl_uint refCountAfter; outCL.getInfo( CL_MEM_REFERENCE_COUNT, &refCountAfter ); std::cerr << "Iteration " << i << ": RefCount before: " << refCountBefore << " RefCount after: " << refCountAfter << "\n"; } return 0; }

                • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                  Illusio

                  Roflmao! Nevermind. The reference count is held by the event. Once you release it. The counter decrements correctly. I suppose NVidia decrements it on completion of the asynch operation, which may be a bit more intuitive, but the reference count shouldn't increment wildly on ATI's implementation when events are correctly released.

                   

                    • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                      genaganna

                       

                      Originally posted by: Illusio Roflmao! Nevermind. The reference count is held by the event. Once you release it. The counter decrements correctly. I suppose NVidia decrements it on completion of the asynch operation, which may be a bit more intuitive, but the reference count shouldn't increment wildly on ATI's implementation when events are correctly released.

                       

                       



                      Illusio,

                              Thanks for giving test case.  With internal builds, this test case is working fine.  It will be available in upcoming releases.

                      This is what i am getting on my XP64 system

                       

                      Platform number is: 1 Platform is by: Advanced Micro Devices, Inc. Hello World Iteration 0: RefCount before: 1 RefCount after: 1 Hello World Iteration 1: RefCount before: 1 RefCount after: 1 Hello World Iteration 2: RefCount before: 1 RefCount after: 1 Hello World Iteration 3: RefCount before: 1 RefCount after: 1 Hello World Iteration 4: RefCount before: 1 RefCount after: 1 Hello World Iteration 5: RefCount before: 1 RefCount after: 1 Hello World Iteration 6: RefCount before: 1 RefCount after: 1 Hello World Iteration 7: RefCount before: 1 RefCount after: 1 Hello World Iteration 8: RefCount before: 1 RefCount after: 1 Hello World Iteration 9: RefCount before: 1 RefCount after: 1 Hello World Iteration 10: RefCount before: 1 RefCount after: 1 Hello World Iteration 11: RefCount before: 1 RefCount after: 1 Hello World Iteration 12: RefCount before: 1 RefCount after: 1 Hello World Iteration 13: RefCount before: 1 RefCount after: 1 Hello World Iteration 14: RefCount before: 1 RefCount after: 1 Hello World Iteration 15: RefCount before: 1 RefCount after: 1 Hello World Iteration 16: RefCount before: 1 RefCount after: 1 Hello World Iteration 17: RefCount before: 1 RefCount after: 1 Hello World Iteration 18: RefCount before: 1 RefCount after: 1 Hello World Iteration 19: RefCount before: 1 RefCount after: 1 Hello World Iteration 20: RefCount before: 1 RefCount after: 1 Hello World Iteration 21: RefCount before: 1 RefCount after: 1 Hello World Iteration 22: RefCount before: 1 RefCount after: 1 Hello World Iteration 23: RefCount before: 1 RefCount after: 1 Hello World Iteration 24: RefCount before: 1 RefCount after: 1 Hello World Iteration 25: RefCount before: 1 RefCount after: 1 Hello World Iteration 26: RefCount before: 1 RefCount after: 1 Hello World Iteration 27: RefCount before: 1 RefCount after: 1 Hello World Iteration 28: RefCount before: 1 RefCount after: 1 Hello World Iteration 29: RefCount before: 1 RefCount after: 1 Hello World Iteration 30: RefCount before: 1 RefCount after: 1 Hello World Iteration 31: RefCount before: 1 RefCount after: 1 Hello World Iteration 32: RefCount before: 1 RefCount after: 1 Hello World Iteration 33: RefCount before: 1 RefCount after: 1 Hello World Iteration 34: RefCount before: 1 RefCount after: 1 Hello World Iteration 35: RefCount before: 1 RefCount after: 1 Hello World Iteration 36: RefCount before: 1 RefCount after: 1 Hello World Iteration 37: RefCount before: 1 RefCount after: 1 Hello World Iteration 38: RefCount before: 1 RefCount after: 1 Hello World Iteration 39: RefCount before: 1 RefCount after: 1 Hello World Iteration 40: RefCount before: 1 RefCount after: 1 Hello World Iteration 41: RefCount before: 1 RefCount after: 1 Hello World Iteration 42: RefCount before: 1 RefCount after: 1 Hello World Iteration 43: RefCount before: 1 RefCount after: 1 Hello World Iteration 44: RefCount before: 1 RefCount after: 1 Hello World Iteration 45: RefCount before: 1 RefCount after: 1 Hello World Iteration 46: RefCount before: 1 RefCount after: 1 Hello World Iteration 47: RefCount before: 1 RefCount after: 1 Hello World Iteration 48: RefCount before: 1 RefCount after: 1 Hello World Iteration 49: RefCount before: 1 RefCount after: 1

                        • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                          psath

                          Sorry on the late reply, was away from internet access this weekend. Thanks for providing a testcase Illusio, it appears to execute similarly enough to generate the bug.

                          So let me get this straight, you're saying in the current release the event is holding the additional references to my memory object? Because releasing the event before releasing the memory still hasn't helped. Of course, I am reusing the event, so maybe I have to release it multiple times?

                          As long as I don't have to use code like this in the upcoming release I suppose we'll be fine >.<

                          if (writeWait != NULL) { clWaitForEvents(1, &writeWait); if (deviceIsAMD) clReleaseMemObject(DevSrc[itr% 2]); }

                            • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                              Illusio

                              I'm saying that the test code worked without any leaks on my machine when I did the moral equivalent of a clReleaseEvent() after each wait().(Sadly, I don't have the code anymore, and I don't remember exactly what changes I made. I suppose I should have posted it, but I was so embarrassed for not having spotted my bug in the first place that I thought an appology was in order and not more code.

                              Anyway, You can't reuse events. I'm not even sure what you mean when you say that. Every time you call enqueue, you'll get a new event id returned and each has to be released separately.

                              If you don't do this, you'll likely have a(small) memory leak even if the problem with the reference counters goes away in a new release.

                               

                               

                                • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                                  psath

                                  so I can't do something like this without memory leaks? I'm using the same cl_event variable multiple times without ever releasing it.

                                  //initialize OpenCL variables cl_event myEvent = NULL; for (somenumberofiterations) { //do stuff clEnqueueNDRangeKernel(..., &myEvent); //more stuff } clReleaseEvent(myEvent); //tear down the rest of the variables

                                    • Memory leaks in non-blocking clEnqueueReadBuffer and clEnqueueWriteBuffer
                                      jeff_golds

                                      You may be reusing the same variable, but not the same memory.  When you call clEnqueueNDRangeKernel() with an event pointer, the runtime returns storage to you.  So if you call clEnqueueNDRangeKernel() with the same pointer, you lose the storage returned previously.

                                      If you aren't using the event, then you don't need to provide a pointer at all, or just pass in the event pointer on the final iteration.

                                      If you really want all those events saved, then you need to keep an array of pointers to the events so that you don't lose the storage.

                                      From the OpenCL spec:

                                      event returns an event object that identifies this particular kernel execution instance. Event objects are unique and can be used to identify a particular kernel execution instance later on. If event is NULL, no event will be created for this kernel execution instance and therefore it will not be possible for the application to query or queue a wait for this particular kernel execution instance.


                                      Jeff