Let's hunt a memory leak!

Discussion created by AndreasStahl on Nov 17, 2009
Latest reply on Nov 11, 2010 by himanshu.gautam
Seems to be internal to the runtime; source code included


After spending several months searching my code for a rather serious memory leak, I tried to reduce the problem to its core. It seems to happen either when a CommandQueue or a Kernel is created, when its arguments are set, or during/after execution. I have attached a minimal C++ program demonstrating this behaviour to this message.

What it does is the following: during setup it creates a context, gets the device handles, compiles a very simple increment kernel, creates a buffer of 8 MByte, and fills it with 0.

Then it does the following 100 times: get a command queue, create a kernel from the program, set the buffer as the kernel argument, enqueue the kernel, and wait for the queue to finish. Allocation and deallocation of the queue and kernel are handled on the stack.

Afterwards the buffer, program, devices, and context are manually deallocated.

I made the program halt at four points:

  1. before setup, 
  2. before allocating, executing, deallocing the queue and kernel 100 times, 
  3. after that, and 
  4. after I manually deallocate buffer, program, context etc.

    If you look at the process's memory usage in Task Manager, the readings at the second and third halting points should be roughly equal, and the reading at the last halting point should equal the first. But it's not. Not at all, indeed! Here are my read-outs from the Windows Task Manager when run on DEVICE_CPU:

    1. 2,344 K
    2. 22,708 K
    3. 39,596 K
    4. 31,380 K

    So for 100 iterations, 3. - 2. = 16,888 K were leaked. When I increase the iteration count to 200, memory usage after kernel execution is 56,596 K, indicating a leak of 33,888 K!

    300 iterations: 50,948 K leaked

    400 iterations: 67,644 K leaked

    This indicates a leak of ~169 K per iteration.

    For iteration counts over ~500, it fails during CommandQueue() with error code -6, i.e. CL_OUT_OF_HOST_MEMORY.

    When I halve the buffer size, the numbers don't change.

    On DEVICE_GPU it leaks ~50 K per iteration.

    But maybe the problem is BKAC*, so please help me figure out whether there is something totally wrong with my memory allocation/deallocation pattern. Should I allocate the queue and kernels only once during setup? I tried this in my production code once, but as soon as I had created the CommandQueue handle the program refused to respond to input via the GUI.

    OS: Win7 x64

    RAM: 4 GByte

    Compiler: VC++ 2008

    Devices: Athlon x64 CPU (1 GB reported), Juniper GPU (5770, 128 MB reported)

    *) between keyboard and chair, i.e. me

    #include <CL/cl.hpp>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>   // memset
    #include <iostream>
    #include <string>
    #include <vector>

    // a VERY simple Kernel
    std::string kernelSource =
        "__kernel void inc(__global int* a){ a[get_global_id(0)] += 1; }";

    const int BUFFER_ELEMENT_COUNT = 1024 * 1024 * 2; // times sizeof(cl_int) equals 8 MByte

    // define these as pointers, as that's how I have to do it in my production code.
    cl::Context *context;
    std::vector<cl::Device> devices;
    cl::Program *program;
    cl::Buffer *buffer;

    int setupCl()
    {
        // create context, get devices, build program
        cl_int err;
        context = new cl::Context(CL_DEVICE_TYPE_CPU, 0, NULL, NULL, &err);
        devices = context->getInfo<CL_CONTEXT_DEVICES>();
        if (devices.empty())
            return !CL_SUCCESS;
        cl::Program::Sources source(1, std::make_pair(kernelSource.c_str(), kernelSource.size()));
        program = new cl::Program(*context, source);
        program->build(devices);

        // create and fill the test buffer
        cl_int *a = new cl_int[BUFFER_ELEMENT_COUNT];
        memset(a, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
        buffer = new cl::Buffer(*context, CL_MEM_READ_WRITE, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
        cl::CommandQueue queue(*context, devices[0], 0, &err);
        queue.enqueueWriteBuffer(*buffer, CL_TRUE, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int), a);
        queue.finish();
        delete[] a; // clear the host array
        return CL_SUCCESS;
    }

    void runKernel()
    {
        cl_int err;
        cl::Kernel kernel(*program, "inc", &err);
        err = kernel.setArg(0, *buffer);
        if (err != CL_SUCCESS) {
            std::cerr << "Kernel.setArg() Error: " << err << std::endl;
            return;
        }
        cl::CommandQueue queue(*context, devices[0], 0, &err);
        if (err != CL_SUCCESS) {
            std::cerr << "CommandQueue() Error: " << err << std::endl;
            return;
        }
        err = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                         cl::NDRange(BUFFER_ELEMENT_COUNT), cl::NullRange);
        if (err != CL_SUCCESS) {
            std::cerr << "CommandQueue.enqueueNDRangeKernel Error: " << err << std::endl;
            return;
        }
        err = queue.finish();
        if (err != CL_SUCCESS) {
            std::cerr << "CommandQueue.finish() Error: " << err << std::endl;
            return;
        }
    }

    void cleanUp()
    {
        delete buffer;
        delete program;
        devices.clear();
        delete context;
    }

    int main()
    {
        std::cout << "Please refer to task manager for memory read-outs" << std::endl;
        std::cout << "pre setup, allocated: nothing [ENTER]" << std::endl;
        std::cin.get();
        setupCl();
        std::cout << "post setup, pre kernel run, allocated: buffer, program, devices, context [ENTER]" << std::endl;
        std::cin.get();
        for (unsigned i = 0; i < 500; i++)
            runKernel();
        std::cout << "post kernel run, pre clean-up, allocated: buffer, program, devices, context [ENTER]" << std::endl;
        std::cin.get();
        cleanUp();
        std::cout << "post clean-up, allocated: nothing [ENTER]" << std::endl;
        std::cin.get();
    }