12 Replies Latest reply on Nov 11, 2010 8:33 AM by himanshu.gautam

    Let's hunt a memory leak!

    AndreasStahl
      seems to be internal to the runtime, source code included

      Hello,

      after searching my code for several months for a rather serious memory leak, I tried to reduce the problem to its core. It seems to happen either when a CommandQueue or a Kernel is being created, its arguments set, or during/after execution. I have attached a minimalistic C++ program, demonstrating this behaviour, to this message.

      What it does is the following: during setup it creates a context, gets device handles, compiles a very simple increment kernel, creates a buffer of size 8 MByte and fills it with 0.

      Then it does the following 100 times: get the command queue, create the kernel from the program, set the buffer as kernel argument, enqueue the kernel, wait for the queue to finish. Allocation and deallocation are handled by the stack.

      Afterwards the buffer, program, devices and the context are manually deallocated.

      I made it so it halts

      1. before setup, 
      2. before allocating, executing, deallocing the queue and kernel 100 times, 
      3. after that, and 
      4. after I manually deallocate buffer, program, context etc.

        If you look at the process's memory usage in Task Manager, it should be roughly equal at the second and third halting points, and at the last halting point it should be equal to the first. But it's not. Not at all! Here are my read-outs from Windows Task Manager when run on CL_DEVICE_TYPE_CPU:

        1. 2,344 K
        2. 22,708 K
        3. 39,596 K
        4. 31,380 K

        So for 100 iterations, 39,596 K - 22,708 K = 16,888 K leaked (reading 3 minus reading 2). When I increase the iteration count to 200, memory usage after kernel execution is 56,596 K, indicating a leak of 33,888 K!

        300 iterations: 50,948 K leaked

        400 iterations: 67,644 K leaked

        This indicates a leak of ~169 K per Iteration.

        For iteration counts over ~500, it fails during CommandQueue(), citing error code -6 -- Out of host memory.

        When I halve the buffer size, the numbers don't change.

        On CL_DEVICE_TYPE_GPU it leaks ~50 K per iteration.

        But maybe the problem is BKAC*, so please help me identify whether there is something totally wrong with my memory allocation/deallocation pattern. Should I allocate the queue and kernels only once during setup? I tried this in my production code once, but as soon as I had created the command queue handle, the program stopped responding to input via the GUI.

        OS: Win7 x64

        RAM: 4 GByte

        Compiler: VC++ 2008

        Devices: Athlon x64 CPU (1 GB reported), Juniper GPU (5770, 128 MB reported)

        *) between keyboard and chair, i.e. me

        #include <CL/cl.hpp>
        #include <cstdio>
        #include <cstdlib>
        #include <cstring> // memset
        #include <iostream>

        // a VERY simple Kernel
        std::string kernelSource = "__kernel void inc(__global int* a){ a[get_global_id(0)] += 1; }";

        const int BUFFER_ELEMENT_COUNT = 1024 * 1024 * 2; // times sizeof(cl_int) equals 8 MByte

        // define these as pointers, as that's how I have to do it in my production code.
        cl::Context *context;
        std::vector<cl::Device> devices;
        cl::Program *program;
        cl::Buffer *buffer;

        int setupCl()
        {
            // create context, get devices, build program
            cl_int err;
            context = new cl::Context(CL_DEVICE_TYPE_CPU, 0, NULL, NULL, &err);
            devices = context->getInfo<CL_CONTEXT_DEVICES>();
            if(devices.empty())
                return !CL_SUCCESS;
            cl::Program::Sources source(1, std::make_pair(kernelSource.c_str(), kernelSource.size()));
            program = new cl::Program(*context, source);
            program->build(devices);

            // create and fill the test buffer
            cl_int *a = new cl_int[BUFFER_ELEMENT_COUNT];
            memset(a, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
            buffer = new cl::Buffer(*context, CL_MEM_READ_WRITE, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
            cl::CommandQueue queue(*context, devices[0], 0, &err);
            queue.enqueueWriteBuffer(*buffer, CL_TRUE, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int), a);
            queue.finish();
            delete[] a; // clear the host array
            return CL_SUCCESS;
        }

        void runKernel()
        {
            cl_int err;
            cl::Kernel kernel(*program, "inc", &err);
            err = kernel.setArg(0, *buffer);
            if(err != CL_SUCCESS){
                std::cerr << "Kernel.setArg() Error: " << err << std::endl;
                return;
            }
            cl::CommandQueue queue(*context, devices[0], 0, &err);
            if(err != CL_SUCCESS){
                std::cerr << "CommandQueue() Error: " << err << std::endl;
                return;
            }
            err = queue.enqueueNDRangeKernel(
                kernel,
                cl::NullRange,
                cl::NDRange(BUFFER_ELEMENT_COUNT),
                cl::NullRange
            );
            if(err != CL_SUCCESS){
                std::cerr << "CommandQueue.enqueueNDRangeKernel Error: " << err << std::endl;
                return;
            }
            err = queue.finish();
            if(err != CL_SUCCESS){
                std::cerr << "CommandQueue.finish() Error: " << err << std::endl;
                return;
            }
        }

        void cleanUp()
        {
            delete buffer;
            delete program;
            devices.clear();
            delete context;
        }

        int main()
        {
            std::cout << "Please refer to task manager for memory read-outs" << std::endl;
            std::cout << "pre setup, allocated: nothing [ENTER]" << std::endl;
            std::cin.get();

            setupCl();
            std::cout << "post setup, pre kernel run, allocated: buffer, program, devices, context [ENTER]" << std::endl;
            std::cin.get();

            for(unsigned i = 0; i < 500; i++)
                runKernel();
            std::cout << "post kernel run, pre clean-up, allocated: buffer, program, devices, context [ENTER]" << std::endl;
            std::cin.get();

            cleanUp();
            std::cout << "post clean-up, allocated: nothing [ENTER]" << std::endl;
            std::cin.get();
        }

          • Let's hunt a memory leak!
            AndreasStahl

            I attached a more sophisticated version, which automatically outputs memory usage after each step and calculates leak size total and per iteration.

            Also I added an inner iteration, which reuses the command queue and kernel... but there's still a leak (though much less).

            Iteration counts can be controlled via the constants INNER_ITERATION_COUNT and OUTER_ITERATION_COUNT.

            You may need to add Psapi.lib to your linker configuration.

            #include <CL/cl.hpp>
            #include <cstdio>
            #include <cstdlib>
            #include <cstring> // memset
            #include <iostream>
            #include <psapi.h>

            // a VERY simple Kernel
            std::string kernelSource = "__kernel void inc(__global int* a){ a[get_global_id(0)] += 1; }";

            const int BUFFER_ELEMENT_COUNT = 1024 * 1024 * 2; // times sizeof(cl_int) equals 8 MByte
            const unsigned OUTER_ITERATION_COUNT = 1;
            const unsigned INNER_ITERATION_COUNT = 100;

            // define these as pointers, as that's how I have to do it in my production code.
            cl::Context *context;
            std::vector<cl::Device> devices;
            cl::Program *program;
            cl::Buffer *buffer;

            int setupCl()
            {
                // create context, get devices, build program
                cl_int err;
                context = new cl::Context(CL_DEVICE_TYPE_GPU, 0, NULL, NULL, &err);
                devices = context->getInfo<CL_CONTEXT_DEVICES>();
                if(devices.empty())
                    return !CL_SUCCESS;
                cl::Program::Sources source(1, std::make_pair(kernelSource.c_str(), kernelSource.size()));
                program = new cl::Program(*context, source);
                program->build(devices);

                // create and fill the test buffer
                cl_int *a = new cl_int[BUFFER_ELEMENT_COUNT];
                memset(a, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
                buffer = new cl::Buffer(*context, CL_MEM_READ_WRITE, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
                cl::CommandQueue queue(*context, devices[0], 0, &err);
                queue.enqueueWriteBuffer(*buffer, CL_TRUE, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int), a);
                queue.finish();
                delete[] a; // clear the host array
                return CL_SUCCESS;
            }

            void runKernel()
            {
                cl_int err;
                cl::Kernel kernel(*program, "inc", &err);
                err = kernel.setArg(0, *buffer);
                if(err != CL_SUCCESS){
                    std::cerr << "Kernel.setArg() Error: " << err << std::endl;
                    return;
                }
                cl::CommandQueue queue(*context, devices[0], 0, &err);
                if(err != CL_SUCCESS){
                    std::cerr << "CommandQueue() Error: " << err << std::endl;
                    return;
                }
                // enqueue the kernel INNER_ITERATION_COUNT times
                for(unsigned i = 0; i < INNER_ITERATION_COUNT; i++){
                    err = queue.enqueueNDRangeKernel(
                        kernel,
                        cl::NullRange,
                        cl::NDRange(BUFFER_ELEMENT_COUNT),
                        cl::NullRange
                    );
                    if(err != CL_SUCCESS){
                        std::cerr << "CommandQueue.enqueueNDRangeKernel Error: " << err << std::endl;
                        return;
                    }
                }
                err = queue.finish();
                if(err != CL_SUCCESS){
                    std::cerr << "CommandQueue.finish() Error: " << err << std::endl;
                    return;
                }
            }

            unsigned readFirstBufferValue(){
                unsigned result;
                cl::CommandQueue queue(*context, devices[0], 0);
                queue.enqueueReadBuffer(*buffer, CL_TRUE, 0, sizeof(cl_int), &result);
                queue.finish();
                return result;
            }

            void cleanUp()
            {
                delete buffer;
                delete program;
                devices.clear();
                delete context;
            }

            int main()
            {
                std::cout << "pre setup, allocated: nothing [ENTER]" << std::endl;
                size_t leak = 0;
                PROCESS_MEMORY_COUNTERS pmc;
                if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                {
                    std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                }
                std::cin.get();

                // perform setup
                setupCl();
                std::cout << "post setup, pre kernel run, allocated: buffer, program, devices, context [ENTER]" << std::endl;
                if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                {
                    std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                    leak = pmc.WorkingSetSize;
                }
                std::cin.get();

                // perform kernel INNER * OUTER_ITERATION_COUNT times
                for(unsigned i = 0; i < OUTER_ITERATION_COUNT; i++)
                    runKernel();
                std::cout << "post kernel run, pre clean-up, allocated: buffer, program, devices, context [ENTER]" << std::endl;
                if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                {
                    std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                    leak = pmc.WorkingSetSize - leak;
                    std::cout << "leak after "<< INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT;
                    std::cout <<" iterations: " << leak << " Bytes, ";
                    std::cout << leak / (INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT)<< " Bytes per Iteration" << std::endl;
                    std::cout << "First buffer value (expected: " << INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT << "): ";
                    std::cout << readFirstBufferValue() << std::endl;
                }
                std::cin.get();

                // perform clean-up
                cleanUp();
                std::cout << "post clean-up, allocated: nothing [ENTER]" << std::endl;
                if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                {
                    std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                }
                std::cin.get();
            }

              • Let's hunt a memory leak!
                AndreasStahl

                And here's the same code without using the C++ bindings, releasing everything by calling clRelease... manually. It still leaks.

                Can somebody deny/confirm this? It seems to be a problem of the OpenCL runtime.

                 

                #include <CL/cl.hpp>
                #include <cstdio>
                #include <cstdlib>
                #include <cstring> // memset
                #include <iostream>
                #include <psapi.h>

                // a VERY simple Kernel
                std::string kernelSource = "__kernel void inc(__global int* a){ a[get_global_id(0)] += 1; }";

                const int BUFFER_ELEMENT_COUNT = 1024 * 1024 * 2; // times sizeof(cl_int) equals 8 MByte
                const unsigned OUTER_ITERATION_COUNT = 100;
                const unsigned INNER_ITERATION_COUNT = 1;

                // define these as pointers, as that's how I have to do it in my production code.
                cl_context context;
                cl_device_id *devices;
                cl_program program;
                cl_mem buffer;

                int setupCl()
                {
                    // create context, get devices, build program
                    cl_int err;
                    size_t deviceListSize;
                    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &err);
                    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize);
                    if(deviceListSize == 0)
                        return !CL_SUCCESS;
                    devices = (cl_device_id *)malloc(deviceListSize);
                    clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL);
                    const char * source = kernelSource.c_str();
                    size_t sourceSize[] = { kernelSource.size() };
                    program = clCreateProgramWithSource(context, 1, &source, sourceSize, &err);
                    clBuildProgram(program, 1, devices, NULL, NULL, NULL);

                    // create and fill the test buffer
                    cl_int *a = (cl_int *) malloc(BUFFER_ELEMENT_COUNT * sizeof(cl_int));
                    memset(a, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int));
                    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, BUFFER_ELEMENT_COUNT * sizeof(cl_int), NULL, &err);
                    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &err);
                    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, BUFFER_ELEMENT_COUNT * sizeof(cl_int), a, 0, NULL, NULL);
                    clFinish(queue);
                    free(a); // clear the host array
                    clReleaseCommandQueue(queue);
                    return CL_SUCCESS;
                }

                void runKernel()
                {
                    cl_int err;
                    cl_kernel kernel = clCreateKernel(program, "inc", &err);
                    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&buffer);
                    if(err != CL_SUCCESS){
                        std::cerr << "Kernel.setArg() Error: " << err << std::endl;
                        return;
                    }
                    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &err);
                    if(err != CL_SUCCESS){
                        std::cerr << "CommandQueue() Error: " << err << std::endl;
                        return;
                    }
                    // enqueue the kernel INNER_ITERATION_COUNT times
                    size_t globalThreads[1];
                    globalThreads[0] = BUFFER_ELEMENT_COUNT;
                    for(unsigned i = 0; i < INNER_ITERATION_COUNT; i++){
                        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalThreads, NULL, 0, NULL, NULL);
                        if(err != CL_SUCCESS){
                            std::cerr << "CommandQueue.enqueueNDRangeKernel Error: " << err << std::endl;
                            return;
                        }
                    }
                    err = clFinish(queue); // keep the result so the check below tests clFinish
                    clReleaseCommandQueue(queue);
                    clReleaseKernel(kernel);
                    if(err != CL_SUCCESS){
                        std::cerr << "CommandQueue.finish() Error: " << err << std::endl;
                        return;
                    }
                }

                cl_int readFirstBufferValue(){
                    cl_int result;
                    cl_int err;
                    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &err);
                    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(cl_int), &result, 0, NULL, NULL);
                    clFinish(queue);
                    clReleaseCommandQueue(queue);
                    return result;
                }

                void cleanUp()
                {
                    clReleaseMemObject(buffer);
                    clReleaseProgram(program);
                    free(devices);
                    clReleaseContext(context);
                }

                int main()
                {
                    std::cout << "pre setup, allocated: nothing [ENTER]" << std::endl;
                    size_t leak = 0;
                    PROCESS_MEMORY_COUNTERS pmc;
                    if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                    {
                        std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                    }
                    std::cin.get();

                    // perform setup
                    setupCl();
                    std::cout << "post setup, pre kernel run, allocated: buffer, program, devices, context [ENTER]" << std::endl;
                    if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                    {
                        std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                        leak = pmc.WorkingSetSize;
                    }
                    std::cin.get();

                    // perform kernel INNER * OUTER_ITERATION_COUNT times
                    for(unsigned i = 0; i < OUTER_ITERATION_COUNT; i++)
                        runKernel();
                    std::cout << "post kernel run, pre clean-up, allocated: buffer, program, devices, context [ENTER]" << std::endl;
                    if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                    {
                        std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                        leak = pmc.WorkingSetSize - leak;
                        std::cout << "leak after "<< INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT;
                        std::cout <<" iterations: " << leak << " Bytes, ";
                        std::cout << leak / (INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT)<< " Bytes per Iteration" << std::endl;
                        std::cout << "First buffer value (expected: " << INNER_ITERATION_COUNT * OUTER_ITERATION_COUNT << "): ";
                        std::cout << readFirstBufferValue() << std::endl;
                    }
                    std::cin.get();

                    // perform clean-up
                    cleanUp();
                    std::cout << "post clean-up, allocated: nothing [ENTER]" << std::endl;
                    if ( GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc)) )
                    {
                        std::cout << pmc.WorkingSetSize << " Bytes" << std::endl;
                    }
                    std::cin.get();
                }

                  • Let's hunt a memory leak!
                    nou

                    I can confirm. I compiled the first program on Ubuntu 9.04 x64, CPU device. Here are the results:

                    1. 1.4 MB
                    2. 20 MB
                    3. 129.3 MB
                    4. 121.3 MB

                    I increased the iteration count to 2000 without error. That means I get around 55 kB of memory leak per iteration.

                      • Let's hunt a memory leak!
                        hazeman

                        It looks like there are quite a lot of memory leaks in the OpenCL/CAL code.

                        Here is output from valgrind with --leak-check=full ( only 4 iterations of runKernel )

                         

                         

                        ==7142== 11 bytes in 1 blocks are definitely lost in loss record 8 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x5026414: amd::CodeCache::registerCode(amd::Assembler const&) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x50264D2: amd::CodeCache::init() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502C055: amd::Runtime::init(amd::Thread*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FEE157: amd::HostThread::HostThread() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FED9B4: clCreateContextFromType (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4016C5: setupCl() (testprog.cpp:25)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 16 bytes in 1 blocks are definitely lost in loss record 9 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x502DC1F: amd::CommandQueue::~CommandQueue() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502BEEE: amd::ReferenceCountedObject::release() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF584E: clReleaseCommandQueue (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x401871: setupCl() (testprog.cpp:44)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 48 bytes in 3 blocks are definitely lost in loss record 32 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x502D007: amd::CommandQueue::finish() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF5846: clReleaseCommandQueue (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x40165D: runKernel() (testprog.cpp:82)
                        ==7142==    by 0x401898: main (testprog.cpp:117)
                        ==7142==
                        ==7142== 134 (40 direct, 94 indirect) bytes in 1 blocks are definitely lost in loss record 52 of 190
                        ==7142==    at 0x4C2726C: operator new(unsigned long) (vg_replace_malloc.c:230)
                        ==7142==    by 0x5246F29: llvm::MemoryBuffer::getFile(char const*, std::string*, long) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500059C: (within /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500204E: amd::llvmLinkOptCG(std::string&, std::string&, std::string&, bool, bool) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x501792E: gpu::Program::compileSourceToBinary(std::string const&, char const*, void**, unsigned long*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x501A665: gpu::Program::compile(std::string const&, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500551C: device::Program::build(std::string const*, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5029235: amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF7656: clBuildProgram (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4017A9: setupCl() (testprog.cpp:33)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 88 (8 direct, 80 indirect) bytes in 1 blocks are definitely lost in loss record 57 of 190
                        ==7142==    at 0x4C2726C: operator new(unsigned long) (vg_replace_malloc.c:230)
                        ==7142==    by 0x53F6E9D: llvm::DerivedType::dropAllTypeUses() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53F7079: llvm::DerivedType::unlockedRefineAbstractTypeTo(llvm::Type const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53FF969: llvm::TypeMap<llvm::PointerValType, llvm::PointerType>::RefineAbstractType(llvm::PointerType*, llvm::DerivedType const*, llvm::Type const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53F70DF: llvm::DerivedType::unlockedRefineAbstractTypeTo(llvm::Type const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53F7226: llvm::DerivedType::refineAbstractTypeTo(llvm::Type const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53CEFCF: llvm::BitcodeReader::ParseTypeTable() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53D8F34: llvm::BitcodeReader::ParseModule(std::string const&) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53D9B1F: llvm::BitcodeReader::ParseBitcode() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53D9EED: llvm::getBitcodeModuleProvider(llvm::MemoryBuffer*, llvm::LLVMContext&, std::string*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x53D9FEC: llvm::ParseBitcodeFile(llvm::MemoryBuffer*, llvm::LLVMContext&, std::string*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5002947: amd::llvmLinkOptCG(std::string&, std::string&, std::string&, bool, bool) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==
                        ==7142== 16 bytes in 1 blocks are definitely lost in loss record 59 of 190
                        ==7142==    at 0x4C2726C: operator new(unsigned long) (vg_replace_malloc.c:230)
                        ==7142==    by 0x4FFFC18: (within /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5002AA5: amd::llvmLinkOptCG(std::string&, std::string&, std::string&, bool, bool) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x501792E: gpu::Program::compileSourceToBinary(std::string const&, char const*, void**, unsigned long*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x501A665: gpu::Program::compile(std::string const&, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500551C: device::Program::build(std::string const*, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5029235: amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF7656: clBuildProgram (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4017A9: setupCl() (testprog.cpp:33)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 170 bytes in 4 blocks are possibly lost in loss record 78 of 190
                        ==7142==    at 0x4C2726C: operator new(unsigned long) (vg_replace_malloc.c:230)
                        ==7142==    by 0x5C3E160: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.10)
                        ==7142==    by 0x5C3EB24: (within /usr/lib/libstdc++.so.6.0.10)
                        ==7142==    by 0x5C3EC62: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.10)
                        ==7142==    by 0x4FEE0D9: amd::HostThread::HostThread() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FED9B4: clCreateContextFromType (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4016C5: setupCl() (testprog.cpp:25)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 472 bytes in 1 blocks are definitely lost in loss record 100 of 190
                        ==7142==    at 0x4C254D0: memalign (vg_replace_malloc.c:460)
                        ==7142==    by 0x4C2558A: posix_memalign (vg_replace_malloc.c:569)
                        ==7142==    by 0x5026DFC: amd::Os::alignedMalloc(unsigned long, unsigned long) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500943F: cpu::Device::init() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x50054B5: amd::Device::init() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502C082: amd::Runtime::init(amd::Thread*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FEE157: amd::HostThread::HostThread() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FED9B4: clCreateContextFromType (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4016C5: setupCl() (testprog.cpp:25)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 960 bytes in 3 blocks are possibly lost in loss record 108 of 190
                        ==7142==    at 0x4C25684: calloc (vg_replace_malloc.c:397)
                        ==7142==    by 0x4012215: _dl_allocate_tls (in /lib/ld-2.9.so)
                        ==7142==    by 0x6FEC5E3: pthread_create@@GLIBC_2.2.5 (in /lib/libpthread-2.9.so)
                        ==7142==    by 0x5026C82: amd::Os::createOsThread(amd::Thread*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5031207: amd::Thread::Thread(std::string const&, unsigned long, bool) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502D673: amd::CommandQueue::CommandQueue(amd::Context&, amd::Device&, unsigned long) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF46C4: clCreateCommandQueue (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x401812: setupCl() (testprog.cpp:40)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 1,024 bytes in 1 blocks are definitely lost in loss record 110 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x50265CD: amd::Assembler::Assembler() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502647D: amd::CodeCache::init() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502C055: amd::Runtime::init(amd::Thread*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FEE157: amd::HostThread::HostThread() (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FED9B4: clCreateContextFromType (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4016C5: setupCl() (testprog.cpp:25)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 1,114 (192 direct, 922 indirect) bytes in 2 blocks are definitely lost in loss record 111 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x502D63F: amd::CommandQueue::CommandQueue(amd::Context&, amd::Device&, unsigned long) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF46C4: clCreateCommandQueue (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x401585: runKernel() (testprog.cpp:57)
                        ==7142==    by 0x401898: main (testprog.cpp:117)
                        ==7142==
                        ==7142== 640 bytes in 2 blocks are definitely lost in loss record 125 of 190
                        ==7142==    at 0x4C25684: calloc (vg_replace_malloc.c:397)
                        ==7142==    by 0x4012215: _dl_allocate_tls (in /lib/ld-2.9.so)
                        ==7142==    by 0x6FEC5E3: pthread_create@@GLIBC_2.2.5 (in /lib/libpthread-2.9.so)
                        ==7142==    by 0x5026C82: amd::Os::createOsThread(amd::Thread*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5031207: amd::Thread::Thread(std::string const&, unsigned long, bool) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x502D673: amd::CommandQueue::CommandQueue(amd::Context&, amd::Device&, unsigned long) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF46C4: clCreateCommandQueue (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x401585: runKernel() (testprog.cpp:57)
                        ==7142==    by 0x401898: main (testprog.cpp:117)
                        ==7142==
                        ==7142== 676 bytes in 1 blocks are definitely lost in loss record 126 of 190
                        ==7142==    at 0x4C26B2C: operator new[](unsigned long) (vg_replace_malloc.c:274)
                        ==7142==    by 0x5017A2F: gpu::Program::compileSourceToBinary(std::string const&, char const*, void**, unsigned long*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x501A665: gpu::Program::compile(std::string const&, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x500551C: device::Program::build(std::string const*, char const*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x5029235: amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*) (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4FF7656: clBuildProgram (in /opt/ati-stream-sdk-v2.0-beta4-lnx64/lib/x86_64/libOpenCL.so)
                        ==7142==    by 0x4017A9: setupCl() (testprog.cpp:33)
                        ==7142==    by 0x40188A: main (testprog.cpp:113)
                        ==7142==
                        ==7142== 258,232 bytes in 80 blocks are possibly lost in loss record 180 of 190
                        ==7142==    at 0x4C278AE: malloc (vg_replace_malloc.c:207)
                        ==7142==    by 0x855330C: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x8553398: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x8300557: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x83010A3: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x8302A8D: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x830355F: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x82FFBB9: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x82FF747: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x82FBD2A: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x82F3B14: (within /usr/lib/libaticaldd.so)
                        ==7142==    by 0x8221E62: (within /usr/lib/libaticaldd.so)
                        ==7142==
                        ==7142== LEAK SUMMARY:
                        ==7142==    definitely lost: 3,143 bytes in 15 blocks.
                        ==7142==    indirectly lost: 1,096 bytes in 13 blocks.
                        ==7142==    possibly lost: 259,362 bytes in 87 blocks.
                        ==7142==    still reachable: 19,160,701 bytes in 22,672 blocks.
                        ==7142==    suppressed: 0 bytes in 0 blocks.
                        ==7142== Reachable blocks (those to which a pointer was found) are not shown.
                        ==7142== To see them, rerun with: --leak-check=full --show-reachable=yes

                          • Let's hunt a memory leak!
                            nou

                            I think it is here. But from now on it is up to the AMD folks.

                            //with 5 runKernel()
                            5,760 bytes in 18 blocks are possibly lost in loss record 125 of 155
                            //with 50 runKernel()
                            48,960 bytes in 153 blocks are possibly lost in loss record 142 of 155
                               at 0x4C25684: calloc (vg_replace_malloc.c:397)
                               by 0x4012215: _dl_allocate_tls (in /lib/ld-2.9.so)
                               by 0x6FE85E3: pthread_create@@GLIBC_2.2.5 (in /lib/libpthread-2.9.so)
                               by 0x5026C82: amd::Os::createOsThread(amd::Thread*) (in /home/nou/Plocha/GPGPU/ati-stream2.0/lib/x86_64/libOpenCL.so)
                               by 0x5031207: amd::Thread::Thread(std::string const&, unsigned long, bool) (in /home/nou/Plocha/GPGPU/ati-stream2.0/lib/x86_64/libOpenCL.so)
                               by 0x502D673: amd::CommandQueue::CommandQueue(amd::Context&, amd::Device&, unsigned long) (in /home/nou/Plocha/GPGPU/ati-stream2.0/lib/x86_64/libOpenCL.so)
                               by 0x4FF46C4: clCreateCommandQueue (in /home/nou/Plocha/GPGPU/ati-stream2.0/lib/x86_64/libOpenCL.so)
                               by 0x403461: cl::CommandQueue::CommandQueue(cl::Context const&, cl::Device const&, unsigned long, int*) (cl.hpp:3776)
                               by 0x401840: runKernel() (leak_test.cpp:51)
                               by 0x401F00: main (leak_test.cpp:91)

                    • Let's hunt a memory leak!
                      MicahVillmow
                      Thanks for reporting this, this is something we are looking at.
                        • Let's hunt a memory leak!
                          CodyIrons

                          Hi there hope everything is going well,

                          I was wondering whether this is still being looked at, or whether somewhere along the way it was presumed to be fixed. I ask because we are experiencing similar issues with the C# wrapper library CLoo, and some of our tests fall in line with what this thread describes.

                          Thanks,

                          -Cody

                        • Let's hunt a memory leak!
                          Fuxianjun

                          Hi, I encountered the same problem as you did. How did you resolve it? I mean, when executing clEnqueueNDRangeKernel() many times in a loop, how do I avoid the memory leak? Thanks!

                            • Let's hunt a memory leak!
                              jonne

                              I'm experiencing a similar problem when I do a standard memcheck with valgrind. Depending on how long my program runs, I receive up to a million errors, and then valgrind stops counting them.

                              This occurs even when I run valgrind on sample code from some OpenCL tutorials, with the CPU as the device.

                              Could this be because of compatibility issues with the AMD implementation/driver?

                                • Let's hunt a memory leak!
                                  douglas125

                                  I have this problem too, with buffer objects.

                                  It looks like it's necessary to create all kernels, command queues and buffer objects/images beforehand.

                                  Manually disposing them doesn't work...

                                  The way I found to circumvent the issue was to create and store all kernels, command queues and buffers as a preprocessing step. Precomputing the buffers may be tricky because you can't always know the desired size but it's possible to minimize creation of new objects.