Hi, I'm developing a physics simulator that runs about 8 different kernels in a loop. I use events because some operations can happen in parallel, though I'm not using an out-of-order command queue yet. After queueing hundreds of thousands of iterations, memory usage quickly balloons out of control and then I get a segfault.
<code>
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5d6fcd7 in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
(gdb) bt
#0 0x00007ffff5d6fcd7 in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#1 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#2 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#3 0x00007ffff5d6fcdc in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#4 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#5 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#6 0x00007ffff5d6fcdc in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#7 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#8 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#9 0x00007ffff5d6fcdc in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#10 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#11 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
[Repeats more than 30000 times; I don't know how far it goes. As an aside, I've never seen a backtrace more than 100 deep.]
</code>
I'm using OpenCL 1.1 AMD-APP (898.1).
Any ideas?
Here is the bottom of the backtrace:
<code>
[SNIP!]
#261882 0x00007ffff5d6fcdc in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261883 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261884 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261885 0x00007ffff5d6fcdc in amd::Command::~Command() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261886 0x00007ffff5d70b1a in amd::NDRangeKernelCommand::~NDRangeKernelCommand() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261887 0x00007ffff5d7b8f8 in amd::ReferenceCountedObject::release() () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261888 0x00007ffff5d4154f in clReleaseEvent () from /usr/lib64/OpenCL/vendors/amd/libamdocl64.so
#261889 0x00000000004af5c2 in cl::detail::ReferenceHandler<_cl_event*>::release (event=0x2ac4ad00) at /usr/include/CL/cl.hpp:1086
#261890 0x00000000004b10b1 in cl::detail::Wrapper<_cl_event*>::release (this=0x12d1908) at /usr/include/CL/cl.hpp:1133
#261891 0x00000000004affe8 in cl::detail::Wrapper<_cl_event*>::~Wrapper (this=0x12d1908, __in_chrg=<optimized out>) at /usr/include/CL/cl.hpp:1103
#261892 0x00000000004af650 in cl::Event::~Event (this=0x12d1908, __in_chrg=<optimized out>) at /usr/include/CL/cl.hpp:1538
#261893 0x00000000004b36c8 in std::_Destroy<cl::Event> (__pointer=0x12d1908) at /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/include/g++-v4/bits/stl_construct.h:89
#261894 0x00000000004b2fdc in std::_Destroy_aux<false>::__destroy<cl::Event*> (__first=0x12d1908, __last=0x12d1910) at /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/include/g++-v4/bits/stl_construct.h:99
#261895 0x00000000004b2329 in std::_Destroy<cl::Event*> (__first=0x12d1900, __last=0x12d1910) at /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/include/g++-v4/bits/stl_construct.h:122
#261896 0x00000000004b1435 in std::_Destroy<cl::Event*, cl::Event> (__first=0x12d1900, __last=0x12d1910) at /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/include/g++-v4/bits/stl_construct.h:148
#261897 0x00000000004b0592 in std::vector<cl::Event, std::allocator<cl::Event> >::~vector (this=0x7fffffffd220, __in_chrg=<optimized out>)
at /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/include/g++-v4/bits/stl_vector.h:313
#261898 0x00000000004af153 in amethyst::lib::Universe::cl_integrate (this=0x7fffffffd330) at /home/beau/src/amethyst/trunk/lib/universe.cpp:644
#261899 0x0000000000496f3b in amethyst::lib::test_rk4 () at /home/beau/src/amethyst/trunk/lib/test.cpp:227
#261900 0x0000000000493a0b in amethyst::Console_Menu::run (this=0x97b500, command="testrk4") at /home/beau/src/amethyst/trunk/lib/console_menu.cpp:88
#261901 0x0000000000492f94 in amethyst::command_parse (command="testrk4") at /home/beau/src/amethyst/trunk/lib/console.cpp:215
#261902 0x0000000000492a9d in amethyst::start_console () at /home/beau/src/amethyst/trunk/lib/console.cpp:85
#261903 0x0000000000492848 in main (argc=1, argv=0x7fffffffd808) at /home/beau/src/amethyst/trunk/lib/main.cpp:22
</code>
What GPU are you using? I think I am having the same issue on the 7970 (Cypress is fine). Where did you find the debug symbols for the libamdocl64.so?
I'm using a 6970. Wow, I didn't know that the 7000 series was supported on Linux yet.
As for debug symbols, gdb did that for me automagically. I honestly forget how to pull them manually; nm -C reports that there aren't any symbols.
I've been playing with this, and it appears to be related to the ridiculously long dependency chain. When I break up the workload by calling clFinish() after every 1000 iterations and starting the event chain over from scratch, I don't get these issues.
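A minimal sketch of that workaround, written against a generic queue type so it does not depend on any particular OpenCL setup. The function name `run_batched` and the `enqueue` hook are hypothetical, not part of cl.hpp; the only thing assumed of the queue is a `finish()` member, which `cl::CommandQueue` provides.

```cpp
#include <cstddef>
#include <vector>

// Enqueue `iterations` steps, draining the queue and dropping every event
// handle after each `batch` steps, so no dependency chain grows longer
// than `batch` links.
//
// Event   : default-constructible and copyable (e.g. cl::Event)
// Queue   : anything with a finish() member (e.g. cl::CommandQueue)
// Enqueue : hypothetical hook, called as enqueue(queue, wait_list, done);
//           it enqueues one iteration that waits on wait_list and
//           stores its completion event in done.
template <typename Event, typename Queue, typename Enqueue>
void run_batched(Queue& queue, std::size_t iterations, std::size_t batch,
                 Enqueue enqueue)
{
    std::vector<Event> wait_list;  // events the next step depends on
    for (std::size_t i = 0; i < iterations; ++i) {
        Event done;
        enqueue(queue, wait_list, done);
        wait_list.assign(1, done);  // the next step waits on this one
        if ((i + 1) % batch == 0) {
            queue.finish();         // block until all enqueued work has run
            wait_list.clear();      // start the next chain from scratch
        }
    }
    queue.finish();                 // drain whatever is left
}
```

With cl.hpp, the hook would presumably wrap something like `queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local, &wait_list, &done)`, and `batch` would be 1000 to match the numbers above.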
I don't think the events at the beginning of the queue get freed until the last event gets freed, but I'm just speculating.
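The repeating `release()` / `~NDRangeKernelCommand()` / `~Command()` frames in the backtrace look consistent with that guess. Below is a toy model of the suspected behaviour, not AMD's implementation: `Command`, `waits_on`, and `build_chain` are hypothetical names, and `std::shared_ptr` stands in for the runtime's reference counting. Each command keeps its predecessor alive, so nothing is freed until the tail is released, and then the destructors nest one level per link.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for a runtime command object.
struct Command {
    std::shared_ptr<Command> waits_on;  // keeps the predecessor alive

    static std::size_t live;       // Command objects currently alive
    static std::size_t depth;      // current destructor nesting
    static std::size_t max_depth;  // deepest nesting seen so far

    explicit Command(std::shared_ptr<Command> dep)
        : waits_on(std::move(dep)) { ++live; }

    ~Command() {
        ++depth;
        if (depth > max_depth) max_depth = depth;
        waits_on.reset();  // may destroy the predecessor, recursively
        --depth;
        --live;
    }
};

std::size_t Command::live = 0;
std::size_t Command::depth = 0;
std::size_t Command::max_depth = 0;

// Build a chain of n commands where each waits on the previous one,
// and return a handle to the last command only.
std::shared_ptr<Command> build_chain(std::size_t n) {
    std::shared_ptr<Command> tail;
    for (std::size_t i = 0; i < n; ++i)
        tail = std::make_shared<Command>(tail);
    return tail;
}
```

In this model, holding only the tail of a 1000-link chain keeps all 1000 commands alive, and releasing it nests 1000 destructor calls. A chain hundreds of thousands of links long would nest correspondingly deep, which is how a thread's stack could be exhausted.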
- Beau
Message was edited by: Beau Bellamy. I miswrote the original message; I meant clFinish(), not clFlush().