
Segfault (race condition) running Brook+ kernel with multi thread / multi GPU

Discussion created by frankas on Jan 2, 2010
Latest reply on Jan 2, 2010 by gaurav.garg

I was so pleased with my 5850 card that I got myself a second card, and set them up without CrossFire on my Ubuntu system.

I then adapted my Mandelbulb renderer to use both GPUs, with alternate pairs of scanlines being computed on each GPU. I use POSIX condition signalling to simultaneously call the same kernels from different threads on the respective GPUs. This works fine, but after a few seconds I get random segfaults like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x95532b90 (LWP 15616)]
0xb7f8b50a in brook::PassData::ref () from /usr/lib/libbrook.so
(gdb) bt
#0  0xb7f8b50a in brook::PassData::ref () from /usr/lib/libbrook.so
#1  0xb7f7e78b in brook::Pass::Pass () from /usr/lib/libbrook.so
#2  0xb7f76f8d in std::_Construct ()
   from /usr/lib/libbrook.so
#3  0xb7f77079 in std::__uninitialized_copy_aux<__gnu_cxx::__normal_iterator > >, brook::Pass*> () from /usr/lib/libbrook.so
#4  0xb7f77121 in std::uninitialized_copy<__gnu_cxx::__normal_iterator > >, brook::Pass*> () from /usr/lib/libbrook.so
#5  0xb7f77153 in std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator > >, brook::Pass*, brook::Pass> () from /usr/lib/libbrook.so
#6  0xb7f77249 in std::vector >::vector () from /usr/lib/libbrook.so
#7  0xb7f712b4 in KernelImpl::run () from /usr/lib/libbrook.so
#8  0xb7f7dd8d in brook::Kernel::run () from /usr/lib/libbrook.so
#9  0x08052248 in __blit_bgr24::operator() ()
#10 0x0804cd38 in Bulber::work_thread (arg=0xa122ab8) at Bulber.cpp:170
#11 0xb7b204ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#12 0xb7c1749e in clone () from /lib/tls/i686/cmov/libc.so.6

 

At the time of the crash, another worker thread is doing the exact same kernel invocation. It can have the exact same call stack as above, or a slight variation like this:

#0  0xb7fd4991 in ?? () from /lib/ld-linux.so.2
#1  0xb7fc6f27 in ?? () from /lib/ld-linux.so.2
#2  0xb7fc72bf in ?? () from /lib/ld-linux.so.2
#3  0xb7fcbe7b in ?? () from /lib/ld-linux.so.2
#4  0xb7fd19b0 in ?? () from /lib/ld-linux.so.2
#5  0xb7f811d1 in brook::DefaultHandler () from /usr/lib/libbrook.so
#6  0xb7d86f5a in operator new () from /usr/lib/libstdc++.so.6
#7  0xb7f76d8e in __gnu_cxx::new_allocator::allocate ()
   from /usr/lib/libbrook.so
#8  0xb7f76dc2 in std::_Vector_base >::_M_allocate () from /usr/lib/libbrook.so
#9  0xb7f76dff in std::_Vector_base >::_Vector_base () from /usr/lib/libbrook.so
#10 0xb7f771a3 in std::vector >::vector () from /usr/lib/libbrook.so
#11 0xb7f712b4 in KernelImpl::run () from /usr/lib/libbrook.so
#12 0xb7f7dd8d in brook::Kernel::run () from /usr/lib/libbrook.so
#13 0x08052248 in __blit_bgr24::operator() ()
#14 0x0804cd38 in Bulber::work_thread (arg=0xa122b98) at Bulber.cpp:170
#15 0xb7b204ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#16 0xb7c1749e in clone () from /lib/tls/i686/cmov/libc.so.6

My system is quad-core, and without digging too deeply into the Brook+ sources, it seems that some handling of the "Pass" classes isn't 100% thread safe.

I could try upgrading to 2.0, but I haven't seen any Brook+ tickets specific to this issue, so I won't get my hopes up.

There is an easy workaround: I can just use a mutex to protect all my kernel invocations. But if they are not supposed to be thread safe, it would be more elegant to make them thread safe by having a mutex inside KernelImpl::run().
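
A minimal sketch of the application-side workaround, under some assumptions: it uses C++11 std::thread and std::mutex instead of the pthread calls in the real program, and the kernel is a hypothetical callback (KernelCall) standing in for the generated Brook+ functor such as blit_bgr24. The names gpu_for_scanline, work_thread and render_frame are made up for illustration. It shows one worker per GPU, alternate pairs of scanlines per GPU, and a single lock held around each kernel launch.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a Brook+ kernel invocation; it receives the
// GPU index and the scanline to render. Not part of the Brook+ API.
using KernelCall = std::function<void(int gpu, int scanline)>;

// One process-wide lock that serializes every kernel launch (the workaround).
static std::mutex g_kernel_mutex;

// Scanline pairs alternate between GPUs: lines 0-1 -> GPU 0, 2-3 -> GPU 1, ...
inline int gpu_for_scanline(int scanline, int gpu_count) {
    assert(gpu_count > 0);
    return (scanline / 2) % gpu_count;
}

// Worker body: render every scanline owned by `gpu`, holding the lock
// only around the kernel launch itself.
void work_thread(int gpu, int gpu_count, int height, const KernelCall& kernel) {
    for (int y = 0; y < height; ++y) {
        if (gpu_for_scanline(y, gpu_count) != gpu) continue;
        std::lock_guard<std::mutex> lock(g_kernel_mutex);
        kernel(gpu, y);
    }
}

// Start one worker per GPU and wait for all of them to finish.
void render_frame(int gpu_count, int height, const KernelCall& kernel) {
    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < gpu_count; ++gpu)
        workers.emplace_back(work_thread, gpu, gpu_count, height, std::cref(kernel));
    for (auto& t : workers) t.join();
}
```

Calling render_frame(2, height, kernel) would then run both workers with their launches serialized. Whether the two GPUs still overlap their work under this lock depends on whether the kernel call returns before the GPU has finished, which is not verified here.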

 
