1 Reply Latest reply on Jan 2, 2010 6:10 AM by gaurav.garg

    Segfault (race condition) running brook+ kernel with multi thread / multi GPU

    frankas

      I was so pleased with my 5850 card that I got myself a second card, and set them up without CrossFire on my Ubuntu system.

      I then adapted my Mandelbulb renderer to use both GPUs, with alternate pairs of scanlines being computed on each GPU. I use POSIX condition-variable signalling to call the same kernels simultaneously from different threads, one thread per GPU. This works fine at first, but after a few seconds I get random segfaults like this:

      Program received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0x95532b90 (LWP 15616)]
      0xb7f8b50a in brook::PassData::ref () from /usr/lib/libbrook.so
      (gdb) bt
      #0  0xb7f8b50a in brook::PassData::ref () from /usr/lib/libbrook.so
      #1  0xb7f7e78b in brook::Pass::Pass () from /usr/lib/libbrook.so
      #2  0xb7f76f8d in std::_Construct ()
         from /usr/lib/libbrook.so
      #3  0xb7f77079 in std::__uninitialized_copy_aux<__gnu_cxx::__normal_iterator > >, brook::Pass*> () from /usr/lib/libbrook.so
      #4  0xb7f77121 in std::uninitialized_copy<__gnu_cxx::__normal_iterator > >, brook::Pass*> () from /usr/lib/libbrook.so
      #5  0xb7f77153 in std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator > >, brook::Pass*, brook::Pass> () from /usr/lib/libbrook.so
      #6  0xb7f77249 in std::vector >::vector () from /usr/lib/libbrook.so
      #7  0xb7f712b4 in KernelImpl::run () from /usr/lib/libbrook.so
      #8  0xb7f7dd8d in brook::Kernel::run () from /usr/lib/libbrook.so
      #9  0x08052248 in __blit_bgr24::operator() ()
      #10 0x0804cd38 in Bulber::work_thread (arg=0xa122ab8) at Bulber.cpp:170
      #11 0xb7b204ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
      #12 0xb7c1749e in clone () from /lib/tls/i686/cmov/libc.so.6

       

      The crash happens while another worker thread is doing the exact same kernel invocation. That other thread can have the exact same call stack as above, or a slight variation like this:

      #0  0xb7fd4991 in ?? () from /lib/ld-linux.so.2
      #1  0xb7fc6f27 in ?? () from /lib/ld-linux.so.2
      #2  0xb7fc72bf in ?? () from /lib/ld-linux.so.2
      #3  0xb7fcbe7b in ?? () from /lib/ld-linux.so.2
      #4  0xb7fd19b0 in ?? () from /lib/ld-linux.so.2
      #5  0xb7f811d1 in brook::DefaultHandler () from /usr/lib/libbrook.so
      #6  0xb7d86f5a in operator new () from /usr/lib/libstdc++.so.6
      #7  0xb7f76d8e in __gnu_cxx::new_allocator::allocate ()
         from /usr/lib/libbrook.so
      #8  0xb7f76dc2 in std::_Vector_base >::_M_allocate () from /usr/lib/libbrook.so
      #9  0xb7f76dff in std::_Vector_base >::_Vector_base () from /usr/lib/libbrook.so
      #10 0xb7f771a3 in std::vector >::vector () from /usr/lib/libbrook.so
      #11 0xb7f712b4 in KernelImpl::run () from /usr/lib/libbrook.so
      #12 0xb7f7dd8d in brook::Kernel::run () from /usr/lib/libbrook.so
      #13 0x08052248 in __blit_bgr24::operator() ()
      #14 0x0804cd38 in Bulber::work_thread (arg=0xa122b98) at Bulber.cpp:170
      #15 0xb7b204ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
      #16 0xb7c1749e in clone () from /lib/tls/i686/cmov/libc.so.6
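
      For context, the dispatch scheme described at the top of the post (one worker thread per GPU, released together by a condition variable, alternate pairs of scanlines per GPU) might look roughly like the sketch below. This is not the actual Bulber.cpp code: `StartGate` and `render_rows` are hypothetical names, the C++11 `std::condition_variable` stands in for the POSIX `pthread_cond_*` calls, and writing to `owner[y]` stands in for the real Brook+ kernel call.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical start gate: workers block in wait() until release() is called.
class StartGate {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate protects against spurious wakeups and late arrivals.
        cond_.wait(lock, [this] { return open_; });
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            open_ = true;
        }
        cond_.notify_all();  // wake every waiting worker at once
    }
private:
    std::mutex mutex_;
    std::condition_variable cond_;
    bool open_ = false;
};

// Assigns scanlines to workers in alternating pairs: worker g handles
// every line y where (y / 2) % gpu_count == g. Returns the owner of each line.
std::vector<int> render_rows(int height, int gpu_count) {
    assert(height >= 0 && gpu_count > 0);
    std::vector<int> owner(height, -1);
    StartGate gate;
    std::vector<std::thread> workers;
    for (int g = 0; g < gpu_count; ++g) {
        workers.emplace_back([&, g] {
            gate.wait();  // all workers start together
            for (int y = 0; y < height; ++y) {
                if ((y / 2) % gpu_count == g) {
                    owner[y] = g;  // stand-in for the kernel call on GPU g
                }
            }
        });
    }
    gate.release();
    for (std::thread& w : workers) w.join();
    return owner;
}
```

      Each worker writes only its own scanlines, so the application side of this scheme has no shared mutable state; the shared state that appears in the backtraces lives inside libbrook.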

      My system is quad-core, and without digging too deeply into the Brook+ sources, it seems that some handling of the "Pass" classes isn't 100% thread safe.

      I could try to upgrade to 2.0 - but I haven't seen any Brook+ tickets specific to this issue, so I won't get my hopes up.

      There is an easy workaround: I can just use a mutex to protect all my kernel invocations. But if they are not supposed to be thread safe, it would be more elegant to make them thread safe by having a mutex inside KernelImpl::run().
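
      A minimal sketch of the application-side workaround, assuming a single process-wide lock around every kernel launch is acceptable. `g_kernel_launch_mutex`, `locked_launch` and `run_demo` are hypothetical names, not part of Brook+; the callable passed to `locked_launch` stands in for a real kernel call, and the unsynchronized `std::vector` in the demo plays the role of the shared state that seems to be corrupted inside the library. `std::mutex` is used here where the original code would use `pthread_mutex_t`.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One lock shared by every thread that launches kernels (hypothetical name).
static std::mutex g_kernel_launch_mutex;

// Runs `launch` while holding the global lock. In the real program `launch`
// would wrap the Brook+ kernel call made from the worker thread.
template <typename Launch>
void locked_launch(Launch&& launch) {
    std::lock_guard<std::mutex> guard(g_kernel_launch_mutex);
    std::forward<Launch>(launch)();
}

// Demo: each thread performs `launches_per_thread` launches that append to a
// shared, unsynchronized vector. Without the lock this would be a data race.
// Returns the total number of launches recorded.
int run_demo(int thread_count, int launches_per_thread) {
    assert(thread_count >= 0 && launches_per_thread >= 0);
    std::vector<int> shared_log;  // deliberately not thread safe on its own
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&shared_log, t, launches_per_thread] {
            for (int i = 0; i < launches_per_thread; ++i) {
                locked_launch([&shared_log, t] { shared_log.push_back(t); });
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return static_cast<int>(shared_log.size());
}
```

      The lock only serializes the host-side call. Whether the two GPUs still overlap their work then depends on whether the kernel call returns before the GPU has finished, which this sketch does not establish. A mutex inside KernelImpl::run() would have the same effect without every caller needing to remember it.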