5 Replies Latest reply on Apr 26, 2011 4:45 PM by rick.weber

    Weird bug causes implosion of entire universe

    rick.weber

      When running my application, I get the following error:

      ../../../cpucommand.cpp:260: ShouldNotReachHere()

      Here is the stack trace:
      #0  0x00007ffff6b8cba5 in raise () from /lib/libc.so.6
      #1  0x00007ffff6b906b0 in abort () from /lib/libc.so.6
      #2  0x00007ffff46ef919 in ?? ()
         from /nfs/sw/opencl/ati/current/lib/x86_64/libamdocl64.so
      #3  0x00007ffff469ab81 in ?? ()
         from /nfs/sw/opencl/ati/current/lib/x86_64/libamdocl64.so
      #4  0x00007ffff469b493 in ?? ()
         from /nfs/sw/opencl/ati/current/lib/x86_64/libamdocl64.so
      #5  0x00007ffff46ef08c in ?? ()
         from /nfs/sw/opencl/ati/current/lib/x86_64/libamdocl64.so
      #6  0x00007ffff46ed19d in ?? ()
         from /nfs/sw/opencl/ati/current/lib/x86_64/libamdocl64.so
      #7  0x00007ffff6943971 in start_thread () from /lib/libpthread.so.0
      #8  0x00007ffff6c3f92d in clone () from /lib/libc.so.6
      #9  0x0000000000000000 in ?? ()
      A repro is kinda hard to supply because I don't know what causes it...


        • Weird bug causes implosion of entire universe
          himanshu.gautam

          Please provide your system configuration: CPU, GPU, SDK, driver, and OS.

          Have you tried debugging your code using GDB?

            • Weird bug causes implosion of entire universe
              rick.weber

               

              Originally posted by: himanshu.gautam
              "Please provide your system configuration: CPU, GPU, SDK, driver, and OS. Have you tried debugging your code using GDB?"

              I'm using an Intel processor with SDK 2.4, which I know is an unsupported configuration, but I'm pretty sure the error will appear on an AMD processor as well. I'm using Ubuntu 10.10.

              Using GDB and commenting stuff out, I found that this error is triggered by a single barrier() call after a massive loop. In fact, if I comment out any one of the barriers in my main loop, the code runs to completion (though not necessarily correctly). I'm working on a repro to try to isolate the cause.

            • Weird bug causes implosion of entire universe
              MicahVillmow
              rick.weber,
              Without the code I can't say for sure, but this error means that your barrier is uneven. Basically, that means that not every thread in the work-group is hitting the barrier. The OpenCL spec requires that if any thread in a work-group reaches a barrier, every thread in that work-group reaches it.
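
              To make "uneven" concrete, here is a rough host-side sketch in plain C. It is not real OpenCL: `get_local_id` and `barrier` below are stand-ins that only count how many barriers each emulated thread reaches, and `uneven_kernel`, `even_kernel`, `barriers_are_even`, and the work-group size of 64 are hypothetical names and values chosen for illustration. The first body puts the barrier inside a branch that only some threads take; the second keeps the work conditional but moves the barrier out so that every thread reaches it.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins (assumptions for illustration, NOT the OpenCL runtime). */
#define WG_SIZE 64                 /* assumed work-group size */
#define CLK_LOCAL_MEM_FENCE 1      /* placeholder value, only so the calls read naturally */

static size_t current_lid;             /* thread currently being emulated */
static int barrier_hits[WG_SIZE];      /* number of barriers each thread reached */

static size_t get_local_id(unsigned int dim) { (void)dim; return current_lid; }

/* Counts the hit instead of synchronizing. */
static void barrier(int flags) {
    (void)flags;
    assert(current_lid < WG_SIZE);
    barrier_hits[current_lid]++;
}

/* Uneven: only threads with lid < limit enter the branch that holds the barrier. */
static void uneven_kernel(size_t limit) {
    size_t lid = get_local_id(0);
    if (lid < limit) {
        /* ... real work ... */
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

/* Even: the work stays conditional, but every thread reaches the barrier. */
static void even_kernel(size_t limit) {
    size_t lid = get_local_id(0);
    if (lid < limit) {
        /* ... real work ... */
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

/* Runs the body once per emulated thread; returns 1 if all threads
 * reached the same number of barriers, 0 otherwise. */
static int barriers_are_even(void (*kernel_body)(size_t), size_t limit) {
    size_t i;
    for (i = 0; i < WG_SIZE; i++)
        barrier_hits[i] = 0;
    for (current_lid = 0; current_lid < WG_SIZE; current_lid++)
        kernel_body(limit);
    for (i = 1; i < WG_SIZE; i++)
        if (barrier_hits[i] != barrier_hits[0])
            return 0;
    return 1;
}
```

              With a limit of 32, `barriers_are_even(uneven_kernel, 32)` returns 0 because threads 32 to 63 never reach the barrier, while `barriers_are_even(even_kernel, 32)` returns 1. This only illustrates the control-flow shape; on a real device the first shape is not valid OpenCL and may hang or abort, as in this thread.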
                • Weird bug causes implosion of entire universe
                  rick.weber

                  I'd buy that. The looping code is very complicated with lots of branching and such. I tried to ensure that each thread enters every loop and conditionally executes real code while hitting every barrier, but I think I might have missed a few. Thanks Micah!

                  In SDK 2.5, would it be possible to detect this problem and report it less ambiguously?

                    • Weird bug causes implosion of entire universe
                      rick.weber

                      So, this is definitely the issue. I replaced all my barriers with __BARRIER(type) where:

                      #define __BARRIER(type) \
                        printf("\t__BARRIER hit by thread %d\n", get_local_id(0));\
                        barrier(type);

                      The expected output is
                      __BARRIER hit by thread 0
                      __BARRIER hit by thread 1
                      __BARRIER hit by thread ...
                      __BARRIER hit by thread 63
                      for each barrier.
                      I mostly see this, but I also see some instances of:
                      __BARRIER hit by thread 0
                      __BARRIER hit by thread 32
                      __BARRIER hit by thread 0
                      __BARRIER hit by thread 32
                      which are incorrect.
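
                      One caveat when reading logs from a print-then-barrier macro: a macro that expands to two statements with no `do { ... } while (0)` wrapper behaves differently under an unbraced `if`. Only the print becomes conditional, and the barrier runs regardless, so the log can look uneven even where the barrier itself is not. Whether that applies to the kernel in this thread is unknown. The plain C sketch below only shows the expansion issue; `log_hit`, `fake_barrier`, `BARRIER_BARE`, `BARRIER_WRAPPED`, `run_bare`, and `run_wrapped` are hypothetical stand-ins, not OpenCL calls.

```c
#include <assert.h>

static int log_calls;      /* stands in for the printf inside the macro */
static int barrier_calls;  /* stands in for the barrier() call */

static void log_hit(void)      { log_calls++; }
static void fake_barrier(void) { barrier_calls++; }

/* Two statements, no wrapper: same shape as a "print; barrier;" macro. */
#define BARRIER_BARE() log_hit(); fake_barrier();

/* The same two statements wrapped so they act as a single statement. */
#define BARRIER_WRAPPED() do { log_hit(); fake_barrier(); } while (0)

/* Bare macro under an unbraced if: expands to
 *   if (cond) log_hit(); fake_barrier();
 * so only the log is conditional. */
static void run_bare(int cond) {
    log_calls = 0;
    barrier_calls = 0;
    if (cond) BARRIER_BARE()
}

/* Wrapped macro under an unbraced if: both calls are conditional. */
static void run_wrapped(int cond) {
    log_calls = 0;
    barrier_calls = 0;
    if (cond) BARRIER_WRAPPED();
}
```

                      After `run_bare(0)`, `log_calls` is 0 but `barrier_calls` is 1: the barrier was reached without being logged. After `run_wrapped(0)` both counters are 0, and after `run_wrapped(1)` both are 1, so the log and the barrier stay in step.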