17 Replies Latest reply on Nov 5, 2010 7:18 AM by eklund.n

    Weird lockups when using GPU device on linux

    gat3way

      Hello all,

      I experience some very weird hard lockups when running my OpenCL kernel on the GPU. On the CPU it runs correctly and gives the correct results. On the GPU it sometimes runs correctly, while at other times it hangs the system completely, to the point that it requires a reboot. The strange thing is that this apparently depends on the global_work_size: sometimes the program does not crash with a larger global work size, yet hangs with a smaller one.

      Overall, the kernel does one read from and five writes to global memory. The rest consists of arithmetic/bitwise operations on uint4 variables declared inside the kernel (I believe those are __private ones, not __local memory).

      Since there is no SKA for linux and my debugging abilities are limited, I tried commenting out code to find out what exactly causes the problem. However, the behavior is rather erratic and depends on the global work size. Basically, the hangs occur once a certain number of bitwise/arithmetic operations are performed on local variables.

      My grid is one-dimensional and I pass NULL as the local_work_size parameter, so that OpenCL should choose the most appropriate value depending on register pressure and things like that. My theory is that, for some reason, the OpenCL implementation does not calculate the register usage properly, so the local work size it picks is wrong, and that leads to hard lockups (?!?).

      Anyway, I did solve the problem by providing a hardcoded local_work_size value (chosen so that the global_work_size is divisible by it). Of course, performance dropped by about 20-30% because of that, but that's acceptable to me.
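
      A minimal sketch of that workaround in C: it picks the largest local work size that divides the global work size evenly, which OpenCL 1.x requires when an explicit local_work_size is passed. The names `pick_local_size` and `round_up_global` and the `max_local` bound are illustrative assumptions, not taken from the poster's code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the largest work-group size that is <= max_local and divides
 * global_size evenly. Falls back to 1, which always divides.
 * max_local is an assumed upper bound chosen by the caller; it is not
 * a value given in the thread.
 */
size_t pick_local_size(size_t global_size, size_t max_local)
{
    assert(global_size > 0 && max_local > 0);

    size_t candidate = max_local < global_size ? max_local : global_size;
    while (global_size % candidate != 0) {
        candidate--;
    }
    return candidate;
}

/*
 * Alternative: keep a fixed local_size and pad the global size up to the
 * next multiple of it. The kernel must then ignore work-items whose
 * global id is >= the real problem size.
 */
size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);

    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}
```

      The result would be passed as the local_work_size argument of clEnqueueNDRangeKernel. One plausible choice for `max_local` is the value that clGetKernelWorkGroupInfo reports for CL_KERNEL_WORK_GROUP_SIZE, although the thread suggests that a smaller, hand-picked value may be what actually avoids the hang.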

      I am using a Radeon HD4670 and I am wondering whether this is an OpenCL-related issue or a hardware one. I will buy a 6870 card in the next 1-2 weeks and will do some testing to see whether this can be reproduced on that hardware as well.

      I can post the kernel code, it's only about 100-200 lines. However, the host code needed to properly set up all the parameters is much longer than that. I can try to write a simplified test case though.

       

        • Weird lockups when using GPU device on linux
          cjang

          Try waiting several minutes (up to ten) to see if your system becomes responsive again. I have encountered symptoms similar to those you describe. The system hangs on specific specializations of a parameterized kernel model/template. As I use auto-tuning, the application searches over thousands of different kernels and runs into this.

          In my particular case, this is a driver issue. It also happens with memory buffer based kernels. When using images, it never happens. Your speculation about register usage has some merit as array subscript arithmetic uses registers.

          If you can get back into your system, check the kernel log. You may see a message indicating that the driver hung. I'm not sure what sort of watchdog causes this to time out (I run Ubuntu 10.04 x86_64). But it does, and then I can ssh into the system and use it normally, although the X server and the GPU/driver are now in a bad state (they appear hung). This is where having a headless system may be an advantage. You may have to switch to a different virtual console and log in again in order to do anything.

          To give more background, a year ago with the now very old SDK v2.0 / Catalyst 9.12, this failure mode never happened. It started for me with SDK v2.1 / Catalyst 10.4. However, performance jumped immediately by 20% to 30%. There was a trade of some stability for higher performance.

          Another thing is that the 20-30% difference you see can likely be recovered with careful tuning. There are probably other nearby kernel specializations in the design space that reach the same peak without stepping over the limit and causing a failure. This is my experience from auto-tuning.

            • Weird lockups when using GPU device on linux
              gat3way

              Hello and thanks for your reply.

              I've tried waiting for several minutes, but the system did not recover. I was unable to ssh to the machine (I did not check whether it responds to ICMP ping, though; I will try that next time). Browsing /var/log/syslog, I see no kernel panics. I haven't checked the Xorg log for weird errors though.

              I am using Catalyst 10.9 and SDK2.2.

              It looks like the 4670 does not support images.

                • Weird lockups when using GPU device on linux
                  cjang

                  Yes, AFAIK, images require the Evergreen 5xxx architecture. They do not work on older GPUs. Another thing: you mention acquiring a 6870 card soon. You may wish to get a 5870 instead, as it supports double precision.

                  The hangs I experienced were completely deterministic and repeatable. At first, I was tuning by hand and kept notes of the kernel parameters that caused the system to hang. This was not scalable so then I had to implement a memo which kept track of all kernel parameters, good and bad, automatically. In this way, it became practical to tune kernels.
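
                  A minimal sketch of such a memo in C, assuming a plain-text log with one "global local status" record per line. The file format, the status names and the function names are all hypothetical; the actual implementation described above is not shown in this thread. The idea assumed here is to record a configuration as pending before launching it, so that a run which hangs the machine leaves a trace that can be treated as bad after the reboot.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status values for one (global, local) work-size pair. */
enum memo_status { MEMO_UNKNOWN = 0, MEMO_PENDING, MEMO_GOOD, MEMO_BAD };

/* Append one record to the log at path. Returns 0 on success, -1 on error. */
int memo_record(const char *path, unsigned long global_size,
                unsigned long local_size, enum memo_status status)
{
    assert(path != NULL);

    FILE *f = fopen(path, "a");
    if (f == NULL) {
        return -1;
    }
    int ok = fprintf(f, "%lu %lu %d\n",
                     global_size, local_size, (int)status) > 0;
    /* fclose flushes to the OS only; surviving a hard lockup may also
       need an explicit sync to disk, which is left out of this sketch. */
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Return the most recent status recorded for the pair, or MEMO_UNKNOWN. */
enum memo_status memo_lookup(const char *path, unsigned long global_size,
                             unsigned long local_size)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return MEMO_UNKNOWN;   /* no log yet: nothing is known */
    }
    enum memo_status result = MEMO_UNKNOWN;
    unsigned long g, l;
    int s;
    while (fscanf(f, "%lu %lu %d", &g, &l, &s) == 3) {
        if (g == global_size && l == local_size) {
            result = (enum memo_status)s;   /* later records win */
        }
    }
    fclose(f);
    return result;
}

/* Skip pairs marked bad, or left pending by a run that never finished. */
int memo_should_skip(const char *path, unsigned long global_size,
                     unsigned long local_size)
{
    enum memo_status s = memo_lookup(path, global_size, local_size);
    return s == MEMO_BAD || s == MEMO_PENDING;
}
```

                  A tuning loop would call `memo_should_skip` before each launch, write a MEMO_PENDING record, run the kernel, and then write MEMO_GOOD once the kernel has finished and its results have been checked.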

                  It's not an ideal situation. But my guess is that you can find working kernels with the extra 20-30% performance if you do enough tuning. The kernels are "out there". You just have to find them and then know where they are.

                    • Weird lockups when using GPU device on linux
                      gat3way

                      Double precision is good, but I don't need it as all I do is 32-bit arithmetic/bitwise ops. In fact, I don't use any floating point stuff at all.

                      I think part of the problem is that we don't have something like SKA on linux. It all boils down to trial and error, and this sucks. To all AMD people: we really need profiling tools on linux....

                      BTW another thing is that while testing my program with valgrind, I see lots of issues in the CAL/OpenCL libraries, mostly related to out-of-bounds memory reads/writes. Upgrading my libstdc++ miraculously eliminated some of them, but I still see the rest.

                • Weird lockups when using GPU device on linux
                  saleel

                  How many GPRs are you using in your kernel?