15 Replies Latest reply on Feb 6, 2012 4:02 PM by notzed

    Driver crashes -- how to avoid?

    pwvdendr

      I'm doing OpenCL computations and I often stumble upon the following behaviour: after running for a few seconds, especially when doing long computations per kernel, my Catalyst Control Center reports that "display driver stopped responding and has recovered". How can I avoid this? Perhaps relevant to mention that my screen usually freezes during the run (can't even chat/browse/type), but I'm not sure if much can be done about this (or if it is relevant at all).

        • Re: Driver crashes -- how to avoid?
          nathan1986

          This problem is often caused by an out-of-bounds device memory access or an incorrect barrier (if some threads never reach the barrier, the kernel waits forever). One way to debug it is to comment out parts of the code in turn until you find the exact part that causes the crash. (Sometimes the OpenCL compiler optimizes code away once parts are commented out, so you may need to pass -cl-opt-disable in the build options.)

          BTW: setting the display timeout in the registry lets Windows break out of a dead kernel, so you can avoid rebooting the machine.

           

          [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]

          "DxgKrnlVersion"=dword:00002005

          "TdrDelay"=dword:00000040

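Note that dword values in a .reg export are hexadecimal, so the TdrDelay above is not 40. A minimal Python sketch of the conversion, assuming TdrDelay is interpreted in seconds as Microsoft's TDR documentation describes (the helper name reg_dword_to_int is made up for illustration):

```python
def reg_dword_to_int(value: str) -> int:
    """Convert a .reg-style 'dword:XXXXXXXX' string to an integer."""
    prefix = "dword:"
    if not value.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {value!r}")
    # The digits after the prefix are hexadecimal.
    return int(value[len(prefix):], 16)


tdr_delay_seconds = reg_dword_to_int("dword:00000040")
print(tdr_delay_seconds)  # 0x40 == 64
```

So the setting above would give a kernel about 64 seconds before the driver is reset.
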
            • Re: Driver crashes -- how to avoid?
              notyou

              nathan1986 wrote:

               

              This problem is often caused by device memory out-of-border or incorrect barrier(some threads can't go to that place, the kernel will wait forever). A debugging method is to comment part of the code  by turns to see the exact part which caused this crash.(sometime the OpenCL compiler optimizes the code because commenting, that using -cl-opt-disable in the build option is needed.)

              This doesn't seem like a memory access problem. To me it sounds like a problem with the watchdog timer, especially since the CCC reports that the driver recovered. As far as I know, you can disable the timer so that it doesn't time out, but this is dangerous unless you know your program works, and it still won't make the computer usable while the program is running. The best solution I know of is to make sure each kernel runs for less time (1-2s at most, preferably less) and to use more kernel enqueues instead. This way, control is handed back to the CPU more often, so the timer doesn't kick in. It still doesn't completely fix your freezing issue; it just makes it workable.
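A rough sketch of the "more kernel enqueues" idea on the host side, in Python for brevity. It only plans the ranges; plan_chunks is a made-up helper, and the mapping to clEnqueueNDRangeKernel in the comment is an assumption about how the host code would be structured (global_work_offset requires OpenCL 1.1 or later):

```python
def plan_chunks(total_items: int, chunk_items: int):
    """Split [0, total_items) into (offset, size) pairs, one per enqueue."""
    if total_items < 0 or chunk_items <= 0:
        raise ValueError("total_items must be >= 0 and chunk_items > 0")
    return [
        (offset, min(chunk_items, total_items - offset))
        for offset in range(0, total_items, chunk_items)
    ]


for offset, size in plan_chunks(10, 4):
    # In real host code each pair would become one kernel enqueue, e.g.
    # clEnqueueNDRangeKernel with global_work_offset = offset and
    # global_work_size = size, followed by a wait before the next one.
    print(offset, size)
```

The chunk size then has to be picked so that one chunk stays well under the watchdog limit on the slowest card you care about.
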

                • Re: Driver crashes -- how to avoid?
                  pwvdendr

                  This doesn't seem like a bug to me: it happens in many different scripts, even the standard samples if I greatly increase the input parameters.

                  Limiting the kernel time is indeed a workaround, but it is not a good solution, since there is often serious overhead per kernel. For example, Monte Carlo simulations need to initialize a random generator in every kernel. If that takes 0.1s, limiting the kernel to 1s means roughly 10% of the run time goes to initialization.

                  Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.
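One possible host-side way around the missing kernel timer is to time each enqueue and rescale the next chunk accordingly. This is only a sketch under assumptions: run_chunk is a hypothetical callback that enqueues the kernel over a range and blocks until it finishes, and the 0.5s target and 1024-item first chunk are arbitrary values, not recommendations.

```python
import time


def next_chunk_size(current, elapsed, target, min_size=1, max_size=1 << 30):
    """Rescale the chunk so the next launch takes roughly `target` seconds."""
    if elapsed <= 0:
        # Too fast to measure: just double the chunk.
        return min(current * 2, max_size)
    scaled = int(current * target / elapsed)
    return max(min_size, min(scaled, max_size))


def run_adaptive(total_items, run_chunk, target=0.5, first_chunk=1024):
    """Process total_items in chunks, adapting chunk size to measured time.

    run_chunk(offset, size) is hypothetical: it should enqueue the kernel
    over [offset, offset + size) and return once that work is done.
    """
    offset, chunk, launches = 0, first_chunk, 0
    while offset < total_items:
        size = min(chunk, total_items - offset)
        start = time.perf_counter()
        run_chunk(offset, size)
        elapsed = time.perf_counter() - start
        offset += size
        launches += 1
        chunk = next_chunk_size(size, elapsed, target)
    return launches
```

The first chunk still has to be small enough to be safe on a slow device, and this does nothing about the per-kernel initialization cost mentioned above; it only removes the need to hard-code a chunk size per machine.
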

                    • Re: Driver crashes -- how to avoid?
                      notyou

                      pwvdendr wrote:

                       

                      Limiting the kernel time is indeed a workaround, but this is not a good solution, since there is often a serious overhead per kernel. For example, the need to initialize a random generator in every kernel when doing Monte Carlo simulations. If that takes 0.1s, limiting the kernel to 1s gives a serious overhead.

                      Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.

                      Absolutely. This is by no means how it should be solved, but right now, to keep the computer usable during computation, we have to rely on workarounds like this because of how control is handed back.

                      • Re: Driver crashes -- how to avoid?
                        MicahVillmow

                        pwvdendr,

                        There is not much you can do if your compute device is the display device. Graphics cards are currently not preemptive, so if your display device and compute device are the same, Windows will reset the card if it has not responded within N seconds, in order to update the screen. The only viable solution for a product is to make sure that your kernel finishes within N seconds (I believe N is 2 on standard Windows installs). If your display is using a different device, or is not hardware accelerated, I do not believe this will be an issue.

                  • Re: Driver crashes -- how to avoid?
                    laobrasuca

                    It seems that the new architecture (GCN on the HD 7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

                     

                    Otherwise, you can also do something like this http://msdn.microsoft.com/en-us/windows/hardware/gg487368 so that instead of reducing kernel time you make Windows a bit more patient regarding timeout detection (you may need to restart the computer). Of course, your screen will be frozen during kernel execution, but at least it will not kill your program.

                     

                    lao