pwvdendr
Adept II

Driver crashes -- how to avoid?

I'm doing OpenCL computations and I often stumble upon the following behaviour: after running for a few seconds, especially when doing long computations per kernel, my Catalyst Control Center reports that "display driver stopped responding and has recovered". How can I avoid this? Perhaps relevant to mention that my screen usually freezes during the run (can't even chat/browse/type), but I'm not sure if much can be done about this (or if it is relevant at all).

15 Replies
nathan1986
Adept II

Re: Driver crashes -- how to avoid?

This problem is often caused by out-of-bounds device memory accesses or an incorrect barrier (if some threads never reach the barrier, the kernel waits forever). One way to debug it is to comment out parts of the code in turn to find the exact part that causes the crash. (Sometimes the OpenCL compiler optimizes more code away once parts are commented out, so you may need to pass -cl-opt-disable in the build options.)

BTW: setting the display timeout in the registry lets Windows break out of a dead kernel, so you can avoid rebooting the machine.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]

"DxgKrnlVersion"=dword:00002005

"TdrDelay"=dword:00000040

notyou
Adept III

Re: Driver crashes -- how to avoid?

nathan1986 wrote:

This problem is often caused by out-of-bounds device memory accesses or an incorrect barrier (if some threads never reach the barrier, the kernel waits forever). One way to debug it is to comment out parts of the code in turn to find the exact part that causes the crash. (Sometimes the OpenCL compiler optimizes more code away once parts are commented out, so you may need to pass -cl-opt-disable in the build options.)

This doesn't seem like a memory access problem. To me it sounds like a problem with the watchdog timer, especially since the CCC reports that the driver recovered. As far as I know, you can disable the timer so that it doesn't time out, but this is dangerous unless you know your program works, and it still won't make the computer usable while the program is running. The best solution I know of is to make sure your kernel runs for less time (1-2s at most, preferably less) and to use more kernel enqueues instead. This way, control is handed back to the CPU more often, so the timer doesn't kick in. It still doesn't completely fix your freezing issue; it just makes it workable.
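To make the "more kernel enqueues" idea concrete, here is a minimal host-side sketch in Python. It only shows the bookkeeping: `launch` is a hypothetical callback standing in for one blocking kernel launch over `count` work-items starting at `offset` (for instance via the global work offset argument of clEnqueueNDRangeKernel, followed by clFinish), and the names `chunk_ranges` and `run_in_chunks` are made up for illustration.

```python
from typing import Callable, Iterator, Tuple


def chunk_ranges(total: int, chunk: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, count) pairs covering range(total) in pieces of at most `chunk`."""
    if total < 0 or chunk <= 0:
        raise ValueError("total must be >= 0 and chunk must be > 0")
    offset = 0
    while offset < total:
        count = min(chunk, total - offset)
        yield offset, count
        offset += count


def run_in_chunks(total: int, chunk: int,
                  launch: Callable[[int, int], None]) -> int:
    """Call launch(offset, count) once per chunk and return the number of launches.

    `launch` is a hypothetical stand-in for one blocking kernel enqueue.
    """
    launches = 0
    for offset, count in chunk_ranges(total, chunk):
        launch(offset, count)
        launches += 1
    return launches


if __name__ == "__main__":
    calls = []
    n = run_in_chunks(10, 4, lambda off, cnt: calls.append((off, cnt)))
    print(n, calls)  # 3 [(0, 4), (4, 4), (8, 2)]
```

The chunk size is the knob: smaller chunks return control to the display more often, at the price of more launches.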

pwvdendr
Adept II

Re: Driver crashes -- how to avoid?

nathan1986: This doesn't seem like a bug to me; it happens in many different scripts, even in standard samples if I greatly increase the input parameters.

notyou: Limiting the kernel time is indeed a workaround, but it is not a good solution, since there is often a serious overhead per kernel. For example, Monte Carlo simulations need to initialize a random generator in every kernel. If that takes 0.1s, limiting the kernel to 1s gives a serious overhead.

Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.
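One possible way around the portability problem, sketched under assumptions: since the kernel cannot time itself, time each launch on the host and resize the next chunk so that it stays near a target duration. This assumes run time scales roughly linearly with chunk size and that `launch` (a hypothetical callback, like every other name here) blocks until the kernel has finished. It does not remove the per-launch overhead; it only keeps the launch time bounded on fast and slow machines alike.

```python
import time
from typing import Callable


def next_chunk_size(current: int, elapsed: float, target: float,
                    min_size: int = 1, max_growth: float = 2.0) -> int:
    """Scale the chunk size so the next launch takes roughly `target` seconds.

    Assumes run time is roughly proportional to chunk size.
    """
    if elapsed <= 0.0:
        # Too fast to measure: grow by the maximum allowed factor.
        return max(min_size, int(current * max_growth))
    scale = min(target / elapsed, max_growth)
    return max(min_size, int(current * scale))


def run_adaptive(total: int, first_chunk: int, target: float,
                 launch: Callable[[int, int], None],
                 clock: Callable[[], float] = time.perf_counter) -> int:
    """Process `total` items, timing each launch on the host.

    Returns the number of launches. `launch(offset, count)` is a
    hypothetical blocking kernel launch; `clock` is injectable for testing.
    """
    offset, chunk, launches = 0, first_chunk, 0
    while offset < total:
        count = min(chunk, total - offset)
        start = clock()
        launch(offset, count)
        elapsed = clock() - start
        offset += count
        launches += 1
        # Resize based on the chunk that was actually timed.
        chunk = next_chunk_size(count, elapsed, target)
    return launches
```

Starting with a deliberately small `first_chunk` keeps the very first launch safe on slow hardware; the growth cap (`max_growth`) stops one unusually fast measurement from producing a chunk that overshoots the timeout.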

notyou
Adept III

Re: Driver crashes -- how to avoid?

pwvdendr wrote:

notyou: Limiting the kernel time is indeed a workaround, but it is not a good solution, since there is often a serious overhead per kernel. For example, Monte Carlo simulations need to initialize a random generator in every kernel. If that takes 0.1s, limiting the kernel to 1s gives a serious overhead.

Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.

Absolutely. This is by no means how it should have to be fixed, but right now, given how control of the GPU is handed out, workarounds like this are what we have to use to keep the computer usable during computation.

MicahVillmow
Staff

Re: Driver crashes -- how to avoid?

pwvdendr,

There is not much you can do if your compute device is the display device. Graphics cards are currently not preemptive, so if your display device and compute device are the same, Windows will reset the card when a kernel keeps it busy for more than N seconds, so that the screen can be updated again. The only viable solution for a product is to make sure that your kernel finishes within N seconds (I believe N is 2 on standard Windows installs). If your display is using a different device, or is not hardware accelerated, I do not believe this will be an issue.

laobrasuca
Journeyman III

Re: Driver crashes -- how to avoid?

It seems that the new architecture (GCN on the HD7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

Otherwise, you can also do something like this http://msdn.microsoft.com/en-us/windows/hardware/gg487368 so that, instead of reducing kernel time, you make Windows a bit more patient regarding timeout detection (you may need to restart the computer). Of course, your screen will be frozen during kernel execution, but at least it will not kill your program.
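One detail that is easy to misread when editing these settings: `dword:` values in a .reg file are hexadecimal, so the `TdrDelay` value `dword:00000040` posted earlier in the thread is not 40 seconds. A minimal Python sketch of the conversion (the helper name `reg_dword_to_int` is made up for illustration):

```python
def reg_dword_to_int(entry: str) -> int:
    """Convert a .reg-style 'dword:XXXXXXXX' string to an integer.

    The digits after 'dword:' are hexadecimal.
    """
    prefix = "dword:"
    if not entry.lower().startswith(prefix):
        raise ValueError(f"not a dword entry: {entry!r}")
    return int(entry[len(prefix):], 16)


# TdrDelay is expressed in seconds.
tdr_delay_seconds = reg_dword_to_int("dword:00000040")
print(tdr_delay_seconds)  # 64
```

So that setting gives a kernel roughly 64 seconds before Windows resets the driver, compared with the stock default of 2 seconds.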

lao

pwvdendr
Adept II

Re: Driver crashes -- how to avoid?

laobrasuca wrote:

It seems that the new architecture (GCN on the HD7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

Oh really? That's interesting, since I was planning to buy a 7970 or 7990. Can you point me to a reference with more information?

pwvdendr
Adept II

Re: Driver crashes -- how to avoid?

MicahVillmow wrote:

If your display is using a different device, or is not hardware accelerated, I do not believe this will be an issue.

So then that's probably the way to go. I tried plugging my screen cable into the motherboard port, but that resulted in no signal when booting (and hence a black screen). I assume I'll have to look for an extra card, then.

arsenm
Adept III

Re: Driver crashes -- how to avoid?

laobrasuca wrote:

It seems that the new architecture (GCN on the HD7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

This is what I thought too, but it doesn't seem to be happening.
