pwvdendr
Adept II

Driver crashes -- how to avoid?

I'm doing OpenCL computations and I often stumble upon the following behaviour: after running for a few seconds, especially when doing long computations per kernel, my Catalyst Control Center reports that "display driver stopped responding and has recovered". How can I avoid this? Perhaps relevant to mention that my screen usually freezes during the run (can't even chat/browse/type), but I'm not sure if much can be done about this (or if it is relevant at all).

0 Likes
15 Replies
nathan1986
Adept II

This problem is often caused by out-of-bounds device memory access or by an incorrect barrier (if some threads can never reach the barrier, the kernel waits forever). One debugging method is to comment out parts of the code in turn to find the exact part that causes the crash. (Sometimes the OpenCL compiler optimizes the code differently once parts are commented out, so you may need to pass -cl-opt-disable in the build options.)

BTW: increasing the display driver timeout in the registry lets Windows break out of a dead kernel, which avoids having to reboot the machine.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"DxgKrnlVersion"=dword:00002005
"TdrDelay"=dword:00000040

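A note on reading the values above: `.reg` files write DWORDs in hexadecimal, and `TdrDelay` is measured in seconds, so `dword:00000040` asks for a 64-second timeout rather than 40. A minimal sketch of the conversion follows; `reg_dword_to_int` is a hypothetical helper name used only for illustration, not part of any Windows API.

```python
def reg_dword_to_int(value: str) -> int:
    """Convert a .reg-style 'dword:XXXXXXXX' string to an integer.

    Hypothetical helper: it only parses the text form used in .reg files
    and does not touch the Windows registry.
    """
    prefix = "dword:"
    if not value.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {value!r}")
    # The digits after 'dword:' are hexadecimal.
    return int(value[len(prefix):], 16)


tdr_delay_seconds = reg_dword_to_int("dword:00000040")
print(tdr_delay_seconds)  # timeout in seconds
```

With this setting a kernel gets roughly a minute before the driver is reset, during which the screen stays frozen.
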
0 Likes

nathan1986 wrote:

This problem is often caused by out-of-bounds device memory access or by an incorrect barrier (if some threads can never reach the barrier, the kernel waits forever). One debugging method is to comment out parts of the code in turn to find the exact part that causes the crash. (Sometimes the OpenCL compiler optimizes the code differently once parts are commented out, so you may need to pass -cl-opt-disable in the build options.)

This doesn't seem like a memory access problem. To me it sounds like a problem with the watchdog timer, especially since the CCC reports that the driver recovered. As far as I know, you can disable the timer so that it doesn't time out, but this is dangerous unless you know your program works, and it still won't make the computer usable while the program is running. The best solution I know of is to make sure each kernel runs for less time (1-2 s at most, preferably less) and to use more kernel enqueues instead. This way, control is handed back to the CPU more often, so the timer doesn't kick in. It still doesn't completely fix your freezing issue; it just makes it workable.
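The "more kernel enqueues" idea can be sketched on the host side as a plan of (offset, size) sub-ranges, each launched as its own kernel. This is only an illustration under assumptions: `plan_chunks` is a hypothetical helper, the work is assumed to be a 1-D range of independent items, and the chunk size is a placeholder that would have to be tuned per device. In OpenCL 1.1 and later, the offset can be passed as the `global_work_offset` argument of `clEnqueueNDRangeKernel`.

```python
def plan_chunks(total_items, chunk_items):
    """Split a 1-D range of work items into (offset, size) chunks.

    Each chunk is meant to become one short kernel launch instead of
    a single long-running launch over the whole range.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if chunk_items <= 0:
        raise ValueError("chunk_items must be positive")
    chunks = []
    offset = 0
    while offset < total_items:
        # The last chunk may be smaller than chunk_items.
        size = min(chunk_items, total_items - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


for offset, size in plan_chunks(1_000_000, 262_144):
    # A real host program would enqueue one kernel launch here, using
    # `offset` as the global work offset and `size` as the global work
    # size, and wait for it to finish before enqueuing the next one.
    print(offset, size)
```

If an explicit work-group size is used, the last chunk may need to be padded to a multiple of it, with a bounds check inside the kernel.
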

This doesn't seem like a bug to me: it happens in many different scripts, even the standard samples, if I greatly increase the input parameters.

Limiting the kernel time is indeed a workaround, but it is not a good solution, since there is often serious overhead per kernel, for example the need to initialize a random generator in every kernel when doing Monte Carlo simulations. If that takes 0.1 s, limiting the kernel to 1 s means about 10% of the run time is overhead.

Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.
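Since the kernel itself cannot check the clock, one possible way around the portability concern is to time each launch on the host and adjust the next chunk size towards a target duration. The sketch below is an assumption-laden illustration, not an established recipe: `next_chunk_size`, `run_adaptive`, and the `run_chunk` callback are hypothetical names, `run_chunk` is assumed to block until its kernel has finished, and the 0.25 s target and the 2x growth cap are arbitrary choices.

```python
import time


def next_chunk_size(current, elapsed_s, target_s, min_items=1, max_items=1 << 30):
    """Scale the chunk size so the next launch takes roughly target_s."""
    grow_limit = current * 2  # never more than double between launches
    if elapsed_s <= 0:
        scaled = grow_limit
    else:
        scaled = min(int(current * target_s / elapsed_s), grow_limit)
    return max(min_items, min(scaled, max_items))


def run_adaptive(total_items, run_chunk, target_s=0.25, first_chunk=1024):
    """Process total_items in chunks, resizing after each timed launch.

    run_chunk(offset, size) is a hypothetical callback that enqueues one
    kernel launch and returns only once that launch has completed.
    """
    offset = 0
    chunk = first_chunk
    launches = 0
    while offset < total_items:
        size = min(chunk, total_items - offset)
        start = time.perf_counter()
        run_chunk(offset, size)
        elapsed = time.perf_counter() - start
        offset += size
        launches += 1
        chunk = next_chunk_size(size, elapsed, target_s)
    return launches
```

Starting from a deliberately small first chunk keeps the first launch short even on a slow device, at the cost of a few extra launches while the size ramps up.
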

0 Likes

pwvdendr wrote:

Limiting the kernel time is indeed a workaround, but it is not a good solution, since there is often serious overhead per kernel, for example the need to initialize a random generator in every kernel when doing Monte Carlo simulations. If that takes 0.1 s, limiting the kernel to 1 s means about 10% of the run time is overhead.

Moreover, when trying to write portable code, 1s is not a valid criterion. Speed differs drastically between computers and there is no timer function in the OpenCL kernel to control this.

Absolutely. This is by no means how it should be fixed, but right now, given how control is handed over to the GPU, workarounds like this are what we have to use to keep the computer usable during computation.

0 Likes

pwvdendr,

There is not much you can do if your compute device is also the display device. Graphics cards are currently not preemptive, so if the display device and the compute device are the same, Windows will reset the card when it fails to respond within N seconds so that the screen can be updated. The only viable solution for a product is to make sure that your kernel finishes within N seconds (I believe N is 2 on standard Windows installs). If your display is using a different device, or is not hardware accelerated, I do not believe this will be an issue.

0 Likes

MicahVillmow wrote:

If your display is using a different device, or is not hardware accelerated, I do not believe this will be an issue.

So that's probably the way to go. I tried plugging my screen cable into the motherboard port, but that resulted in no signal when booting (and hence a black screen). I assume I'll have to look for an extra card, then.

0 Likes

No signal: try enabling your onboard video in the BIOS.

0 Likes
laobrasuca
Journeyman III

It seems that the new architecture (GCN on the HD 7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

Otherwise, you can also do something like this http://msdn.microsoft.com/en-us/windows/hardware/gg487368 so that, instead of reducing kernel time, you make Windows a bit more patient regarding timeout detection (you may need to restart the computer). Of course, your screen will be frozen during kernel execution, but at least Windows will not kill your program.

lao

0 Likes

laobrasuca wrote:

It seems that the new architecture (GCN on the HD 7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

Oh really? That's interesting, since I was planning to buy a 7970 or 7990. Can you point me to a reference with more information?

0 Likes

There is this talk, http://vimeo.com/32966732, by Eric Demers introducing GCN (from AMD Fusion Summit 2011). At the beginning he gives a brief survey of the ATI/AMD architecture evolution over the years (alongside the evolution of games). From the 26th minute on, he talks about GCN itself. At minute 33 he talks about the graphics card handling several applications, the GUI, and big jobs, using priority queues and so on in order to guarantee good quality of service. That is where he mentions "there's no more skipping mouse when you do a big job, because the big job is running on a separate queue". Whether this depends on a new API or is all handled transparently by the hardware, I don't know.

0 Likes

laobrasuca wrote:

It seems that the new architecture (GCN on the HD 7900 series) makes this possible (I mean, your screen will not freeze while you use GPGPU).

This is what I thought too, but it doesn't seem to be happening.

0 Likes

I was guessing/hoping that I'm still seeing this because of immature drivers.

0 Likes

This feature is not enabled in our drivers yet; it will be enabled in a future Catalyst version.

0 Likes

Aha, that explains why I find so little documentation. Do you mean "a future" as in "probably the next, in a month or less" or rather as in "somewhere in the distant future, no idea about timing yet"?

0 Likes

Obviously this was a goal of the hardware design, but it's still good to hear it will eventually be making it into the drivers.

0 Likes