
fesc2000
Adept I

Display driver hangs

Hi,

My OpenCL application regularly causes the display driver to hang and restart (Windows 7 64-bit, Catalyst 11.11).

I'm wondering whether anyone has had the same experience, and what to do about it (or how to debug it). Maybe there are errors or constructs that are known to cause a driver crash (although I would always consider such behaviour a driver bug).

The same application runs fine under Linux.

Thanks..

 

0 Likes
7 Replies
antzrhere
Adept III

Are you certain it's not a case of the display timing out? On Windows, by default, the display adapter is reset if a kernel runs for more than a few seconds (the default TDR delay is 2 seconds). It may just behave differently on Linux?

0 Likes

That might be the case; Windows just tells me the driver stopped responding.

On the other hand, my kernels shouldn't take that long, there are no loops etc.

Is there a way to increase that timeout value?

0 Likes

To disable the timeout, set the registry value

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrLevel

to 0 for no timeout, or to 3 to restore the default behaviour (this is quoted for Vista from the AMD SDK 2.5 release notes).

The fact that you have no loops or barriers suggests this is not your problem. Does the screen black out for a second? If so, that would suggest a timeout problem, as the device is being reset. If not, it may be a fatal error in the compiler. I've had this problem once with a kernel executed on the CPU: the program window went white while compiling and the program crashed, but it worked fine on the GPU. I guess it was a bug in the AMD OpenCL compiler, as everything up to the program build executed correctly. Have you tried your code on the CPU?

0 Likes

When the error happens, the application and the desktop (except the mouse) freeze, then get restarted after a few seconds.

Setting TdrLevel to 0 results in a complete freeze.

Using the CPU wouldn't work, because the application gets too slow. The error happens after the application has been running for a while.

Maybe I'm writing outside a buffer. I thought I'd taken care of this, but I'll take a second look. Is there defined behaviour when this happens, or could it cause a stalled kernel/GPU?

0 Likes

It sounds like your kernel is getting stuck, which would explain why the system freezes completely when you disable the timeout in Windows.

I've never found that writing out of bounds can cause a kernel to hang (without loops); however, if you accidentally spill into some other part of memory that is in use, the results are undefined. As the GPU is a bit of a black box, I suppose anything is possible.

Apart from loops and thread synchronisation barriers, I can't think of what else could cause a hang.

Could you post your kernel and any associated code? 

0 Likes

It was indeed an out-of-bounds write. I added a boundary check to one kernel and it runs stably now ...
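
For readers hitting the same problem, here is a minimal sketch of what such a boundary check can look like. The actual kernel was never posted, so everything here is hypothetical: the names scale_item, round_up and launch_scale are made up, and the sketch assumes the stray write came from a global work size that was rounded up past the end of the buffer. It is plain C that emulates the launch on the host; in OpenCL C the per-item function would be a __kernel and gid would come from get_global_id(0).

```c
#include <stddef.h>

/* Body of one work-item (hypothetical example kernel).
   In OpenCL C: a __kernel function with gid = get_global_id(0). */
void scale_item(size_t gid, const float *in, float *out, size_t n, float factor)
{
    if (gid >= n) {
        return; /* padded work-item: outside the buffer, so do nothing */
    }
    out[gid] = in[gid] * factor;
}

/* Round n up to a multiple of the work-group size, as hosts often do
   when choosing the global work size. */
size_t round_up(size_t n, size_t group)
{
    return ((n + group - 1) / group) * group;
}

/* Host-side stand-in for an NDRange launch: runs every work-item in turn,
   including the padded ones beyond n. */
void launch_scale(const float *in, float *out, size_t n, size_t group, float factor)
{
    size_t global = round_up(n, group);
    for (size_t gid = 0; gid < global; ++gid) {
        scale_item(gid, in, out, n, factor);
    }
}
```

Without the gid >= n guard, the padded work-items would read and write past the end of both buffers, which is undefined behaviour on the device.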

Valgrind for OpenCL would be nice 🙂

Thanks for the help!

0 Likes

fesc2000,
Most likely this is the watchdog timer causing a reset of your GPU on Windows. Since a GPU is not a pre-emptible device, Windows just resets it if a thread uses all of the resources of the graphics card for longer than a set period of time.
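
When a kernel really is slow enough to trip the watchdog (which turned out not to be the cause in this thread), one common workaround is to split the work into several shorter launches. The following is only a sketch of that idea in plain C; run_in_chunks, launch_fn and count_items are hypothetical names, not part of any API. In a real host program the callback would call clEnqueueNDRangeKernel with the offset as the global work offset and the count as the global work size, then wait for that launch to finish.

```c
#include <stddef.h>

/* Callback that performs one launch covering work-items
   [offset, offset + count). Hypothetical signature. */
typedef void (*launch_fn)(size_t offset, size_t count, void *user);

/* Split `total` work-items into launches of at most `chunk` items each,
   so that no single launch occupies the GPU for too long.
   Returns the number of launches issued. */
size_t run_in_chunks(size_t total, size_t chunk, launch_fn launch, void *user)
{
    size_t launches = 0;
    if (chunk == 0) {
        return 0; /* nothing sensible to do with an empty chunk size */
    }
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t remaining = total - offset;
        size_t count = remaining < chunk ? remaining : chunk;
        launch(offset, count, user);
        ++launches;
    }
    return launches;
}

/* Example callback: only tallies how many work-items it was asked to cover. */
void count_items(size_t offset, size_t count, void *user)
{
    (void)offset;
    *(size_t *)user += count;
}
```

The chunk size has to be tuned per device and kernel; the sketch only shows how the range is divided.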
0 Likes