
gat3way
Journeyman III

ASIC Hang issues

Hello,

I am having serious problems with some of my kernels (Debian Linux, x86, Catalyst 12.4, AMD Radeon HD5870). My program occasionally hangs while running: it becomes a zombie process, X cannot be restarted, the fglrx module cannot be unloaded, and the system needs a reboot. dmesg shows "ASIC Hang" errors.

My first thought was a hardware fault. So far I have ruled out the following possible causes:

* Overheating

* Bad PSU

* Bad GPU (replaced it with another 5870, still have the same problem)

The affected kernels have several things in common: they are extremely ALU-intensive, contain a loop of several thousand iterations, and have relatively high GPR usage (between 40 and 80 GPRs). Decreasing the NDRange makes the problem less frequent, and with a low enough global work size it no longer appears. However, this is not acceptable, as occupancy suffers a lot and overall performance drops by more than 50%.
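For illustration, here is a minimal sketch of one possible workaround, assuming the hang is related to how long a single launch runs (which is not confirmed anywhere in this thread): keep the total work the same but split the 1-D NDRange into several shorter launches. The helper name `plan_chunks` and its parameters are hypothetical; each `(offset, size)` pair would be passed as the `global_work_offset` and `global_work_size` arguments of `clEnqueueNDRangeKernel`. Chunks that are too small would still hurt occupancy, so the chunk size needs tuning.

```python
def plan_chunks(global_size, local_size, max_chunk):
    """Split a 1-D NDRange into (offset, size) launches.

    Hypothetical helper, not part of any OpenCL binding. Every chunk is a
    multiple of local_size, so each launch contains only whole work-groups.
    """
    if global_size <= 0 or local_size <= 0:
        raise ValueError("sizes must be positive")
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")

    # Round the requested chunk down to a whole number of work-groups,
    # but never below a single work-group.
    chunk = max(local_size, (max_chunk // local_size) * local_size)

    chunks = []
    offset = 0
    while offset < global_size:
        size = min(chunk, global_size - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    # Each pair would drive one kernel launch (offset, global size).
    for offset, size in plan_chunks(640, 64, 256):
        print(offset, size)
```

With a non-zero global offset, `get_global_id(0)` inside the kernel already includes the offset (OpenCL 1.1 and later), so the kernel source would not need to change for this scheme.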

What is more interesting is that I used to run the same code on the same machine with an older driver version (Catalyst 11.8, I believe) and did not have these issues.

By the way, is there some way to control the driver watchdog timeout? If there were, I believe simply increasing it would probably solve these issues.

4 Replies
rihont
Journeyman III

Does anyone know?

arsenm
Adept III

This is what happens when your kernel has some error, such as writing out of bounds in local memory or calling a barrier inside a conditional that not all work-items take. On Linux, X takes all the CPU and you must reboot to fix it. On Windows, the driver resets.

0 Likes

Yes, that would be expected; however, I doubt that is my case. None of the kernels use barriers, and some of them do not even use local memory. Besides, the crashes are not immediate; they occur after some time (say 20-30 minutes, but that varies). I cannot reproduce them reliably: sometimes they occur and sometimes they do not, even when working on the same dataset.

0 Likes

Try running on different devices, perhaps on the CPU for testing. If there is a bug such as going out of bounds, it might cause the program to compute different results on each run, so check your results.
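As a rough sketch of that check, assuming the results from two runs (or from a CPU run and a GPU run) have already been copied back to the host as flat sequences of numbers: the function name `first_mismatches` and the tolerance values are hypothetical and would need adjusting to the kernel's expected precision.

```python
import math


def first_mismatches(reference, candidate, rel_tol=1e-6, abs_tol=0.0, limit=5):
    """Return up to `limit` (index, reference, candidate) tuples that differ.

    Hypothetical helper: `reference` could come from a CPU device or an
    earlier run, `candidate` from the GPU run being checked.
    """
    if len(reference) != len(candidate):
        raise ValueError("result buffers have different lengths")

    mismatches = []
    for i, (r, c) in enumerate(zip(reference, candidate)):
        # Floating-point results may differ slightly between devices,
        # so compare with a tolerance instead of exact equality.
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, r, c))
            if len(mismatches) >= limit:
                break
    return mismatches


if __name__ == "__main__":
    cpu = [1.0, 2.0, 3.0, 4.0]
    gpu = [1.0, 2.5, 3.0, 4.0]
    print(first_mismatches(cpu, gpu))
```

If the mismatching indices change from run to run on the same input, that would point toward a race or an out-of-bounds access rather than a deterministic logic error.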

I am not sure that the watchdog timeout is the issue. I have run kernels that take over 600 seconds on Linux. It just works for me...
