I am having serious problems with some of my GPU kernels (Debian Linux, x86, Catalyst 12.4, AMD Radeon HD5870). My program occasionally hangs while running. It becomes a zombie process, X cannot be restarted, the fglrx module cannot be unloaded, and the system needs a reboot. dmesg shows some "ASIC Hang" errors.
My first thought was that this is a hardware fault. So far I have ruled out the following possible causes:
* Bad PSU
* Bad GPU (replaced it with another 5870, still have the same problem)
What the affected kernels have in common is that they are extremely ALU-intensive, contain a loop of several thousand iterations, and have relatively high GPR usage (between 40 and 80 GPRs). Decreasing the NDRange reduces the problem, and with a small enough global work size it no longer occurs. However, this is not acceptable, because occupancy suffers badly and overall performance drops by more than 50%.
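One alternative to shrinking the NDRange that I have not tried yet would be to keep the full global work size and split the long loop across several shorter kernel launches. Below is a minimal sketch of the host-side bookkeeping only, assuming the kernel could be changed to take a begin/end iteration pair and to keep its intermediate state in a global buffer between launches. The names `iter_range` and `split_iterations` are hypothetical and not part of any SDK.

```c
#include <stddef.h>

/* Hypothetical helper type: one launch processes iterations [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} iter_range;

/*
 * Split [0, total_iters) into consecutive ranges of at most
 * iters_per_launch iterations each. Writes at most max_out ranges
 * into out[] and returns how many were written (0 on invalid input).
 */
size_t split_iterations(size_t total_iters, size_t iters_per_launch,
                        iter_range *out, size_t max_out)
{
    size_t n = 0;
    size_t begin = 0;

    if (iters_per_launch == 0 || out == NULL)
        return 0;

    while (begin < total_iters && n < max_out) {
        /* Clamp the last range so it does not run past total_iters. */
        size_t len = total_iters - begin;
        if (len > iters_per_launch)
            len = iters_per_launch;

        out[n].begin = begin;
        out[n].end = begin + len;
        n++;
        begin += len;
    }
    return n;
}

/*
 * Intended use (assumption, not tested against my kernels): for each
 * range, pass begin/end to the kernel with clSetKernelArg, enqueue it
 * with clEnqueueNDRangeKernel using the full global work size, and
 * call clFinish before the next launch so that no single launch runs
 * for too long.
 */
```

Whether this actually avoids the hang is an open question. It only helps if the hang is related to how long a single launch runs, and the extra launches and global memory traffic would cost some performance.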
What is more interesting is that I used to run the same code on the same machine with an older driver version (Catalyst 11.8, I believe) and did not have these issues.
By the way, is there a way to control the driver's watchdog timeout? If there is, I believe that simply increasing it would probably solve these issues.