I have some serious issues with some of my kernels (Debian Linux, x86, Catalyst 12.4, AMD Radeon HD5870). I occasionally get hangs while my program is running. The process becomes a zombie, X cannot be restarted, the fglrx module cannot be unloaded, and the system needs a reboot. dmesg reveals some "ASIC Hang" errors.
First thing that came to my mind was that this is a hardware error. So far I have eliminated the following possible causes:
* Bad PSU
* Bad GPU (replaced it with another 5870, still have the same problem)
What the affected kernels have in common is that they are extremely ALU-intensive, involve a loop of several thousand iterations, and have relatively high GPR usage (between 40 and 80 GPRs). Decreasing the NDRange seems to reduce the problem, and with a low enough global work size it does not manifest at all. However, this is not acceptable, as occupancy suffers a lot and overall performance drops by more than 50%.
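One experiment that follows from the observation above is to keep the total work the same but issue it as several shorter launches, each covering a slice of the NDRange. The sketch below only computes the slices. It assumes a 1D range whose size is a multiple of the work-group size, and the name `split_ndrange` is made up for illustration. Each `(offset, size)` pair would be passed as the global work offset and global work size of one `clEnqueueNDRangeKernel` call (OpenCL 1.1 or later, where `get_global_id` includes the offset). Whether this avoids the hang on this driver is untested, and a chunk that is too small would bring back the occupancy problem.

```python
def split_ndrange(global_size, chunk_size, local_size):
    """Split a 1D NDRange into (offset, size) slices.

    Hypothetical helper: every slice size is a multiple of local_size,
    so each slice can be enqueued with the same work-group size.
    """
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")

    # Round the requested chunk down to a whole number of work-groups,
    # but never below a single work-group.
    chunk = max(local_size, (chunk_size // local_size) * local_size)

    slices = []
    offset = 0
    while offset < global_size:
        size = min(chunk, global_size - offset)
        slices.append((offset, size))
        offset += size
    return slices
```

The chunk size is the knob to tune: large enough to keep all compute units busy, small enough that a single launch finishes quickly.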
What is more interesting is that I used to run the same code on the same machine with an older driver version (Catalyst 11.8, I believe) and did not have these issues.
By the way, is there some way to control that driver watchdog timeout? If there was, I believe just increasing it would probably solve those issues.
This is what happens when your kernel has an error, such as writing out of bounds of local memory or calling a barrier inside a conditional branch that not all work-items take. On Linux, X takes all the CPU and you must reboot to fix it. On Windows, the driver resets.
Yes, that would be expected, however I doubt that's my case. None of the kernels use barriers and some of them do not even use local memory. Besides, crashes are not immediate, they occur after some time (say 20-30 minutes, but that varies). I cannot reproduce them reliably, sometimes they occur, sometimes not, even when working on the same dataset.
Try running on different devices, perhaps on the CPU for testing. If there is a bug such as an out-of-bounds access, it would probably cause the program to calculate different results on each run, so check your results.
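A rough way to do that check is to read the output buffer back to the host after each run (or from each device) and compare it element by element against a reference run. The sketch below assumes the results are already available as plain lists of floats. The function name `find_mismatches` and the tolerances are illustrative choices, not anything from the original program. A small tolerance is used because CPU and GPU floating-point results can legitimately differ in the last bits.

```python
import math


def find_mismatches(reference, candidate, rel_tol=1e-5, abs_tol=1e-8, limit=10):
    """Return up to `limit` (index, reference, candidate) tuples that differ.

    Hypothetical helper: NaN never compares close to anything, so a NaN
    in either buffer is always reported as a mismatch.
    """
    if len(reference) != len(candidate):
        raise ValueError("buffers differ in length")

    mismatches = []
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, r, c))
            if len(mismatches) >= limit:
                break  # a handful of examples is enough to spot a pattern
    return mismatches
```

If repeated runs on the same dataset give an empty list every time, a data-dependent memory bug becomes less likely. If the mismatching indices cluster near buffer edges, an out-of-bounds access is a good suspect.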
I am not sure the watchdog timeout is the issue. I have run kernels that take over 600 seconds on Linux. It just works for me...