I occasionally find that a kernel will lock up the entire machine. This also happens with other vendors, but there the display adapter eventually crashes after a certain amount of time. It was my understanding that others had the inverse of this problem, where long calculations would time out. Is it possible that a fix for their problem is causing this?
The difference between the card resetting or the driver crashing on the one hand, and the display hanging on the other, is the difference between an infinite loop or long-running program and a livelock or deadlock on the card. Because a GPU is not preemptible, if your program causes the card to lock up, there is no way to reset it. While your display is no longer available, you can still ssh into the machine.
Fair enough for the display card, but this happens even if the code is not being run on the main display adapter, for example when I have two AMD cards, one driving the display and one just doing compute.
Programs timing out is not a problem at all (there are very flexible workarounds). But when the device crashes, you have to reboot the machine (I often ssh into the box and reboot it that way; it takes a few minutes to reboot because of the stuck process).
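The thread does not say which timeout workarounds are meant. One common approach, offered here only as an assumption, is to split a long computation into several short launches so that no single launch runs long enough to trip a watchdog. The sketch below shows just the host-side bookkeeping: `chunk_ranges` is a hypothetical helper, and each (offset, size) pair it yields would be handed to one kernel launch, for example as the global work offset and global work size of an OpenCL enqueue. It does not help with the hard lock-up described above.

```python
def chunk_ranges(total_items, max_chunk):
    """Yield (offset, size) pairs that cover range(total_items) in order.

    Hypothetical helper: each pair stands for one short kernel launch.
    """
    if total_items < 0 or max_chunk <= 0:
        raise ValueError("total_items must be >= 0 and max_chunk must be > 0")
    offset = 0
    while offset < total_items:
        # The last chunk may be smaller than max_chunk.
        size = min(max_chunk, total_items - offset)
        yield offset, size
        offset += size


if __name__ == "__main__":
    # Placeholder for the real launch: print what each launch would cover.
    for offset, size in chunk_ranges(10, 4):
        print("launch: offset", offset, "size", size)
```

The right chunk size depends on the device and the kernel, so it would have to be tuned by measurement.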
I guess this is why AMD products are not often used for actual GPGPU computing, because having to reboot a cluster would be undesirable.
I vote for timeout over crash. Hopefully AMD will fix this issue before they lose GPGPU computing to competitors...