I have a few computers whose GPUs I use intensely. Sometimes a card will overheat and go 'offline'. To handle this, I would like to develop a watchdog service that restarts the computer and sends me a notification. The problem is that the command 'clinfo' still shows the card as available, and the driver endpoints '/sys/class/drm/card(0-5)/device/hwmon/hwmon(0-5)/temp1_crit' and 'temp1_input' are still available. I've tried looking around those directories for output that would indicate a card is offline, without success.
Is there any way I can determine this from the driver or from the system as a whole?
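
To make concrete what is being checked, here is a minimal sketch of reading the two hwmon files mentioned above. The glob pattern and the function names (`read_millidegrees`, `card_temperatures`) are assumptions for illustration, not a tested watchdog. As described above, these files keep returning values after a card goes offline, so this by itself does not detect the failure.

```python
import glob
import os


def read_millidegrees(path):
    """Return the integer stored in a hwmon temperature file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def card_temperatures(root="/sys/class/drm"):
    """Map each hwmon directory to (temp1_input, temp1_crit), in millidegrees Celsius.

    The directory layout is assumed from the paths in the question.
    """
    readings = {}
    pattern = os.path.join(root, "card[0-9]", "device", "hwmon", "hwmon[0-9]*")
    for hwmon in sorted(glob.glob(pattern)):
        readings[hwmon] = (
            read_millidegrees(os.path.join(hwmon, "temp1_input")),
            read_millidegrees(os.path.join(hwmon, "temp1_crit")),
        )
    return readings


if __name__ == "__main__":
    for hwmon, (current, critical) in card_temperatures().items():
        print(hwmon, current, critical)
```

A value of None for either reading would mean the file was missing or unreadable, which is not what happens in the offline case described here.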