I am looking for a solution for a datacenter with AMD GPUs that can detect when cards have frozen or crashed and reset them.
Sometimes the card seems to still be functioning but produces no display output, and sometimes it is completely dead, so I would like to be able to detect both cases.
Is there something similar to NVIDIA's DCGM?
Or is there a good way to write this manually, for both Linux and Windows?