First things first: I know this used to be a common problem. I spent a ton of time researching different cases, but mine seems different.
I got a Limited Edition Vega 64 card with a used PC I bought several months ago. It behaved really well at the beginning, but then the system would occasionally get stuck at 100% GPU load when playing Witcher 3. Exiting the game or restarting the drivers wouldn't help. Even if nothing else is running and Task Manager shows 0% GPU usage, the card will keep running at 100% and will not stop until the PC is restarted. It also develops color artifacts on the screen when this happens. Then the problem went away for a little while, until it resurfaced a few weeks ago.
For the last 3 weeks or so the card has been extremely unstable. It would hit 100% and develop artifacts almost every time I ran CSGO. Then it started crashing and resetting the WattMan settings. Now, if I see artifacts, it means the card will probably reset itself within a few seconds, and both monitors will turn off for a couple of seconds before turning back on.
The biggest problem for me is that it recently started crashing not only in games but also in Solidworks and Keyshot, which I have to use for work, and that is completely unacceptable. Sometimes it suddenly locks to 100% and starts the artifact disco when the only open applications are Chrome or Netflix.
I tried different WattMan settings, including increasing the power limit and enabling power save. I tried every driver version between August and December. It is not a mining trojan, and it does not seem to be a ReLive issue like in some cases I've seen online. The PSU is pretty new and I am definitely using two cables to power the GPU, so I would rule it out too.
I am trying to figure out if there is anything I can do aside from spending a bunch more money on a new, inferior GPU (I got a really good deal on this PC as a whole, but I would definitely not be able to buy another high-end card). The card seems to be deteriorating, but what could the issue be? Failing HBM? Why does it still lock to 100%? Can the card be repaired? Is it worth anything as is?
nothing else is running, task manager shows 0% GPU usage,
You could try flashing a new BIOS. You can contact the card's manufacturer to see if they'll give you one. Or you can try to find a newer one matching your hardware here.
Apart from that, to avoid having to restart the PC, you might try resetting the driver only using the devcon utility. First type this, to make sure your card is seen:
devcon find *DEV_687F*
Assuming it is, you can create a batch file to restart the driver, like this:
devcon disable *DEV_687F*
timeout /t 3 /nobreak
devcon enable *DEV_687F*
That will disable the display driver, wait three seconds, then re-enable the driver. That may or may not reset the problem state of your card.
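If you want the batch file to tell you whether each step worked, here is a slightly longer sketch of the same idea. It assumes devcon.exe is on your PATH and that you run the file from an elevated (administrator) command prompt. It relies on DevCon's documented exit codes (0 = success, 1 = restart required, 2 = failure, 3 = syntax error). The PATTERN variable and the messages are just illustrative names, not anything DevCon requires.

```batch
@echo off
rem Sketch only: assumes devcon.exe is on PATH and an elevated prompt.
rem PATTERN is an illustrative variable name; *DEV_687F* is the pattern used above.
set "PATTERN=*DEV_687F*"

rem Disable the matching device(s). Exit codes 2 and 3 mean failure or syntax error.
devcon disable "%PATTERN%"
if errorlevel 2 (
    echo devcon disable did not succeed.
    exit /b 1
)

rem Give the device a few seconds before bringing it back.
timeout /t 3 /nobreak >nul

rem Re-enable the same device(s).
devcon enable "%PATTERN%"
if errorlevel 2 (
    echo devcon enable did not succeed.
    exit /b 1
)

echo Device re-enabled.
```

Expect the screens to go black for a few seconds while the device is disabled, so save your work before running it.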