I've had my ASUS ROG Strix Vega 64 since January 2018 and just recently (August/September 2019 maybe?) began having issues with both of my monitors blacking out. Audio seems to keep playing, but hitting Caps Lock/Num Lock on the keyboard eventually stops toggling the indicator light, and I have to hard reboot the system. The GPU fans will also occasionally spin a few hundred RPM faster than normal, which seems to coincide with a crash unless I underclock the GPU.
This used to happen on my ASUS Crosshair VI Hero & Ryzen 7 1800X system and still occurs on my new ASUS Crosshair VIII Hero & Ryzen 9 3950X system. Below is the BIOS version info I could find for both motherboards. If there's any other information that might be helpful, I'd gladly provide it.
I've also attached a screenshot of the performance tab in the driver when the GPU fan spins up.
ASUS Crosshair VI Hero
BIOS Version: 6201 x64
BIOS Date: 2018-05-29
EC1 Version: MBEC-AM4-0310
EC2 Version: RGE2-AM4-0106
LED EC Version: LED-0116
KeyBot Version: KBOT2-ROG-0122
ASUS Crosshair VIII Hero
BIOS Version: 1201 x64
BIOS Date: 2019-11-18
EC Version: MBEC-X570-0218
LED EC1 Version: AULA3-6K75-0202
ASUS ROG Strix Vega 64 (From HWiNFO64)
Video BIOS Version: 016.001.001.000
SMU Firmware Version: 5.28.20
Oh, forgot to mention that if I leave my PC off for a while and then turn it on, it doesn't seem to have any issues for quite a while. After prolonged use it will eventually black screen. However, if I reboot the machine and keep using it, the crash seems much more likely to occur again unless I leave the machine off for a while first.
It also seems to do better with the tuning underclocked and undervolted, but the driver resets my custom profile if I have to hard reset the machine. Though, I realize that both of these anecdotes aren't necessarily valid conclusions and could just be bias on my part attempting to notice a correlation. It does "seem like it helps!"
Well, I ran HWiNFO64 and logged all the sensors up until it crashed. Nothing stands out to me except the GPU current and power kinda jump around a lot? This morning GPU temp was in the mid/high 60's to low 70's with hotspot mid/high 90's. This evening, hotspot didn't even get to 90 before it crashed.
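For anyone else digging through a sensor log like this, here is a rough sketch of one way to summarize the last samples before a crash instead of eyeballing the whole file. It assumes the log is a CSV with one header row and one row per sample. The column names in the example are made up for illustration and will need to match whatever headers your own log actually has.

```python
import csv
import io
import statistics


def summarize_tail(csv_text, columns, tail_rows=60):
    """Summarize the last `tail_rows` samples of each named column.

    Returns {column: {"min", "max", "mean", "swing"}}, where "swing" is
    max - min over the tail, a crude measure of how much a reading jumps around.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    tail = rows[-tail_rows:]
    summary = {}
    for name in columns:
        values = []
        for row in tail:
            try:
                values.append(float(row[name]))
            except (KeyError, TypeError, ValueError):
                continue  # skip missing, blank, or non-numeric cells
        if values:
            summary[name] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.fmean(values),
                "swing": max(values) - min(values),
            }
    return summary


# Hypothetical column names and made-up readings, for illustration only.
sample = (
    "Time,GPU Hot Spot Temperature [C],GPU Power [W]\n"
    "0,85,200\n"
    "1,88,260\n"
    "2,89,150\n"
)
for column, stats in summarize_tail(sample, ["GPU Hot Spot Temperature [C]", "GPU Power [W]"]).items():
    print(column, stats)
```

To run it on a real log, read the file into a string first (you may need to pass an explicit `encoding` to `open`) and set `tail_rows` to however many samples cover the last minute or two before the black screen.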
I'm waiting on some Thermal Grizzly Carbonaut pads and Minus 8 pads to replace on my GPU; should have them first week in March. If there's anything to try in the meantime please let me know. Thanks!
This is a very common problem with Vega 56 and 64 GPUs. Unfortunately, neither AMD nor any manufacturer will publicly admit fault.
It's not a heat or power problem. There is no solution except to RMA the card with your manufacturer or file a support ticket with AMD.
As an update (and reply to @doubleutf), I know someone who works at the AMD driver testing place and have been receiving input from him and have just been updating this post to have a record of what's going on. They have been trying to help me, though. If nothing here works I'll probably contact ASUS next to see what they suggest. However, I've had the card over two years so I'm not sure if I'll be able to RMA anything at this point but I'd give it a shot if necessary. I want to have as much info here for others who might be in the same boat, though.
I've replaced the thermal pads with Thermal Grizzly Minus 8 pads (as per lots of posts online). 3mm pads to replace the large yellow pad attached directly to the heatsink and a 2mm pad to replace the small grey pad on the metal shim. I've also replaced the thermal grease on the GPU die and HBM2 with a Thermal Grizzly Carbonaut pad (might replace this with Kryonaut paste though; it's hard to get positioned quite right).
I reset the driver to use the automatic profile instead of my super underclocked profile and haven't had any crashes. The GPU, GPU HBM, GPU VR VDDC, and GPU VR MVDD temperatures sit between 70 and 80 C under full load, but the GPU Hot Spot temperature hovers in the 104-106 C range. No crashes so far, though I really don't like the hot spot temperature being that high.
I haven't found any information as to what "hot spot" actually means, what the expected and safe range should be, or what to do. I might try swapping to paste and see if that helps.
So, got back from PAX and had a crash again, so I underclocked the GPU to keep the hotspot under 80C. Finally got around to swapping out the Carbonaut pad for thermal paste. Running FurMark at an average of 200fps, the hotspot slowly crept up from 77C after 2 minutes to 90C after ten. Still much better than the 106C I got immediately after starting previously. I'm going to keep an eye on it for a while, but I'm optimistic that this addresses my issue.
Have you had any more crashes? I bought my Gigabyte Vega 64 refurbished a couple of years ago and have been fighting crashes since the beginning. I have never taken it apart, but I will as a last resort.
Hi guys, don't know if this will help you, but I have an MSI Vega 56 OC and have had the same problems you guys have. What I did was lower my memory from 800 MHz to 750 MHz and its voltage to 900 mV, turn the GPU frequency down from 1649 MHz to 1560 MHz and its voltage down to 1100 mV, set the power limit to -10, and change the fan curve. No more crashes for me, apart from with driver 20.11.2 and the newest 20.12.1; other drivers have been fine for me. Hope it helps :)
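To put those clock changes in perspective, here is a quick sketch that works out how large each reduction is as a percentage. It only uses the before/after clock numbers from the post above. The helper name `percent_change` is just for illustration, and the voltages are left out because the post doesn't give the original values.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage (negative = reduction)."""
    return (new - old) / old * 100.0


# Before/after clocks quoted in the post, in MHz.
settings = {
    "memory clock": (800, 750),
    "core clock": (1649, 1560),
}

for name, (old, new) in settings.items():
    change = percent_change(old, new)
    print(f"{name}: {old} MHz -> {new} MHz ({change:+.2f}%)")
```

Both reductions come out at roughly 5 to 6 percent, so this is a fairly mild underclock rather than a drastic one.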