I built a PC over the summer using all new parts and have had the issue described above pretty much since the start. I have tried a few things and otherwise just put up with it, but nothing has helped. There is no regular time when it happens and it isn't limited to a single kind of use: it happens in CPU-intensive games, GPU-intensive games and even while running crypto mining software. Another potentially useful detail is that the power LED indicators on the GPU all turn off when this happens. Very rarely I get a blue screen with a THREAD_STUCK_IN_DEVICE_DRIVER error message, if that gives any indication of the issue.
My specs are:
Ryzen 5 3600 - Stock cooler
MSI Vega 64 Air Boost
Corsair Vengeance LPX 2x8GB 3000MHz C15 RAM
Intel 660p 1TB NVMe SSD - Boot Drive
MSI B450 Gaming Plus
Corsair CX750M 750W Bronze PSU - Grey label version
Some of the things I have tried in order to fix the problem or find the cause are (in no particular order, just trying to remember them all):
Updated/rolled back graphics drivers, including using DDU in safe mode
Increased/decreased the GPU power limit using WattMan and Afterburner
Updated/rolled back the BIOS (limited to the first BIOS version that supports Ryzen 3000)
Increased/decreased RAM frequency/timings
Ran MemTest86 multiple times with no errors
Ran Prime95 for multiple hours with no issues
Moved GPU to different PCIe slot
Checked temperatures during use and nothing was abnormal
Took the side panel off and turned a fan on to cool the parts more, just in case
Used the sfc and DISM commands to check Windows files (at first there were corruptions, but now it comes up clean and the issue persists)
Used Windows install media to repair the installation
Checked PSU voltages using the BIOS/HWiNFO64 and all seemed within reason
Checked Event Viewer for any details; the only error was the one saying Windows did not shut down properly, because I had to hold the power button to turn it off
Tried daisy-chained and separate power cables to the GPU
Inspected parts for physical damage
Cleaned dust on parts
Ensured parts seated correctly
Performance is as expected outside of this issue, and there is no warning sign before the crash. Because of some of the tests listed above I don't think it is a hardware issue, but I can't rule it out. I don't have any spare hardware to test with, and I need the GPU for display output, so it has to stay in the system. I would prefer to avoid buying replacement parts because I am at university and need my computer, but if push comes to shove it is possible as a last resort.
I hope that gives a good overview of my issue, and any and all help would be appreciated. If any additional tests can be done or more information is needed, just let me know and I will see what I can do.
Quick update, if anyone actually cares:
A completely fresh install of Windows 10, followed by running games, results in exactly the same issues as before.
Literally anything is useful at this point. Please help.
Both the oldest and the newest Ryzen 3000-compatible BIOS versions have the same issue for me. So far the closest I have come to a solution is using the professional drivers. They do lack some features of the newest Adrenalin drivers, however, and I am not sure how performance is affected. I am currently testing them, and so far they seem at least 'more' stable.
In CoD MW the professional drivers actually seem to make it worse, but in literally every other game or computing load they seem more stable. My bet is that it is Infinity Ward's issue.
Did you try resetting the WattMan settings and then reducing the GPU core frequency levels by 100 MHz (or 50 MHz)?
Is the system stable in the OCCT PSU test?
Did you try closing the Radeon software running in the background to improve the chances of the system staying stable?
I used an AMD GPU with a very similar PSU, and lowering the GPU core frequency solved my similar problems. You may be having a PSU quality problem like I did.
I ran the OCCT PSU test and everything appears to be stable with no errors.
My suspicion is that it is either a driver issue or a faulty card. I would rather avoid downclocking the card, as that is not what I paid for and it is still under warranty, but I do need my computer and I don't have another graphics card available right now. So far the Radeon™ Pro Software for Enterprise drivers seem to be working better, but I will try reducing clock speeds if these start acting up too. Blue screens with THREAD_STUCK_IN_DEVICE_DRIVER make me think it is not a PSU issue, but at the same time it doesn't always blue screen.
If the OCCT PSU test passes, you do not have a hardware problem. I think you are looking at this the wrong way. It may sound silly to you, but you won't lose anything by trying it: reduce the maximum core clock speeds by 100-200 MHz for both the CPU and the GPU. I think this is the best way to reduce stability problems. It is also worth trying the system with a good PSU. How old is your current PSU?
My PSU was purchased new along with the rest of my components about 6 months ago. So far I have only crashed once in over a week using the Pro drivers, and I'm convinced that was the application's fault. If I switch back to the Adrenalin drivers I'll try lowering the clocks as you suggested, since I may need Adrenalin for the things the Pro drivers don't support. I don't want to touch the settings right now in case it breaks again. I'll get back to you if I do try that.