I've spent the past 3-4 days troubleshooting what at first appeared to be an OS/hardware issue, but now looks like a graphics card/software issue.
I've got an XFX RX 570 RS 8 GB installed in a ROG Strix 570-F Gaming with a 5900X CPU and a 750W Seasonic PSU, all running Win11.
For the past few days I keep getting sudden restarts, preceded only by a glitch, whenever some load is put on the GPU in games.
Better description of the issue:
The system was stable, with Windows Update on and up to date. The Radeon driver has given me headaches for as long as I've had the card, nearly 2 years. However, the problems were only at the app level and never affected functionality, so I didn't bother.
Since 3-4 days ago, and I'm not really sure what happened, be it a Windows Update (which I can't find in the installed updates) or the new Radeon driver, each time I open a game and try to actually play it, I get random restarts and occasionally a WHEA Event ID 18 entry in the Event Log.
Steps taken in the attempt to remediate the issue:
Check Windows for updates
Scan for viruses
Install separate instance of Win 10 (on another drive)
Reset Windows 11
Format all OS drives, install clean Win11
Format all OS drives, clean install Win10
MULTIPLE Radeon drivers (WHQL or not; even the Pro driver was tested)
BIOS update for the motherboard
BIOS flash for the GPU, reverted to stock BIOS
Remove 2 sticks of RAM (then the other 2)
Try with different PSU (Even though the Seasonic one is brand new, about 1-2 weeks old)
Stress test CPU
Stress test GPU
Stress test Memory
Regedit to add/modify TdrDelay
Safe mode + DDU (Display Driver Uninstaller), then reinstalled the GPU + chipset drivers
Played with UEFI settings for OC and TPM multiple times then reverted to default (as suggested on many forums)
Checked the dumps with "WhoCrashed" (logged under C:\Windows\LiveKernelReports\WATCHDOG) and got only the suggestions below:
"A livedump triggered by dxgkrnl occurred. You may have problems with your graphics driver or hardware.
The crash took place in a Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time."
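For anyone wanting to repeat the TdrDelay step from the list above, here is a minimal sketch that only builds the text of a .reg file. The key path and the TdrDelay value name are the ones Microsoft documents for Timeout Detection and Recovery; the 8-second value, the function name and the 60-second sanity bound are my assumptions for illustration, not what was actually used in this thread.

```python
# Sketch only: builds the text of a .reg file, it does not touch the registry.
# TDR_KEY and "TdrDelay" are the documented TDR registry location and value name.
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg(seconds: int) -> str:
    """Return .reg file text that sets TdrDelay to `seconds` (a DWORD)."""
    # Hypothetical sanity bound: a huge delay just hides a hung GPU for longer.
    if not 0 < seconds <= 60:
        raise ValueError("TdrDelay should be a small positive number of seconds")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        f'"TdrDelay"=dword:{seconds:08x}',  # .reg DWORDs are 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 8 seconds is an assumed example value (the Windows default is 2).
    print(tdr_delay_reg(8))
```

The printed text would still have to be saved as a .reg file, imported as administrator, and followed by a reboot. As far as I know a longer delay only gives the driver more time to recover from a hang; it would not be expected to fix a faulty card.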
Absolutely nothing from the above allowed me to game. The only time I could run a game was right after I removed all the drivers with DDU in safe mode. After the restart I got the generic driver and was able to run a game at a really low resolution, but obviously couldn't actually play it. I didn't get any crash then.
The games I tried vary, so it's not isolated to a single title, but what I've seen is that a 2D game seems to let me play (for example Salt and Sanctuary, from Epic). I am not 100% sure it's stable in 2D games, as I only tried it for a couple of minutes, but that is more than enough to get any other game to restart, if not to put the whole system back at the login screen.
I am really out of ideas, except going to a Linux distro and trying some Epic games there. However, I would really love to get some AMD support on this, given that prices for mid-spec GPUs go way above the price I paid for my CPU (and given the recent experience, where I can only blame the Radeon stuff, I would stay away from it).
I am sorry for the very long message (I can assure you it's way shorter than the hours put into investigating this, and the lack of sleep over it), but I tried to describe as well as possible all the things I did.
There's a long list of problems that could cause WHEA Event ID 18.
Regarding dxgkrnl, follow this guide - https://www.partitionwizard.com/disk-recovery/dxgkrnl-sys.html
I have a feeling it's a DirectX issue as mentioned above, or bad RAM (configuration).
Wouldn't a DirectX issue be sorted by a clean install of either Win10 or Win11? The thing is that I get different errors (the ones above are the most recent from the 3 days of intensive troubleshooting and testing), but each time with a different driver installed.
In terms of hardware, I have excluded pretty much everything: the stress tests pass, I can currently stress test the machine in Linux Mint, and the generic graphics driver is not failing. So surely this must be related to something outside of my computer specs, and while WHEA refers, in theory, to hardware, I am 99% sure it's Radeon.
RAM can't be at fault, simply because, as I said, a few days before this everything was working perfectly fine with exactly the same hardware in the case. NO changes in BIOS, and the RAM has not been reseated for a good few months (since install).
Both pages have been looked at already. I've gone past the point of Google searching each error in turn, and I'm hoping I'll at least find a way to discuss/report this to AMD, but I either can't find an easy way or I'm too tired of them and not paying attention.
The last thing I can, and probably will, end up doing is installing an old GPU just to confirm whether Radeon is a pile of sh1t or not.
Thanks for the effort anyway!
Another thing I forgot to mention: sometimes I get the message that the Radeon WattMan settings have been restored, right after the crash and reboot.
The funny thing is that I don't get the WattMan settings in the Radeon app.
Also funny: I am currently on version 21.10.2, from 5/10/2021, and it says it's up to date. Windows wants to give me another driver, and at certain points Radeon actually suggests I update to the most recent version of Radeon Adrenalin, which is an optional one. But not always, and not right now; right now I am up to date.
Clearly it looks like the crappy app and driver are playing with me...
One thing I didn't see you talk about is your PSU... these can cause hardware instability and are often overlooked. I only learned from experience that a PSU I had was causing stability issues when my GPU was really pushing its limits, while it was otherwise rock solid under normal system load. Swapping out the PSU for another one solved the issue for me.
PSU is a Seasonic 700W.
Tried with another PSU which I know is working fine.
The only "change" to the system is a Seasonic case + PSU. The upside-down one, Q7.
And yes, I tried with the mobo out and in the "correct" position, with both PSUs.
Just to update this thread: following very long efforts, I have excluded absolutely all components except the video card.
Swapped the GPU for a 1050 Ti and gamed for about 1 hour.
No crashes or other AMD nonsense.
Installed my card in the PC the 1050 Ti came from, which is known to work.
Same issues, with a fresh driver.
Therefore, I am left with 2 possible scenarios:
Faulty GPU all of a sudden
Faulty Radeon/Windows drivers.
Given that I am not getting this issue in benchmark apps, I would go with the 2nd option.
I will try a warranty repair/replace/refund; however, at today's prices the money I would get for the card would only buy me a low-end GDDR5 card.
Thanks, AMD. I am officially leaving you in terms of GPUs.
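The swap tests above amount to a simple elimination: anything that took part in a crash-free run is cleared, and whatever appears only in crashing runs stays suspect. A rough sketch of that reasoning follows. The labels "my PC" and "donor PC" and all function/class names are made up for illustration, and the last entry (donor PC running fine with its own 1050 Ti) is an assumption based on that PC being "known as working".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapTest:
    gpu: str       # which card was installed
    host: str      # which PC (board, CPU, RAM, PSU, OS install) it ran in
    crashed: bool  # did gaming crash/restart the machine?


def suspects(tests):
    """Return components seen in crashing runs and never in a crash-free run."""
    cleared, involved = set(), set()
    for t in tests:
        parts = {("gpu", t.gpu), ("host", t.host)}
        if t.crashed:
            involved |= parts
        else:
            cleared |= parts
    return sorted(involved - cleared)


# Results as described in this thread (labels are hypothetical).
THREAD_TESTS = [
    SwapTest("RX 570", "my PC", crashed=True),
    SwapTest("GTX 1050 Ti", "my PC", crashed=False),     # ~1 hour of gaming
    SwapTest("RX 570", "donor PC", crashed=True),        # fresh driver
    SwapTest("GTX 1050 Ti", "donor PC", crashed=False),  # assumed: "known as working"
]

if __name__ == "__main__":
    print(suspects(THREAD_TESTS))
```

This only narrows it down to the card plus whatever driver it needs; it cannot tell a hardware fault apart from a driver fault, which is exactly the two scenarios listed above.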
Glad you at least figured it out. I'm not sure how old your GPU is, but I don't think AMD really could have helped you out here. They don't have any formal call center for troubleshooting cards. That's mostly in the realm of the board manufacturers and others giving troubleshooting help from the internet.
I know there are a few AMD branded cards that they made directly for the retail segment, but that is really less than 1% of all AMD GPUs sold.
Enjoy the GeForce Experience software. It's so much better than the Radeon GPU software ecosystem. I switched to nVidia GPUs back in 2019 and am so much happier with the software support nVidia hardware has.