First of all, apologies for the length of this. I'm hoping that a) this gives enough information to help anyone determine what is going on, and b) anyone who comes across this knows what to look at to see if their issue is the same...
As I write this, I'm working with a sluggish system that takes approximately 15 seconds just to switch tabs in my browser, and I can type a sentence before the screen updates (yes, I have a GTX 960, and yes it's old and unplayable with any games but I refuse in principle to spend $3k on a GPU). I can drag a window around my screen in circles, let go of the mouse, and watch it continue dragging the screen around for another 5 seconds. I have NEVER seen such strange behaviour until a few weeks ago, starting with some of the supposed "fixes" for Windows 11.
As proof that I'm not crazy, here is a screenshot from Ryzen Master of a system supposedly running at 212W and sitting at 30C, whilst drawing 225A (EDC), yet peaking at 610MHz:
This occurs randomly. The system has been tuned extensively with PBO and per-core CO over MANY months of crashes, BSODs and random idle reboots. It was a very frustrating experience really: CO isn't nearly as "automatic" as AMD would have you believe, as vdroop instability is a massive problem, particularly at idle or under lightly-loaded single-thread loads. But it WAS achieving solid scores using my Lian Li Galahad 360 AIO and passing CoreCycler, OCCT, Prime95, Cinebench, etc. without issue. Not my best scores, but this is my daily-driver setup:
So... what happened? No clue. Nada. I can't pinpoint when it happened exactly, but it seemed to be several days after one of many W11 patch cycles, and only later did I discover a new AMD chipset driver after seeing random BSODs and insane behaviour in Ryzen Master (it was reporting my SOC running at 999W). So I couldn't narrow it down to an issue with the outdated driver, W11 or some other issue. But I'm tending to lean towards a combination of the lot.
I suspect it's sociability issues between the SMU, BIOS (perhaps even an AGESA bug), W11 scheduler and the chipset driver. All I know is that the system is fine for days, or sometimes just hours, then it does this. I've had it happen under load, whilst idle, whilst gaming in old games (GTX 960, remember those?). When it hits, the ONLY fix is shutdown, then power on. A restart will not fix the issue, which to me suggests the bug is in the BIOS or AGESA... or even the CPU itself.
It seems somehow related to the boosting logic, as I haven't seen the issue when PBO is disabled, but the system is just sluggish as I do a lot of single core work too. I was using Dynamic OC (unique to this mainboard), but have it currently disabled in case it was related to the issue. It made no difference, issue still appeared.
How can I narrow it down? No idea, I know of no practical way to query the "source of truth", being the SMU. I have an Asus ROG Crosshair VIII Dark Hero, running the latest BIOS. It also has its own Nuvoton Embedded Controller, which is confirming the data reported by the CPU is simply wrong (note the Avg. 7A compared to Ryzen Master 225A):
The voltages on the system are normal, and the C-states show healthy behaviour meaning the CPU is shifting cores between various ACPI levels - so it's not like the system is "stuck" in any abnormal state here due to a scheduler bug:
The only abnormality seems to be in ANY data reporting from the CPU itself. Note especially that the CPU cores all report nominal power usage, but the SVI2 TFN value for "Core Current" does not match the mainboard sensors, especially given the current clocks and VID/Vcore/Temps. Note also that PROCHOT etc. are not flagged, and I have confirmed there is no mounting issue with the AIO on the core:
This system is mainly a development system, which frequently hosts middleware and front-end websites in virtuals (WSL2+Docker mostly), and it had never skipped a beat. W11 introduced major improvements to WSL2, otherwise I would not have even bothered. I play occasional old games, but gave up on anything modern as playing in 1280x720 and getting 18 FPS is just not a fun experience (neither is spending $3k for a midrange GPU!).
I'm going to do my usual power cycle, and this will be working normally again.
Has anyone got any idea at all on how to narrow down what is causing this? Or any other tests I can run the next time it occurs?
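One thing that might help pin down the onset next time: leave a sensor logger running (HWiNFO can log to CSV) and scan the log afterwards for the first sample where the clocks are on the floor while the reported package power is pinned high. Below is a minimal Python sketch of that idea. The column names (`time`, `core_clock_mhz`, `ppt_w`), the thresholds and the sample data are all assumptions for illustration, not the real HWiNFO headers, so they would need adjusting to match the actual export.

```python
import csv
import io

# Hypothetical thresholds: adjust for your own CPU and power limits.
MAX_STUCK_CLOCK_MHZ = 700.0   # the fault state in this thread sits around 550-610 MHz
MIN_BOGUS_PPT_W = 200.0       # reported package power while the clocks are stuck


def find_fault_onset(csv_text, clock_col="core_clock_mhz", ppt_col="ppt_w", time_col="time"):
    """Return the timestamp of the first sample that looks like the fault state, or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        try:
            clock = float(row[clock_col])
            ppt = float(row[ppt_col])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric readings (e.g. "N/A")
        if clock <= MAX_STUCK_CLOCK_MHZ and ppt >= MIN_BOGUS_PPT_W:
            return row.get(time_col)
    return None


# Made-up sample log, only to show the expected input shape.
SAMPLE_LOG = """time,core_clock_mhz,ppt_w
12:00:01,4650,95
12:00:03,3600,42
12:00:05,610,212
12:00:07,547,215
"""

if __name__ == "__main__":
    print(find_fault_onset(SAMPLE_LOG))
```

Knowing the timestamp of the first bad sample would at least make it possible to line the onset up against the Windows Event Viewer, driver updates or whatever else was running at that moment.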
I have exactly the same problem and can't pinpoint the cause either.
I have a 3600, also with an Asus ROG Crosshair VIII Dark Hero. I've tried everything, from downgrading the BIOS to adjusting PBO values. Nothing. One thing I noticed: with aggressive OC the problem appears more frequently, and with a low PPT (65W) I can go for days without it, but eventually it comes back.
From what I understand reading about similar symptoms, this is probably a faulty temp sensor on the motherboard which gives incorrect temp reading and causes the CPU to throttle. RMAing the board usually fixes the problem, so that's what I'm going to do.
Fascinating to see another Dark Hero with the same issue... let me know if the RMA helps with the issue, I'm hesitating on a warranty claim as my system is used very heavily and I can't afford the downtime right now.
I'd personally be surprised if it was a bad temp sensor on the board (unless there is a design flaw that results in random errors or noise creating bad values?), mainly because board temp sensors are generally less critical than those inbuilt to the CPU etc. Especially given the 5950x has sensors everywhere internally, I'd be more inclined to believe a possible issue with the firmware reading those sensors - given how frequently they are polled and the heavy dependency on PBO curves.
I don't go crazy with my OCs (other than fixing 1:1 mClk-fClk, and enabling dynamic OC at fairly modest values that don't BSOD), but I did fine tune my CO curves very aggressively. The result has been stable and consistent (every core is different from the next, with CCD0 driven way lower on the curves (some to -30) than CCD1 (the best could go to -10, most others to -5)). It's been rock solid until recently, which again suggests something more likely to do with the Windows scheduler changes. After the chipset driver upgrade, it seemed to improve - but I'm getting hard locks with the memory where I never did before - so no idea what to believe any more.
I'm curious... are you on W11 too?
I ended up not RMAing the motherboard, since I kind of found a workaround. I'm monitoring my temps with both AIDA64 and Argus Monitor, and I noticed that my VRM and water temperature sensors sometimes give a reading of N/A, and Argus sometimes gives me a warning that my CPU temperature exceeds the 85C limit, although it was nowhere near that. That confirmed my suspicion that something is going on with the temp sensors.
What I did was change CPU Power Duty Control (Extreme Tweaker - External Digi+ Power Control) in BIOS from T. Probe to Extreme, and it solved this problem. Although my system sometimes still behaves weirdly, I haven't encountered this issue since.
For future reference and for the sake of better troubleshooting, here are some more details on my setup:
- Motherboard: ASUS X570 Strix Gaming-E
- CPU: Ryzen 9 5900X
- RAM: G.Skill 64GB F4-3600C16Q-64GTZNC
- PSU: Corsair RM1000i
- GPU: MSI RTX 2080 Super Gaming Trio X
- OS: Windows 10 64-bit - Version 21H2 - Build 19044.1503
- BIOS version: 4021
- AMD Chipset version: 184.108.40.2066
- NVIDIA Driver version: 511.23
- Corsair iCue version: 4.19.191
- MSI Dragon Center version: 220.127.116.11
- Argus monitor version: 6.0.05
- AI Suite 3 version: 3.01.10
- Armoury Crate version: 18.104.22.168
- MSI Afterburner version: 4.6.4
- Razer Synapse version: 3.6.1215.121004
Things I've tried that failed:
- Change Power Plan from Balanced to High Performance
- Editing Minimum and Maximum processor state in power plan
- Reinstalling ASUS MB Drivers / Utilities
- Reinstalling AMD chipset (22.214.171.1246)
- Reinstalling Corsair iCue and MSI Dragon Center
- Restoring Win 10 to the last working restore point
- Reinstalling NVIDIA Drivers
- Reseating RAM Modules
- Reapplying Thermal paste and reseating CPU
Yesterday I tried the following:
- Using both 8 pin and 4 pin for CPU power.
- Deactivating the 3600 MHz profile for my RAM modules. They now run at 2133 MHz.
- Everything else is running stock settings in BIOS. No OC.
For now, it still runs at normal clock speed. I'm monitoring with HWiNFO.
I will provide an update as soon as anything changes. Most likely it will fail sooner or later.
I've attached some screenshots of the readouts from HWiNFO and Ryzen Master recorded while the clock speed is stuck at 0.5 GHz. Look at the Power Reporting Deviation. That's nuts.
Seems I am having the exact same problem, though with a few different pieces of hardware.
GPU - EVGA 3090 KINGPIN Hydro Copper
CPU - 5950x with Optimus AM4 block
MOBO - ASUS x570 Crosshair VIII Formula
RAM - GSkill 32gb (4x8gb) 3600 cl16 / Corsair Dominator 32gb (4x8gb) 3600 CL16
PSU - Corsair RMx 1000w
OS - WIN 10 Pro
BIOS - 3904 and 3801
The issue started about a month ago. It took a while and a lot of complete teardowns of my entire rig to finally notice that my PPT, EDC, and TDC values were maxed with PBO on while idle. If I turn PBO off, I'm capped at 547 MHz and they are still capped. Some days it'll run fine, and others it'll crash 5 times in 30 mins with a PSU reset required each time. I have tried clearing CMOS, different power plans, reverting BIOS and chipset drivers, and reinstalling WIN 10 Pro and all drivers from scratch. The problem remains. I have NOT tried the T. Probe to Extreme change listed earlier and I might try that next, but for now I am just lost and pissed. This is too expensive of a rig and I am too paranoid to deal with this ha. Really weird though that it's all ASUS boards of varying models. Leads me to believe that something in a BIOS or chipset driver is causing these issues. I use Ryzen Master, HWiNFO, OCCT, and AIDA64 to monitor, and the readings are wrong in all of these programs.
iCUE yes. I'm at work atm, but it is for sure the most up-to-date version, because after the Windows reinstall I just went to their download page and got it again.
I do NOT have Argus Monitor, I use AIDA64, maybe they have the same flaw? I am not sure. Last night, I might have made a rash decision and just bought a new 5950X and motherboard, just in case. Wife won't be happy ha. Just not opening them until I troubleshoot this completely. I guess I could disable iCUE and AIDA64 and see if something there is causing the problem. Just gonna be tough to monitor without my sensor panel.
Rumor has it that iCUE or Argus may be the cause.
Regarding iCUE specifically, a rollback to version 4.15.153 should fix it.
I've uninstalled both at the moment to see if it changes anything.
I would not be surprised honestly lol. I am not sure if I had installed iCUE before starting tests after rebuilding the computer, though. Tonight when I get home, I will uninstall iCUE and stop AIDA64 from launching at startup and see if that works. If it doesn't, then I have no idea. The only reason I run iCUE anyway is to use the G-key on my keyboard to mute my Roccat Torch mic with Stream Deck. My keyboard and mouse use hardware profiles that do not need iCUE running or installed to maintain their macros and programming.
After reverting iCUE, have you noticed anything else or had any issues in further testing? I disabled iCUE last night but the problem still happens. If I leave everything on default in BIOS, the CPU is almost immediately at 547 MHz, and I cannot raise it at all. Ryzen Master shows the default TDC, EDC, and PPT, however it's at like 150% of the default value. If I enable PBO, it has the higher default values and the CPU clocks are normal for a while, yet it still crashes under testing/gaming.
I haven't had any issues yet.
Have you tried resetting CMOS after reverting and/or uninstalling iCUE?
I have a feeling that CMOS is saving some bogus readings and that it has to be reset before further testing can be done.
Yes, pretty much anytime it hangs/crashes, I clear CMOS to ensure the BIOS settings are back to normal. I am going to try and mess with the AIDA sensors tonight and uninstall iCUE and see. Hopefully it'll work.
I have never had Argus installed, and iCUE is still installed on my computer, I just disabled it and closed out of the application. Tonight I will go and do a full uninstall and try to cancel/delete the appdata folders and services. So we shall see, I really hope it's something stupid like this. I tried the previous chipset driver mentioned earlier in the thread but yeah, it still happens.
I seem to have the same issue.
Processor: Ryzen 5900x (no OC)
Mainboard: ASUS ROG Strix X570-E
I have 2 variations of the problem:
1. SoC Current is 90A and never varies. I don't really notice it under idle but as soon as there is some load the frequency tanks because the processor wants to stay within the PPT limit.
2. CPU Core Current and CPU EDC show 200+A (I have seen up to 250A), CPU PPT shows 200+W, I have seen Power Reporting Deviations of 6000%+ and the Clocks go to around ~546 MHz. My PC uses 110 W (measured externally) at that point.
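The second variation can be sanity-checked with simple arithmetic: the CPU package cannot be drawing more than the whole PC pulls from the wall, so a reported PPT above the external measurement has to be a reporting error rather than real consumption. A rough Python sketch of that check, using the numbers above (200 W reported, 110 W at the wall). The function name and the PSU efficiency figure are made up for illustration:

```python
def ppt_is_plausible(reported_ppt_w, wall_power_w, psu_efficiency=0.90):
    """Return True if the reported CPU package power could fit inside the wall draw.

    psu_efficiency is an assumed figure, not a measured one. The DC power
    available to ALL components is at most wall_power_w * psu_efficiency,
    so the CPU alone can never exceed that budget.
    """
    if not 0.0 < psu_efficiency <= 1.0:
        raise ValueError("psu_efficiency must be in (0, 1]")
    dc_budget_w = wall_power_w * psu_efficiency
    return reported_ppt_w <= dc_budget_w


if __name__ == "__main__":
    # 200 W reported PPT against 110 W measured externally
    print(ppt_is_plausible(200.0, 110.0))
```

With those figures the check comes back False, which is consistent with the telemetry being bogus rather than the chip really pulling that much power.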
Both errors occur randomly after a while (if I'm unlucky after a few minutes, if I'm lucky after several hours). If the first error occurs, it develops into the second one after some time. To reliably fix it temporarily, I shut down the PC and cut the power for a while. A simple restart does not always work.
I have tried updating and reinstalling the chipset-driver, disabling the C6-state, setting minimum power limit.
It's interesting that all mainboards here seem to be from ASUS. But maybe they are just popular.
I resolved the issue by rolling back to Windows 10 and disabling fTPM in BIOS. Rollback to Windows 10 by itself did not resolve the issue until fTPM was disabled.
As an update to this, I still very, very rarely get the issue even with Win10+fTPM disabled, however it has only occurred once in a couple of weeks, whereas before it was basically any time I gamed, without fail.
I don't have iCUE at all.
I'm absolutely losing my mind about this. I've swapped the motherboard twice and the chip once, and the same behavior (high PPT reading at 400W + EDC at 130 percent of 200A) reoccurs. In my case, only a full flash or restore of the BIOS fixes it. The odd part is that since the last chip swap, I get throttled to 4.12GHz instead of .54, which would be hard to notice if I weren't already looking for the crazy Ryzen power number.
I'm pretty sure that the numbers are spurious, since pushing 400W through the chip seems likely to have some unmissable side effect, but it also doesn't seem to be a bad sensor, since the failure state reads consistently across 3 motherboards.
Honestly not sure what direction to take in debugging this.
Just wanted to provide an update to my situation.
I had tried restoring WIN 10 to an older version, and running BIOS at stock to no avail.
I reinstalled Corsair iCUE and reverted back to v4.15 from last summer, and now everything is working perfectly fine. There must be some problem between iCUE, Windows, and the chipset driver that is causing the MOBO or CPU to think the CPU is extremely taxed, so it's downclocking to adjust for the load/power.
I'm glad to hear it helped. Now the big question is, do we dare to update iCUE next time an update is rolled out?
Guess we'll just have to wait and see.
Also, for the people with the ASUS X570 Strix Gaming-E board, there's a new beta BIOS out. Don't know if I'll install it just yet.
When Corsair released iCUE v4.20 last week I updated to it, thinking maybe they had identified and fixed the problem. They hadn't. I'll probably update next time and just keep version 4.15 saved to revert to, just in case.
Try changing Windows/Ryzen Power Plans and see if it makes a difference. Also configure your CPU Minimum State to 5% and Maximum to 100%.
I believe the Ryzen or Windows Power Plans automatically set the Minimum Processor State to around 95% or higher.
This recent AMD thread had a similar issue with very high PPT: https://community.amd.com/t5/processors/very-high-ppt-soc-power-amp-low-cpu-frequency-3900x-win11/m-...
Thanks @elstaci I have actually tried playing with the power plans, but it made no difference in my case - including adjusting the minimums. However, the other link you mentioned is an identical issue to mine - very fascinating to see. Again, W11 and an Asus board...
Does anyone have any tools to query the SMU directly to see what it *thinks* it is seeing? I really wonder if this is a firmware issue, but without querying the SMU directly, I can't determine if it's W11, the chipset driver or SMU/BIOS...