I agree with you.
The "Typical Current Idle" is a workaround for one form of "freeze-while-idle". This appears to be a hardware issue, related to the CPU going into a deep sleep (C6) from which it cannot be woken. Let me call that "frozen-solid".
The "Typical Current Idle" is (I believe) provided by late model AGESA -- so is (I believe) an AMD provided workaround. AMD have not cared to explain what this option actually does or what problem it works around.
In "Erratum 1109" AMD admit to a problem with MWAIT which may be the root cause for "frozen-solid", for which there is "No fix planned".
There appear to be other "lockups" where the CPU is not "frozen-solid". In particular where some threads stall or stop, but not everything.
It could be that these are symptoms of "frozen-solid" at a Core or (hardware) Thread level... perhaps the OS has a (software) thread allocated to a (hardware) Thread which has "frozen-solid" and will not wake up when kicked... perhaps it is when all (hardware) Threads are "frozen-solid" that one sees the CPU "frozen-solid"... who can tell ? In which case, perhaps a wide range of "lockups" are, indeed, related.
However, the other "lockups" could be software or other hardware. In particular, I note "lockups" for which "Typical Current Idle" does not help. Also, I note "lockups" which "idle=nomwait" does not help. These feel to me like different problems... but I have no way of knowing.
AMD could cast some light on this. But I am not holding my breath.
[FWIW, the Kernel Maintainers are also notable for their total silence; which is another disappointment.]
I feel for you 😞
My experience is with the earlier Ryzen 1800X. With that I tried:
Only the last worked for me. Though with --c6-package-disable and rcu_nocbs, it froze after 11 days.
I have not tried idle=nomwait. My guess is that this is another way of avoiding going anywhere near C6.
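One way to check that guess, rather than take it on trust, is to look at which idle states the kernel actually exposes and how often each has been entered. The following is only a sketch: it assumes the standard cpuidle sysfs layout (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/name` and `usage`), and the function name `list_idle_states` is mine. Which state names appear depends on the idle driver in use, so treat the output as a hint, not proof.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return sorted (index, name, usage_count) for each cpuidle state of one CPU.

    Returns an empty list if the kernel exposes no cpuidle directory there.
    """
    states = []
    for state_dir in Path(cpu_dir).glob("cpuidle/state*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        # 'usage' counts how many times this state has been entered.
        usage = int((state_dir / "usage").read_text().strip())
        states.append((index, name, usage))
    return sorted(states)
```

Calling `list_idle_states()` once before and once after changing the boot parameter, and comparing the usage counts of the deepest state, should show whether that state is still being entered.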
AMD have specified that the PSU must be capable of maintaining 12V at 0A. The Aerocool KCAS 650 claims to be "Compliant with ATX12V Ver.2.4". ATX12V v2.4, section 3.2.10 seems to:
If AMD really mean 0A (and 0.05A is not good enough), then the Aerocool KCAS 650 may or may not satisfy the requirement. [But in my experience, 12V at 0A did not fix the problem.]
Since you are suffering even with "Typical Current Idle", I wonder if the problem is not the same. Also, I did not see "rcu_sched detected stalls..." and it was generally days before it froze.
In your position, I would put to one side all the C6, PSU etc. voodoo, and see whether the "rcu_sched detected stalls...", and any logging associated with that, can tell you anything more about what is going on.
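For pulling out the stall reports and whatever the kernel logged right after them, something like the following might do. It is a minimal sketch: the marker string is the one quoted above, and the function names and the choice of five lines of context are my own assumptions, not anything the kernel defines.

```python
import subprocess

STALL_MARKER = "rcu_sched detected stalls"


def find_stall_context(lines, marker=STALL_MARKER, after=5):
    """Return one block per stall report: the matching line plus `after` following lines."""
    blocks = []
    for i, line in enumerate(lines):
        if marker in line:
            blocks.append(lines[i:i + 1 + after])
    return blocks


def read_kernel_log():
    """Read the current kernel ring buffer with dmesg (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout.splitlines()
```

`find_stall_context(read_kernel_log())` covers the running boot. After a hard freeze and reset the ring buffer is gone, so if the journal is persistent, feeding the output of `journalctl -k -b -1` through `find_stall_context` is probably the better bet.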
Seems the instability RETURNED without anything changed on hardware / software / OS!
In the last 3 days my X screen has frozen about 4 times, with the mouse cursor still alive. Each time I needed my favorite sddm restart via SSH (sudo systemctl restart sddm) to bring the X screen back.
Also there is a NEW DISCOVERY CONTRARY to the past:
While X on the Ryzen was hung, I could watch it from a remote PC over SSH with X forwarding. Quite unlike all the previous times, I could see CPU 12, 13, 14 and 15 stuck at ZERO%, and CPU 10 and 11 at zero most of the time, only occasionally bumping up to a few percent. I observed this for approximately 5 minutes.
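For anyone who wants to record the same per-CPU picture from an SSH session without a graphical monitor, here is a rough sketch that samples /proc/stat twice and reports the busy percentage of each CPU. The function names and the 5 second default interval are my own choices; the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the one documented for /proc/stat.

```python
import time


def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from the content of /proc/stat."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep per-CPU lines such as 'cpu12'; skip the aggregate 'cpu' line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(x) for x in parts[1:9]]  # user..steal; guest time is already in user
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        out[parts[0]] = (total - idle, total)
    return out


def utilisation(before, after):
    """Busy percentage per CPU between two parse_proc_stat() snapshots."""
    pct = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        delta = total1 - total0
        # If the counters did not advance at all, report 0.0 instead of dividing by zero.
        pct[cpu] = 100.0 * (busy1 - busy0) / delta if delta > 0 else 0.0
    return pct


def sample(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-CPU busy percentages."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return utilisation(first, second)
```

Running `sample()` in a loop and logging the result to a file would at least leave a record of which CPUs sat at zero during a hang. Zero busy time alone does not distinguish a frozen thread from a genuinely idle one, so it needs reading alongside what was supposed to be running.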
An update on my VPN re-connect difficulties: I now suspect them less of being Ryzen related - more likely an ISP network / firewall issue.
I realized that the recent setback to stability may be related to 2 of my several VMWARE virtual machines having their kernels updated automatically to 4.18.0-13 and 4.18.0-14, with which I had previously encountered BIG PROBLEMS [in the actual host machine's OS], solved there by changing kernel. I am not able to change the virtual machines' kernels yet, so for the time being I added the idle=nomwait kernel parameter to their boot command lines and rebooted them. I will observe for a further period, until I have time to change kernels.
I am not sure if this Ryzen Issue will affect Virtual Machines or not??
And if a virtual machine's thread hangs will the host also die??
I found today the AMD RYZEN ERRATA PDF, mentioning their various issues discussed here.
Disappointing to see their table listing these bugs as NO FIX PLANNED!!
Wondering if the AMD μProf tool would help here?
They say it can do:
Could the profiling be done remotely, so that we can get some logs? But then I am also not sure if the system would be idle enough, with the profiler processes running, to reproduce the problem.
Don't we have some JTAG-like interface for hardware debugging in Ryzen CPUs? I believe modern Intel CPUs can be hardware debugged through a USB 3 port.
I think I have made my most significant discovery so far and would like to share it here:
All this while I have focused only on the physical Ryzen computer, which is a server running up to 8 vmware virtual machines. What I had not thought of was the virtual machines themselves!
Today I suddenly realized that even the mwait instructions running inside the virtual machines will freeze up threads. This includes ALL virtual machines, including those emulating 32 bit CPUs on my Ryzen. The only slight difference is that the vmware virtualization layer seemed to mitigate the freezing effect so that it is NOT a complete lockup - the virtualization layer will still take the physical CPU away from the frozen threads inside the virtual machines.
This is why I saw partially sticky / unresponsive behaviour and not an entirely dead machine - I could still SSH into this physical server, do my soft reset via the command sudo systemctl restart sddm on the server, and get things back on track. My Ryzen vm server itself already had its own kernel startup parameter set to idle=nomwait, which is why it was not too dead by itself and I could still ssh into it!
I was still getting crazy problems all this while because the virtual machines' threads had frozen!
I have also now confirmed that BIOS version F4e for my Gigabyte GA-X470 prevents the C6 powerstate lockup, which I believe, if it happened to me, could only be unfrozen with the motherboard reset key. Since the BIOS update I have not been in such a horribly bad state.
My Linux virtual machines now boot with the idle=nomwait kernel parameter added. Stability of the overall system improved further.
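With several guests to keep track of, it is easy to lose sight of which ones actually booted with the parameter. A small check along these lines, run inside each Linux guest, reports any idle= setting on the running kernel's command line. This is only a sketch; the function names are made up for illustration, and it says nothing about whether the parameter has the hoped-for effect, only that it is present.

```python
def idle_settings(cmdline):
    """Return the values of every 'idle=' token on a kernel command line, in order."""
    return [tok[len("idle="):] for tok in cmdline.split() if tok.startswith("idle=")]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

For example, `idle_settings(read_cmdline())` returning `['nomwait']` would confirm the guest picked up the change after reboot, while an empty list means the boot loader configuration was not applied.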
However, I have not yet reached the bottom of all my troubles, because I have 3 non-Linux virtual machines which are BSDs. I am not yet certain whether the idle=nomwait kernel parameter is applicable to BSDs; I need to read the documentation further.
This explains a little about my VPN re-connection difficulties - the VPN server is BSD. While it was connected, there was a periodic ping every few minutes to keep the link alive, which in some way prevented the mwait instruction from freezing the thread. But once it got disconnected and went idle, the mwait freeze occurred, and thus re-connection became impossible.
Your post gave me an idea for further testing. I've been struggling with the freezes for the last several weeks. Now they do not interrupt my workflow, but I can still trigger them if I want to.
If I install 18.10, everything seems to work fine with idle=halt, but if I install VirtualBox with 18.04 and start compiling curl in a loop, the HOST will freeze after a couple of hours. Even if the host is Windows 10, I can still freeze it with 18.04 running in VirtualBox, even with idle=halt as a kernel parameter for the guest.
This behavior happens on 4 (!) different machines, including my home PC, which has a Ryzen 1700 and not a 2700X like the three others.
Throwing my hat into this issue. Got a cheap refurbed HP 580-137c off of woot, so BIOS updates may not be as frequent and the BIOS will definitely have limited customizable stuff.
The last 2 (kernel argument & zenstates) I just did after dealing with random lockups every day or two while computer is mostly idle. Hopefully the zenstates thing pans out.
Before making above changes, I could not get system to stay up for more than two days. I am 8 days in since doing the above.
Big thank you to all who contributed to this thread