imshalla
Adept II

Ryzen linux kernel bug 196683 - Random Soft Lockup

I have Ryzen 7 1800X, ASUS Prime X370-PRO, running Fedora 26 and 27.

The damn thing has not worked properly since I bought it.

My first CPU was RMA'd, and the replacement does not appear to suffer from the SEGV fault.

However, the original and the replacement CPU both crash regularly:

  * occasionally with streams of "watchdog: BUG: soft lockup" events being logged,

  * but mostly the system just stops and I can find no logging that tells me why.

At bugzilla.kernel.org I find bug 196683, where a "workaround" is suggested:

1) kernel configured:  CONFIG_RCU_NOCB_CPU=y

2) kernel command-line:  rcu-nocbs=0-15

But with kernel 4.14.18-300.fc27 the machine still stops overnight (when it is idle) every two or three days.

I have added kernel command-line "processor.max_cstate=5", which may help with the crashes, but (I assume) not with the electricity bill :-(
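For anyone who wants to confirm which of these options the running kernel actually received, here is a minimal Python sketch (not taken from the bug report) that parses the contents of /proc/cmdline. The function name `parse_cmdline` is made up for illustration. It folds dashes in parameter names to underscores because, per the kernel's parameter documentation, the two are treated as equivalent in names, so "rcu-nocbs" and "rcu_nocbs" come out as the same key. It does not handle quoted values containing spaces.

```python
def parse_cmdline(text):
    """Parse a kernel command line into a dict of parameter -> value.

    Dashes in parameter names are folded to underscores, because the kernel
    treats '-' and '_' as equivalent there. Flags without '=' map to None.
    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name.replace("-", "_")] = value if sep else None
    return params
```

On the affected machine, `parse_cmdline(open("/proc/cmdline").read())` should show whether `rcu_nocbs` and `processor.max_cstate` are really present after a reboot.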

Does anybody understand what the real fault is ?  A "workaround" is all very well, but not entirely satisfactory.  It's not as if this is a new device any more.

0 Likes
122 Replies
simonsaysthis
Journeyman III

I have tried pretty much everything: Ubuntu with the 4.15 LTS kernel and Fedora with 4.17. None of the above methods really eliminates the problem for me. I still get random freezes on reboot or after the PC has been idle overnight. A few months back I changed from a Ryzen 5 2400G to a Ryzen 5 1600X, hoping the older model without onboard graphics would be more Linux friendly.

Overall pretty put off by AMD now, there shouldn't be all this manual tweaking just for the basics.

0 Likes

What's your hardware configuration? Are you running the latest BIOS?

My machine was constantly freezing and I was very unhappy with my purchase. Now it's running perfectly. All I had to change was the "Typical Current Idle" setting in BIOS.

Asus A320M-K + Ryzen 5 1600X. I have updated the BIOS to the latest available and changed to the "typical" profile as recommended. I still have ACPI issues, especially when rebooting or waking the machine. Either the fix doesn't work for everyone or Asus has been writing poor firmware code recently.

0 Likes

Hi,

Well, that is not as easy as you may think it is.

You should really think about the situation for most board vendors right now. They do *NOT* support Linux, meaning the BIOS is written for Windows, the defaults are meant for Windows, and whatever custom interfaces exist are meant for Windows.

But that is NOT AMD. It is your board vendor who decides what goes into the BIOS and how, what the defaults are, and what firmware they use.

The Ryzen/TR AMD platform is still new, and as with every new platform, OS integrators and developers need some time to implement features and drivers in all OSes. And as always, we need to be patient.

BR

0 Likes

When I bought my Ryzen 7 1800X in mid March 2017, it was brand new.

In mid Nov 2017 I finally lost patience, and bought an i7-8700K, which was also at the time brand new.  Having transferred everything that really mattered to me to the new machine, I RMA'd the Ryzen 7.

Both machines run a current Fedora and have ASUS motherboards.  There are no prizes for guessing which one has worked reliably from day one.

It is no doubt true that each board vendor is responsible for their BIOS.  However, whatever they support is limited by the support they get from the CPU vendor.  The "Typical Current Idle" option is a feature of the AGESA (see AGESA - Wikipedia) software component, provided by AMD.  The various board vendors have updated their BIOSes at different times.  As far as I can see, the first Ryzen 7 CPUs were almost a year old at the time AMD released the version of AGESA with the "Typical Current Idle" option.

From a Linux perspective, it seems a shame that the "Typical Current Idle" seems not to be the default BIOS setting.  You may well be correct that this is because Windows has some other way of avoiding whatever the underlying problem is, so the default is fine for Windows.

It seems to me that for Linux the "Typical Current Idle" option is required if you wish to avoid the "freeze when idle" problem.  It's not obvious though, is it ?

I cannot find anything which tells me what the "Typical Current Idle" option actually does, or whether the Linux Kernel could implement same independently.

If you buy a Ryzen 7 today to run Linux, you are in for a surprise the first time you get a "freeze when idle".  Sadly, the freeze looks a lot like a hardware issue, so you may waste your time fiddling with memory etc. (particularly if you are overclocking !)  Eventually, you may stumble across this or other threads and discover the "Typical Current Idle" option.  You may first stumble across other, earlier speculation and suggested work-arounds, which don't work.  <sigh>

Who knows whether the more recent AMD CPUs do or do not suffer from similar problems ?

In response to a support ticket, AMD said first (a) that they recommend a PSU which supports 0A draw at 12V, and later (b) that I should try the "Typical Current Idle" option (when my motherboard supported it).  I did upgrade to a more modern and more efficient PSU, which the manufacturer was happy to confirm supported 0A draw on all voltage rails.  The new PSU was an improvement, but had no discernible effect on the "freeze when idle" problem.

Otherwise, AMD, the motherboard vendors and the Linux maintainers all remain silent on the topic -- correct me if I am wrong here.  The "freeze when idle" problem is irritating.  The silence is infuriating.

But, as you say, patience is a virtue.  The next time I think of buying a brand new AMD CPU, I will remember that I ought to wait some time before expecting it to be usable.  For the Ryzen 7 it seems I should have waited a little over a year.  Once bitten, twice shy they say, so next time perhaps I should wait two years or so.  Or longer.  Maybe much longer.

imshalla​, appreciate the well articulated post

The silence around this issue is quite strange!

I used to wonder whether, if AMD let the cat out of the bag, it could be forced into a massive recall, just as happened with the "performance marginality problem".

Then why does it not show up so evidently on Windows? Perhaps they worked with Microsoft and issued a patch that is either a proper fix, or makes the Windows kernel put a little load on the CPU while reporting zero load to the kernel APIs that, say, Task Manager would invoke. Or it could be that the way Windows works with Ryzen simply does not push the CPU into this lock-up situation.

Still, all of these thoughts are just guesses.

Sometimes I used to wonder if we (just a handful) are the only ones facing this problem.

I am also surprised that Michael Larabel of Phoronix, who plays around with Linux 24/7, does not seem to have faced this issue. (Note that, AFAIK, he was one of the few who popularized the performance marginality problem, which eventually elicited a response from AMD.)

I also wonder why we don't see any Threadripper or EPYC owners commenting about it here.

[Image: Mark Twain quote, "The truth hurts, but silence kills."]

shinobi wrote:

I am also so surprised to see that Michael Larabel of Phoronix, who plays around 24/7 with Linux has not seemed to have faced this issue

Perhaps the result of a golden sample.

But you are right, we have no insight in the statistics.

Are the problems mostly in the first Ryzen series, or are they also present with the 2600/2700?

And is Typical Current Idle the solution or not? Is it well implemented on the earlier motherboards, or does it only work well in the latest series (450/470)?

Still hoping that the problem has disappeared with the latest hardware.

But we do not know the statistics: how many 2600/2700 owners are using Linux, and how many of them have the problem.

I think AMD has some of the answers, but they are completely silent, and that worries me.

0 Likes

It does not just seem like a hardware problem.

I am using Gentoo on an AMD Ryzen 5 1600 Six-Core Processor, so I can play with kernels without being dependent on distributions.

The strange thing is: there are Linux kernels that work without command-line options and without "Typical Current Idle".

For example, 4.17.14 works wonderfully, as does 4.19-rc2 from the git sources. Really good. Without parameters, with default BIOS settings. Nice.

Other kernels are awful. 4.18.x can only be used with BIOS settings or command-line options.

But in 4.18 the entire power management was redone. Take a look at "Linux Weather Forecast".

At the moment 4.18.7 is marked as stable. I am testing today and have no results yet.

T.

0 Likes

Yeah, the diversity of Linux distros, almost always with different kernel versions, is not really helping in this case.

Problems and solutions are given for X (hardware) and Y (distro), which have a different Z (kernel).

And Z usually has specific compile options for Y, but also could or should have options for X.

That is difficult to manage when something is not working with new X, especially when you are not a kernel specialist.

One thing is sure: it does not always work out of the box.

And I still think that AMD must take the lead and guide us out of the misery.

0 Likes

Here is a recent discussion between Jon Bach from Puget Systems(stability oriented system builder) & Wendell (Linux guy) from Level1Techs.

They talk a bit about stability & the wobbling progress in CPU power management, in general, in 2018.

LIVE: Level1Techs with Jon Bach of Puget Systems! - YouTube

0 Likes

I've had the same problem described here with a Ryzen 5 1600 on an Asrock X470 Gaming K4 mainboard, microcode update 800F11/8001137. The system stopped while idle every one or two days. I set the "power supply idle control" to "typical current idle" 20 days ago and no hang since. So, this fixed it.

What I cannot confirm is that some Linux kernels work better than others. I've tried all kernels that came with Debian/testing from 4.14 to 4.18.6 and they all hung.

0 Likes
lights_a5
Journeyman III

Came here to say that I as well have similar problems and think this bug is the problem.

I'm running Arch Linux with Gnome desktop.

Hardware:

Motherboard: ASUS X370-Pro

Processor: Ryzen 7 1700

Graphics Card: RX 480

Kernel: Linux 4.18.9

I updated the BIOS and tried the suggested workaround of setting the power supply idle control to typical. A day later I got a hang, with less than 3 hours of uptime, after I had been idle for about 10 minutes. I came back, moved my mouse to a text field in Firefox and it hung.

I am very disappointed in this.

0 Likes

Can you, please, check using the zenstates.py script if "C6 State - Package" really is disabled after you changed the power supply idle control in the BIOS?

If it's indeed disabled, then you seem to me to be the first one speaking up here whose system still hangs when idle. For all others, this worked.

0 Likes

I ran zenstates and it appears to be enabled still. I went ahead and used the script to disable the C6 State. I guess we will see what happens now.

0 Likes

What the "Typical Current Idle" option does is secret :-(

On my machine (Ryzen 7 1800X, Asus X370-Pro, BIOS 4012), zenstates.py tells me:

  Low Current Idle    : C6 Package Enabled   : C6 Core Enabled

  Typical Current Idle: C6 Package Disabled  : C6 Core Enabled

  Auto                : C6 Package Enabled   : C6 Core Enabled

and all three have the same three P-States:

  P0: FID=90 DID=8 VID=20 Ratio=36.00 vCore=1.35000

  P1: FID=80 DID=8 VID=2C Ratio=32.00 vCore=1.27500

  P2: FID=84 DID=C VID=68 Ratio=22.00 vCore=0.90000

[I thought that "Typical Current Idle" might be fiddling with P2, but that does not seem to be the case.]
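As a cross-check on the listing above, the Ratio and vCore columns can be reproduced from the raw hexadecimal FID/DID/VID fields. The sketch below assumes the conversion formulas that, as far as I can tell, zenstates.py uses (ratio = 2 x FID / DID, vCore = 1.55V - 0.00625V x VID); the function name `decode_pstate` is mine. The formulas do reproduce all three lines quoted above.

```python
def decode_pstate(fid_hex, did_hex, vid_hex):
    """Convert hex FID/DID/VID strings into (ratio, vcore_volts).

    Assumed formulas (believed to match zenstates.py, not an AMD reference):
      ratio = 2 * FID / DID
      vcore = 1.55 - 0.00625 * VID
    """
    fid = int(fid_hex, 16)
    did = int(did_hex, 16)
    vid = int(vid_hex, 16)
    ratio = 2.0 * fid / did
    vcore = 1.55 - 0.00625 * vid
    return ratio, vcore


if __name__ == "__main__":
    # The three P-states from the listing above.
    for label, fields in (("P0", ("90", "8", "20")),
                          ("P1", ("80", "8", "2C")),
                          ("P2", ("84", "C", "68"))):
        ratio, vcore = decode_pstate(*fields)
        print("%s: Ratio=%.2f vCore=%.5f" % (label, ratio, vcore))
```

If the ratio is a multiplier on a 100 MHz reference clock (an assumption on my part), P0 corresponds to 3.6 GHz, which is at least consistent with the 1800X base clock.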

I imagine there are many more parameters I could look at, if only I knew more. For completeness, I leave all other BIOS options in their default state.

While I was checking the effect of the "Current Idle" options, I noticed something peculiar about setting "Typical Current Idle". I started with:

0)                       "Typical Current Idle"   was: C6 Package Disabled

and then:

1) reboot into BIOS, set "Low Current Idle",     gave: C6 Package Enabled

2) reboot into BIOS, set "Typical Current Idle", gave: C6 Package Enabled !

3) shutdown and restart,                         gave: C6 Package Disabled !

So... after a cold boot, you may find that "Typical Current Idle" has more effect.

0 Likes

In my experience, "freeze when idle" is when I leave the machine idle (typically overnight) and when I return to it, it responds to nothing -- the machine is still on, but the mouse, keyboard and network are all (apparently) ignored.

What you describe, however, is the machine responding to the mouse when you returned to it, but then freezing -- which is not quite the same.

Having read 196683 – Random Soft Lockup on new Ryzen build​, it seems to me that there may be two or more problems which result in a frozen system.  Indeed, the initial report in that thread sounds more like the problem you report. This "random soft lockup" is characterized by logging messages of the form:

   NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]

which appear, possibly many times, before the system freezes completely.

The "freeze when idle", on the other hand, has no precursor messages... the system just freezes (and logging just stops).
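Since the two symptoms differ mainly in whether anything is logged, a quick scan of a saved kernel log (for example the output of `journalctl -k -b -1`, if persistent journaling is enabled) can help tell them apart. This is only a sketch: the regular expression is modelled on the single message quoted above, and `find_soft_lockups` and `lockups_per_cpu` are names chosen for illustration.

```python
import re
from collections import Counter

# Modelled on the message quoted above:
#   NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return a list of (cpu, seconds, task_name, pid) for each soft-lockup line."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append((int(m.group("cpu")), int(m.group("secs")),
                           m.group("comm"), int(m.group("pid"))))
    return events


def lockups_per_cpu(events):
    """Count soft-lockup events per CPU number."""
    return Counter(cpu for cpu, _secs, _comm, _pid in events)
```

If the log from the boot that froze contains no matches at all, that points towards the silent "freeze when idle" rather than the "random soft lockup".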

Many moons ago I did see "random soft lockup" (complete with logging messages), but that went away, and I just got the "freeze when idle".  Sadly, I have no idea why the "random soft lockup" went away. 

It could be that "freeze when idle" and "random soft lockup" are related problems, or even variants of the same problem.

Or it could be that these are different problems, which could be why "Typical current idle" does not work for you.

Who knows ?

AMD and/or the Linux Kernel folk might know -- but they have taken a vow of silence :-(

0 Likes
zankuro
Journeyman III

Hi. I have an Asus ROG Strix notebook with a Ryzen 7 1700, and it is freezing on idle too. My CPU was manufactured in the 46th week of 2017 (UA 1746PGS; yes, I need to repaste my CPU), so there is no segfault bug here, but there is the idle freeze. I'm on Linux (Ubuntu 18.04.1). As it is a notebook, the "Typical current idle" option does not exist, so I tried several other things: various kernel versions (from 4.15 to 4.19-rc5), parameters, etc. So far, what has resolved it was compiling a custom kernel (I compiled 4.18.9) with the option "CONFIG_RCU_NOCB_CPU" together with the parameter "rcu-nocbs=0-15". I confirm that this solves the problem. When Ubuntu 18.10 is released, instead of compiling a custom kernel, I'll just try to disable the C6 state with zenstates.py.

0 Likes

Before "Typical Current Idle" I suffered freezes on various kernel versions.   My impression is that "CONFIG_RCU_NOCB_CPU=y" + "rcu-nocbs=0-15" + "zenstates.py --c6-package-disable" reduced the frequency of freezes -- the longest I went without a freeze was ~12 days.

In my experience, it is possible to know when something does *not* solve the problem -- the machine freezes.

On the other hand, with "Typical Current Idle" it is ~5 months since I last saw a "freeze when idle" -- so far, so good.

0 Likes

imshalla​, There is a new BIOS update for the Asus X370 Pro. Any luck with it in its default settings ?

PRIME X370-PRO BIOS & FIRMWARE | Motherboards | ASUS USA

Version 4024 2018/09/28          8.16 MBytes

PRIME X370-PRO BIOS 4024
1. Improve system performance

More tight-lipping here too. They never clearly say what they fixed.

0 Likes

Omertà :-(

As noted above, I had a look yesterday to see what I could discover about the "Power Supply Idle Control" options.

Surprisingly, for BIOS 4012 the "Typical Current Idle" option does not take (full) effect until after a cold boot -- or, at least, C6 Package is not disabled until after the cold boot !!

I have just installed 4024.  As far as I can tell:

  1. the "Auto" (default) setting for "Power Supply Idle Control", is still "Low Current Idle".
  2. but setting "Typical Current Idle" no longer requires a cold boot -- hurrah !

The effect of the options appears unchanged, so I will continue with "Typical Current Idle".

samx
Adept II

Guys, I'm having the same problem of freezing and restarts when I do simple tasks like watching YouTube or when the machine is just idle. I even changed my PSU, but the problem is the same. BUT this problem started after I updated to Windows 10 1809 (the version released in October 2018). I am reading about disabling C-states and the "typical current idle" setting. Can this problem happen on Windows too? If I remember correctly, even my Windows installation hung (the spinning dots just stopped moving) when I tried to install it 3 days ago, and the machine was restarting every 30 minutes. Since yesterday it has only restarted twice in a day while doing light tasks.

I have a Ryzen 1600 + Gigabyte B450 motherboard. Please suggest what to do. I was almost going to RMA my board until I read this thread. Can the CPU shut down during a Windows install due to the "current idle" thing? (I also did a BIOS update a few days ago, after which this started; I have reverted to the old BIOS, but it still restarted once yesterday.) I really appreciate your replies.

0 Likes

The fault which is the topic of this discussion is:

  1. believed to be Linux specific... Windows is thought not to be affected in the same way (or to have been updated long ago to avoid the fault).
  2. affects machines (running Linux) when they go idle, not when they are busy doing something.

so I really don't think the BIOS "Typical Current Idle" option is likely to help you.

I have seen it suggested that the Windows "Core Parking" mechanism is, by default, disabled (at least for Ryzen).  And that may be why Windows does not suffer the "freeze when idle" problem.

You say your problems started with a Windows 10 update, which suggests that is the more likely cause... particularly if previous versions of Windows 10 worked OK for you.

0 Likes

Arghh... My PC froze while idle and rebooted, and when I booted into the BIOS, it froze in the BIOS too. It's cutting the power to my CPU randomly. No idea why... so frustrating, man... Should I RMA the board?

0 Likes

Sounds like a sick machine -- could be almost anything.

Your first posting suggested that the problems started when you upgraded to Windows 10 "1809".

If the machine had worked satisfactorily for some time before the upgrade, the obvious thing would be to go back to the pre-upgrade state (which would include the previous PSU).

If the machine has never really worked properly, then I guess you will need to try to establish which component is faulty... CPU, cooling, PSU, RAM, video card, drive(s), SSD, motherboard, etc.

But I regret I am unable to offer any further advice.

Thanks for sorting this out. Finally my small home server, with several KVM virtual servers, has been running for longer than a day without dying a silent death.

ASUS Prime X370-PRO, AMD Ryzen 5 1600X, BIOS 4024, running openSUSE Leap 15. As you suggested, I changed "Power Supply Idle Control" to "Typical Current Idle".

These are the messages in warn.log that led me to this thread:

2018-10-16T19:37:50.334315+02:00 xxxxx kernel: [0.136563] mtrr: your CPUs had inconsistent variable MTRR settings
2018-10-16T19:37:50.334321+02:00 xxxxx kernel: [0.141191] ACPI Error: Needed [Integer/String/Buffer], found [Region] ffff880187d94af8 (20170303/exresop-424)
2018-10-16T19:37:50.334321+02:00 xxxxx kernel: [0.141197] ACPI Exception: AE_AML_OPERAND_TYPE, Could not execute arguments for [IOB2] (Region) (20170303/nsinit-412)
2018-10-16T19:37:50.334717+02:00 xxxxx kernel: [1.479087]  PPR NX GT IA GA PC GA_vAPIC
2018-10-16T19:37:50.334758+02:00 xxxxx kernel: [1.514219] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
0 Likes

Just chipping in, AMD Ryzen 7 1800X on Gigabyte X470 AORUS Gaming 7 WiFi, I had the same issue using both Windows 10 and Linux Mint/Cinnamon 19. The above fix worked: changing CPU Idle Power to "Typical Current Idle" in the most recent BIOS (F4 08/08/2018) resolved system hangs while idling overnight.

So this continues to be a problem with Ryzen setups, but the "fix" appears to be a solid one. Thanks for all the work you've done, it's appreciated.

0 Likes

My board is exactly the same as yours, except it is not the WiFi version, just the X470. I changed my BIOS (version F4E) to "Typical Current Idle" today, and it still freezes. Then I added the Linux kernel boot parameter idle=nowait, and it still freezes.

Yesterday the screen shook like this when it froze:

Screen when CPU freezes - YouTube

0 Likes
uyuy
Adept II

Dear all,

I am using a Ryzen 2700X (8 cores) with Kubuntu 18.10, Linux kernel 4.18.0-11-generic, on a Gigabyte X470 board whose BIOS was originally F3.

I run VMware Workstation with 6 virtual machines, all servers. The lockups mostly affect Xorg and affect my VM servers much less, and most of the time I can still ssh into this box.

Xorg locks up, with the screen frozen and the mouse and keyboard dead, after about 15 minutes, while the VM servers inside this box can run for the whole day, until night time when no users are accessing them; then the servers lock up too. Yet I almost never fail to be able to ssh into this box.

My best workaround so far is to log in as root and restart the display manager to bring the box back to life. This way my virtual machines still run their shutdown scripts. I have scripts to auto-start the virtual machines when my desktop restarts.

sudo systemctl restart sddm

I would like to ask the forum members whether you have encountered any way to DETECT the hang. I suspect that somewhere in /sys, /proc or dmesg we can find something that reveals the freeze, and then use a script to restart automatically as a temporary workaround.

So far, I can only find a syslog warning that says some threads have not responded for more than 120 seconds. The threads in my case are all kworker threads.

I tried a newer BIOS version, F4E, and the freezes became even more frequent! I tried the zenstates.py script, which made no difference for me. I disabled the C6 power state in the BIOS: no difference either. The next thing I will try is the kernel parameters.

Thanks

uy

0 Likes

I became aware of a discussion of the same issue on another forum, reddit.com, and there is apparently useful information there.

Please refer to this URL, which I found by searching:

https://www.reddit.com/r/Amd/comments/8yzvxz/ryzen_c6_state_sleep_power_supply_common_current/

Disabling AMD Cool'n'Quiet may help. I have yet to try it myself; I will try it at the next reboot.

0 Likes

I found this

What are the CPU c-states? How to check and monitor the CPU c-state usage in Linux per CPU and core?...

& this

cpupower-monitor - Report processor frequency and idle statistics - Linux Man Pages (1)

And I strongly believe that some kind of monitoring is possible using the information under

/sys/devices/system/cpu/cpu*/cpuidle/state*/

and that somehow a script could be written to bring the sleeping CPUs back to life. I don't know how to do this yet, but could members of this forum work something out and share it, please?
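On the monitoring half of that question, here is a rough Python sketch that walks the cpuidle directories and returns, per CPU, each idle state's index, name and usage counter. It relies only on the `name` and `usage` files described in the kernel's sysfs-devices-system-cpu ABI document; the function name `read_idle_states` and the `cpu_root` parameter are illustrative. Note that it only observes: I know of no way for a script to revive a CPU that has hard-frozen, since the script would be frozen along with it.

```python
import os
import re


def read_idle_states(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: [(state_index, state_name, usage_count), ...]}.

    CPUs without a cpuidle directory are skipped. States are sorted by index.
    """
    result = {}
    for cpu in os.listdir(cpu_root):
        if not re.fullmatch(r"cpu\d+", cpu):
            continue  # skip cpufreq, cpuidle, online, etc.
        idle_dir = os.path.join(cpu_root, cpu, "cpuidle")
        if not os.path.isdir(idle_dir):
            continue
        entries = []
        for state in os.listdir(idle_dir):
            m = re.fullmatch(r"state(\d+)", state)
            if not m:
                continue
            with open(os.path.join(idle_dir, state, "name")) as f:
                name = f.read().strip()
            with open(os.path.join(idle_dir, state, "usage")) as f:
                usage = int(f.read())
            entries.append((int(m.group(1)), name, usage))
        result[cpu] = sorted(entries)
    return result
```

Sampling this twice, some minutes apart, and comparing the usage counters shows which idle states the CPUs actually entered in between.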

0 Likes

I wish I could help you here :-(

The "freeze-while-idle" problem I have seen, for which the "Typical Current Idle" BIOS option appears to be a fix (or at least a work around), brings the entire CPU to a complete, dead stop.  Once frozen, nothing will kick the CPU back to life short of a hardware reset or power cycle.

What you describe appears different.  In particular, you can ssh in to the box and restart stuff.

It seems to me that the root cause of the "freeze-while-idle" I have seen, is some problem with the management of power once the CPU has entered a deep sleep, such that it cannot be re-awoken.  Which feels like a deep hardware problem.

A problem which causes streams of "watchdog: BUG: soft lockup" events to be logged has also been seen.  It is not clear whether that's related to the "freeze-while-idle" or whether that's a Kernel level problem.

There is talk of "idle=nomwait" being useful.  I believe that MWAIT is the way to enter the C6 state, so disabling MWAIT appears to be another way of disabling C6, which doesn't seem to advance the art.  I also believe that for virtual machines MWAIT is effectively a call from client OS to the host, and that "idle=nomwait" is not a good idea for client OSes.

Anyway... I am not convinced that the problem you are seeing is the "freeze-while-idle" problem I have seen, and hence it doesn't surprise me that the "Typical Current Idle" BIOS option (and the related C6 and MWAIT voodoo) has not fixed things.  Sadly, this is probably of little use to you, except, perhaps, to encourage you to look elsewhere for a solution :-(

0 Likes

Thanks for responding Imshalla.

I tried a new BIOS version and BIOS setting (the new BIOS made things worse; the Typical Current Idle setting improved things a bit), and the kernel boot parameter (idle=nowait) seems to make no difference.

The improvement I have achieved so far is that it no longer freezes while the VM servers sit idle overnight, as it consistently did before. But now it freezes within 1~2 minutes if I start to operate the UI, moving the mouse etc.; the mouse cursor locks up.

From the post below by SKULL on Christmas Eve, I gathered that I may be having 2 different types of problem: one is idle current, and the other could be a sudden surge of threads causing the power supply voltage to drop. I suspect this is what happened when it froze while I operated the server GUI. If what SKULL said applies to me, raising the CPU voltage a little might avert some lockups. This is what I will try after Christmas.

In most of my lockup cases, the machine still responds to ssh logins, and gets back to normal after this command, run as root:

sudo systemctl restart sddm

which restarts the KDE display manager. Only quite occasionally have we had to hit the motherboard reset button.

The only clue found in /var/log/syslog is that a kworker thread took more than 120 seconds to respond, or something similar.

0 Likes

Dear all,

So it is Boxing Day, 26 Dec 2018. Today I tried raising the BIOS CPU voltage from my CPU's (2700X) default of 1.050V to 1.120V, to test SKULL's suggestion.

No noticeable improvement so far. After a few hours of running, operating the GUI directly from the console with the mouse locked it up twice. My favorite command still gets me out of the frozen state, at the expense of disrupting the 6 virtual machine servers inside:

sudo systemctl restart sddm

My web search on Christmas Day turned up this document:

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu

With the better understanding gained from that document, I checked and confirmed that my current BIOS settings and kernel boot parameters restrict all 16 of my logical CPUs to cpuidle states 0, 1 and 2, which means POLL, C1 and C2 only. This is how I checked:

$ cat /sys/devices/system/cpu/cpu*/cpuidle/state*/name

POLL
C1
C2

[the same three lines repeat for each of the 16 logical CPUs]

I am now quite sure that this system is not going into the C6 power state.

The same command on another (Intel) system showed that it can go into the POLL, C1, C1E, C3 and C6 power states.

Hence I can now confirm that the improvement over my original situation (consistently frozen overnight, which no longer happens) was due to the BIOS settings and kernel boot parameters keeping the CPUs away from the C6 power state. I am sharing this test method so other members here can verify whether their CPUs go into C6 or not.

On my other Intel system I can see the number of times each CPU has entered the C6 power state with this command:

$ cat /sys/devices/system/cpu/cpu*/cpuidle/state4/usage
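One caveat with the command above is that the state index (state4 here) differs between machines, so looking a state up by name is a little more portable. A minimal sketch, with `usage_for_state` as a made-up helper name: it returns the usage counter of the named state for every CPU that exposes one. Bear in mind that the names come from the idle driver, and as far as I understand a state listed as "C2" by the ACPI driver does not necessarily correspond one-to-one to a hardware C-state, so the absence of a "C6" entry may not be conclusive on its own.

```python
import glob
import os


def usage_for_state(target, cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: usage_count} for the idle state named `target`.

    CPUs that do not list a state with that name are left out of the result.
    """
    usage = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in glob.glob(pattern):
        with open(os.path.join(state_dir, "name")) as f:
            if f.read().strip() != target:
                continue
        # state_dir looks like <cpu_root>/cpuN/cpuidle/stateM
        cpu = os.path.basename(os.path.dirname(os.path.dirname(state_dir)))
        with open(os.path.join(state_dir, "usage")) as f:
            usage[cpu] = int(f.read())
    return usage
```

An empty result from `usage_for_state("C6")` means no CPU lists a state with that name; a non-empty one shows how often each CPU has entered it since boot.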

Hence I am now also rather sure that I still face another kind of freeze, which has nothing to do with C6 any more: the one that happens quite soon after I operate the system console with the mouse. My mouse cursor freezes and the whole GUI freezes. But all VMs and SSH are still OK; I can SSH in and issue my favorite command to restart the display manager and get back to normal. For this kind of freeze I still believe what SKULL posted: when GUI operation launches lots of threads, the CPU core supply voltage might drop and cause the freeze. I will email my supplier, Gigabyte, to verify this.

0 Likes
skull
Adept I

All,

Decided to chime in on this, as this appears to be one of the longest-running and most focused threads on this issue, and is maybe seen by AMD, who should really respond to it.

First, some background. We deployed 15 AMD Ryzen systems in 2017 using various kernels, and they all ran idle some of the time and under vastly varying loads most of the time without any stability issues. What all 15 had in common is that, since they needed every bit of performance, we modestly overclocked them (by about 200 MHz) and, I think most importantly, after seeing a few lockups at 1.35V, set the CPU voltage to 1.40V. This has had a very good track record (a year-plus now, over many systems). I also disabled "Cool and Quiet", as it has a history of problems with Linux going back to before Ryzen. Besides those changes, there were no other BIOS changes or kernel parameters. All these machines were Ryzen 1800, 1800X, or 1600. Most had no graphics and ran in text mode only.

Move ahead to this year: due to updates to our software we no longer needed to overclock, and figured this would save power and generate less heat (very important for 1 of the installs). These systems run processes whose load varies a lot, all the way from idle (rarely) to running flat out. All 6 machines that we have delivered in this config have locked up infrequently, with no log messages and no ping, needing a reset or power cycle to recover; one of the systems locks up frequently (every few days) even though we swapped the motherboard and CPU. I have not tried the fix that seems to work for many here (the Power Supply Idle Control setting). Some of these were Ryzen 2700.

Much of what we are experiencing harkens back to the early Haswell days. It is my understanding, although this is poorly documented by AMD, that the Ryzen/TR/EPYC chips utilize a third level of voltage regulation, on a substrate above the die, that allows the chip (independent of C or P states) to set the voltage that each core sees. This is similar to what Haswell did, and Intel moved away from it after Haswell (mostly due to heat problems on die). So for the CPU to see the right voltage, it takes 12V from the system power supply, regulates this down to the P-state CPU voltage (typically 1.35 to 0.9V), and can then regulate it further down to whatever is needed (some say as low as 0.6V). This is all fine, but the problem it creates is that when you go from idle or near idle to near full load, the regulators must respond very quickly, and if this dance does not happen right the CPU is voltage-starved and hard locks. Doing this dance was an issue for Intel with Haswell, and supposedly through both microcode and BIOS changes they eventually got it right.

I think AMD has to learn how to dance correctly!

I have some evidence of this: I connected a scope to the 12V rail of an EVGA 750W supply, and at times the voltage would quickly drop to as little as 10V (a 2V drop) if I started a process that instantly ramped up 16 flat-out threads. With an Intel CPU (4 cores), going quickly from idle to max resulted in a drop of less than 0.5V, spread over a bit more time.

Going back to overclocking: from what I have read, overclocking disables the internal per-core VRMs on Ryzen, leaving just the standard motherboard VRM, and it also sets this VRM to whatever you tell it as the only P-state. I also set the VRM voltage to the max, 1.40V. As a result static power is high (hot even when doing nothing), but as I noted above the system seems stable regardless of load. Essentially it simplifies the dance, turning a tango into a simple walk across the floor.

The BIOS option noted in this thread may indeed do much more than limit C-states. I am not sure, and I am not sure how it helps with vastly varying CPU loads causing the hard lock, which I believe is similar to the idle problem.

I plan to experiment more after the holidays and report what I find here.

0 Likes

In March of this year AMD advised me that for Ryzen they specify a PSU which can deliver 12V at 0A.  Which adds to the feeling that this is a hardware problem.  Mind you, having such a PSU did not solve the problem for me.  Also, disabling package C6 did not solve the problem.  But I have seen no "freeze-while-idle" since setting the "Typical Current Idle" BIOS option.

I have seen conflicting reports as to whether the 2xxx Ryzen suffers in the same way.  I have no direct experience of that CPU.  (And no urgent desire to buy another AMD, however tempting the 3xxx devices may be.)

I have seen reports of avoiding the problem by tweaking voltages and other overclocking dark arts.

I'm intrigued by the idea that it could be some sort of inrush issue... but in my case, the load when waking up the system would be just enough to wake the screen up and respond to mouse/keyboard, or enough to accept an ssh connection.

0 Likes

For what it is worth, I just installed a new power supply and the latest ASUS Prime X370-Pro BIOS (PRIME X370-PRO BIOS 4207).

If there is a need, I am willing to set the BIOS setting "Power Supply Idle Control" back to default and see what happens. At the moment the server has been running without any issues since my last posting, which was in October (no more system lockups).

The new power supply is a Corsair RM850x.

The server is holding 4 VM's running 24/7.

No overclocking or other tuning as I only need stability.

0 Likes

Care to tell me which motherboard you have?

0 Likes

Motherboard Asus Prime X370-Pro

0 Likes

Yup. It's your motherboard. I had the same problem, researched it for months, and then changed the motherboard to an Asus TUF B450 PLUS and boom! Problem gone. It has nothing to do with the Ryzen CPU or Linux. My idle freezes would happen on Windows and even in the BIOS!! A lot of B350/370 ASUS and ASRock boards have this problem. It's a motherboard fault. You can work around it with some BIOS settings, but ultimately it's your mobo at fault. Save your time and energy and either RMA the board under warranty or get a cheap A320 board and see your problem disappear.

I have not had any issues since I set the BIOS value "Power Supply Idle Control" to "Typical Current Idle". So for me there is no need to exchange the board.