Processors

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

The fault which is the topic of this discussion is:

  1. believed to be Linux specific... Windows is thought not to be affected in the same way (or to have been updated long ago to avoid the fault).
  2. affects machines (running Linux) when they go idle, not when they are busy doing something.

so I really don't think the BIOS "Typical Current Idle" option is likely to help you.

I have seen it suggested that the Windows "Core Parking" mechanism is, by default, disabled (at least for Ryzen).  And that may be why Windows does not suffer the "freeze when idle" problem.

You say your problems started with a Windows 10 update, which suggests that is the more likely cause... particularly if previous versions of Windows 10 worked OK for you.

samx
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Arghh... My PC froze while idle and rebooted, and when I booted into the BIOS it froze there too. It's cutting the power to my CPU randomly. No idea why... so frustrating, man... Should I RMA the board?

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Sounds like a sick machine -- could be almost anything.

Your first posting suggested that the problems started when you upgraded to Windows 10 "1809".

If the machine had worked satisfactorily for some time before the upgrade, the obvious thing would be to go back to the pre-upgrade state (which would include the previous PSU).

If the machine has never really worked properly, then I guess you will need to try to establish which component is faulty... CPU, cooling, PSU, RAM, video card, drive(s), SSD, motherboard, etc.

But I regret I am unable to offer any further advice.

bthruxton
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Thanks for sorting this out. Finally my small home server, with several KVM virtual servers, has been running for longer than a day without dying a silent death.

ASUS Prime X370 PRO, AMD Ryzen 5 1600X, BIOS 4024, running openSUSE Leap 15. As you suggested, I changed "Power Supply Idle Control" to "Typical Current Idle".

These are the messages in warn.log that led me to this thread:

2018-10-16T19:37:50.334315+02:00 xxxxx kernel: [0.136563] mtrr: your CPUs had inconsistent variable MTRR settings
2018-10-16T19:37:50.334321+02:00 xxxxx kernel: [0.141191] ACPI Error: Needed [Integer/String/Buffer], found [Region] ffff880187d94af8 (20170303/exresop-424)
2018-10-16T19:37:50.334321+02:00 xxxxx kernel: [0.141197] ACPI Exception: AE_AML_OPERAND_TYPE, Could not execute arguments for [IOB2] (Region) (20170303/nsinit-412)
2018-10-16T19:37:50.334717+02:00 xxxxx kernel: [1.479087]  PPR NX GT IA GA PC GA_vAPIC
2018-10-16T19:37:50.334758+02:00 xxxxx kernel: [1.514219] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

lunam
Journeyman III

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Just chipping in, AMD Ryzen 7 1800X on Gigabyte X470 AORUS Gaming 7 WiFi, I had the same issue using both Windows 10 and Linux Mint/Cinnamon 19. The above fix worked: changing CPU Idle Power to "Typical Current Idle" in the most recent BIOS (F4 08/08/2018) resolved system hangs while idling overnight.

So this continues to be a problem with Ryzen setups, but the "fix" appears to be a solid one. Thanks for all the work you've done, it's appreciated.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear all,

I am using an 8-core Ryzen 2700X with Kubuntu 18.10 (kernel 4.18.0-11-generic) on a Gigabyte X470 board; the BIOS was originally F3.

I run VMware Workstation with 6 virtual machines, all servers. The lockups mainly hit Xorg and affect my VM servers much less; most of the time I can still ssh into this box.

X.org locks up after about 15 minutes, with the screen frozen and the mouse and keyboard dead, while the VM servers inside this box can run for the whole day. At night, when no users are accessing them, the servers lock up as well. Even so, I can almost always ssh into the box.

My best workaround so far is to log in as root and restart the display manager to bring the box back to life. This way my virtual machines still run their shutdown scripts. I have scripts to auto-start the virtual machines when my desktop restarts.

sudo systemctl restart sddm

I would like to ask the forum members whether you have encountered any way to DETECT the hang. I suspect that somewhere in /sys, /proc or dmesg we can find something that reveals the freeze, and then use a script to restart automatically as a temporary workaround.

So far, I can only find a syslog warning saying that some threads have not responded for more than 120 seconds. In my case they are all kworker threads.
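For anyone who wants to try the detection idea, here is a minimal sketch in Python that scans kernel log lines for the hung-task warning described above, plus the soft-lockup warning this thread is named after. The two message patterns are assumptions based on the usual kernel wording and may differ between kernel versions. The function names are made up for illustration, and reading `dmesg` may need root on some systems.

```python
import re
import subprocess

# Assumed wording of two common kernel warnings; adjust if your kernel differs.
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def find_hang_reports(lines):
    """Return (kind, who, seconds) tuples for every hang warning found in lines."""
    reports = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            reports.append(("hung_task", m.group(1), int(m.group(3))))
            continue
        m = SOFT_LOCKUP.search(line)
        if m:
            reports.append(("soft_lockup", "cpu" + m.group(1), int(m.group(2))))
    return reports


def read_kernel_log():
    """Read the kernel ring buffer via dmesg; return [] if dmesg cannot be run."""
    try:
        out = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    except OSError:
        return []
    return out.stdout.splitlines()


def main():
    # Print one line per warning; a watchdog script could act on these instead.
    for kind, who, seconds in find_hang_reports(read_kernel_log()):
        print(f"{kind}: {who} stuck for {seconds}s")
```

Calling `main()` from cron or a systemd timer would give a crude detector. Whether it is safe to restart sddm automatically on a hit is a separate question that this sketch does not answer.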

I tried a newer BIOS version, F4E, and the freezes became even more frequent! I tried the zenstate.py script, which made no difference for me. I disabled the C6 power state in the BIOS, which made no difference either. The next thing I will try is kernel boot parameters.

Thanks

uy

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I became aware of another discussion of the same issue on reddit.com, and it appears to contain useful information.

Please refer to this URL, which I found by searching:

https://www.reddit.com/r/Amd/comments/8yzvxz/ryzen_c6_state_sleep_power_supply_common_current/

Disabling AMD Cool'n'Quiet may help. I have yet to try it myself; I will try it at the next reboot.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

My board is exactly the same as yours, except it is not the WiFi version, just the X470. I changed my BIOS (version F4E) to "Typical Current Idle" today, and it still freezes. Later I added the Linux kernel boot parameter idle=nowait, and it still freezes.
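When testing boot parameters it is worth confirming what the running kernel actually received. Below is a small sketch in Python that pulls the `idle=` value out of a command line string such as the contents of `/proc/cmdline`. The set of accepted values (`poll`, `halt`, `nomwait`) is my understanding of the x86 kernel parameter documentation and should be treated as an assumption; a value outside that set may not have the intended effect. The function name is made up for illustration.

```python
# Values documented for the x86 "idle=" boot parameter (assumption; check the
# kernel-parameters documentation for your kernel version).
KNOWN_IDLE_VALUES = {"poll", "halt", "nomwait"}


def idle_param(cmdline):
    """Return (value, recognised) for the first idle= token, or (None, False)."""
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token.split("=", 1)[1]
            return value, value in KNOWN_IDLE_VALUES
    return None, False


def check_running_kernel(path="/proc/cmdline"):
    """Read the running kernel's command line and report its idle= setting."""
    with open(path) as fh:
        return idle_param(fh.read())
```

For example, `idle_param("quiet idle=nomwait")` reports the value as recognised, while a misspelt value is returned with `False` so it can be double-checked.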

Yesterday the screen shook like this when it froze:

Screen when CPU freezes - YouTube

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I found this

What are the CPU c-states? How to check and monitor the CPU c-state usage in Linux per CPU and core?...

& this

cpupower-monitor - Report processor frequency and idle statistics - Linux Man Pages (1)

And I strongly believe that the information under

/sys/devices/system/cpu/cpu*/cpuidle/state*/

makes some kind of monitoring possible, and that a script could somehow bring the sleeping CPUs back to life. I don't know how to do this yet. Could members of this forum work something out and share it, please?
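As a starting point for that kind of monitoring, here is a minimal sketch in Python that reads the per-CPU idle state counters from the sysfs path above. It assumes each state directory contains the usual `name`, `usage` (entry count) and `time` (microseconds spent in the state) files. The function name and the `root` parameter are made up for illustration. It only reads counters; it does not wake or reset anything.

```python
from pathlib import Path


def read_cpuidle(root="/sys/devices/system/cpu"):
    """Return {cpu: {state_name: (usage_count, time_us)}} from cpuidle sysfs files."""
    stats = {}
    for state_dir in sorted(Path(root).glob("cpu[0-9]*/cpuidle/state[0-9]*")):
        cpu = state_dir.parent.parent.name  # e.g. "cpu0"
        try:
            name = (state_dir / "name").read_text().strip()
            usage = int((state_dir / "usage").read_text())
            time_us = int((state_dir / "time").read_text())
        except (OSError, ValueError):
            continue  # skip states whose files are missing or unreadable
        stats.setdefault(cpu, {})[name] = (usage, time_us)
    return stats


def print_report():
    """Print one line per CPU and idle state."""
    for cpu, states in read_cpuidle().items():
        for name, (usage, time_us) in states.items():
            print(f"{cpu} {name}: entered {usage} times, {time_us} us total")
```

Sampling `read_cpuidle()` twice and comparing the counters shows which states each core is entering. Each state directory also has a `disable` file that root can write `1` to so the kernel stops choosing that state; whether that avoids the freeze discussed here is untested.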

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

All,

I decided to chime in, as this appears to be one of the longest-running and most focused threads on this issue, and it is maybe (??) seen by AMD, who should really respond to it.

First, some insight. We deployed 15 AMD Ryzen systems in 2017 using various kernels. They all ran idle some of the time and under vastly varying loads most of the time, without any stability issues. What all 15 had in common is that, since they needed every bit of performance, we modestly overclocked them (by about 200 MHz) and, I think most importantly, after seeing a few lockups at 1.35 V, set the CPU voltage to 1.40 V. This setup has a very good track record (a year or more now, across many systems). I also disabled "Cool and Quiet", as it has a history of problems with Linux going back before Ryzen. Apart from those changes, we made no other BIOS changes and set no kernel parameters. All these machines were Ryzen 1800, 1800X, or 1600. Most ran in text mode only, with no graphics.

Move ahead to this year. Due to updates to our software we no longer needed to overclock, and we figured running at stock would save power and generate less heat (very important for one of the installs). These systems run processes whose load varies a lot, all the way from idle (rarely) to running full out. All 6 machines that we have delivered in this configuration have locked up, infrequently: no log messages, no ping, and a reset or power cycle needed to recover. One of the systems locks up frequently (every few days), even though we swapped the motherboard and CPU. I have not tried the fix that seems to solve it for many here (the Power Supply Idle Control setting). Some of these were Ryzen 2700.

Much of what we are experiencing harkens back to the early Haswell days. It is my understanding, although this is poorly documented by AMD, that the Ryzen/TR/EPYC chips use a third level of voltage regulation, on a substrate above the die, that allows the chip (independent of C or P states) to set the voltage that each core sees. This is similar to what Haswell did, and what Intel moved away from after Haswell (mostly due to heat problems on the die). So for the CPU to see the right voltage, it takes 12 V from the system power supply, regulates this down to the P-state CPU voltage (typically 1.35 V to 0.9 V), and can then regulate it down further to whatever the core needs (some say as low as 0.6 V). This is all fine, but the problem it creates is that when you go from idle or near idle to near full load, the regulators must respond very quickly, and if this dance does not happen right the CPU is voltage starved and hard locks. Doing this dance was an issue for Intel with Haswell, and supposedly they eventually got it right through both microcode and BIOS changes.

I think AMD has to learn how to dance correctly!

I have some evidence of this. I connected a scope to the 12 V rail of an EVGA 750 W supply, and at times the voltage would quickly drop to as little as 10 V (a 2 V drop) if I started a process that instantly ramped up 16 threads running full out. With an Intel CPU (4 cores), going from idle to maximum load quickly resulted in a drop of less than 0.5 V, spread over a bit more time.

Going back to overclocking: from what I have read, overclocking disables the internal per-core VRMs on Ryzen, leaving just the standard motherboard VRM, and it also sets this VRM to whatever you tell it as the only P-state. I also set the VRM voltage to the maximum, 1.40 V. As a result, static power is high (hot even when doing nothing), but as I noted above the system seems stable regardless of load. Essentially it simplifies the dance, turning a tango into a simple walk across the floor.

The BIOS option noted in this thread may indeed do much more than limit C-states; I am not sure. I am also not sure how it would help with hard locks caused by vastly varying CPU loads, which I believe are similar to the idle problem.

I plan to experiment more after the holidays and report what I find here.
