
uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear all,

It is Boxing Day, 26 Dec 2018. Today I tried raising the BIOS CPU voltage from my CPU's (2700X) default of 1.050 V to 1.120 V, to test SKULL's suggestion.

No noticeable improvement so far: after a few hours of running, operating the GUI directly at the console with the mouse locked it up twice. My favorite command still gets me out of the frozen state, at the expense of disrupting the 6 virtual machine servers running inside:

sudo systemctl restart sddm
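
When the console is frozen I issue it over SSH from another machine, along these lines (the user and hostname here are placeholders):

$ ssh uy@ryzen-server
$ sudo systemctl restart sddm    # kills the X session and restarts the display manager; VMs and sshd keep running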

My search on the web during Christmas Day turned up this document:

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu

With the better understanding gained from the document above, I checked and confirmed that my current BIOS and kernel boot-up parameters force all 16 logical CPUs to use only cpuidle states 0, 1 & 2, which means POLL, C1 & C2 only. This is how I checked:

$ cat /sys/devices/system/cpu/cpu*/cpuidle/state*/name

POLL

C1

C2

(the same three states repeated for each of the 16 logical CPUs)

I am now quite sure that this system never enters the C6 power state.

The same command run on another (Intel) system showed it entering the POLL, C1, C1E, C3 & C6 power states.

Hence I can now confirm that the improvement over my original state (consistently frozen overnight, which no longer happens) was due to the BIOS and kernel boot parameters keeping the CPUs away from the C6 power state. I am sharing this test method so other members here can verify whether their CPUs go into C6 or not.

On my other Intel system I can see the number of times each CPU has entered the C6 power state with this command:

$ cat /sys/devices/system/cpu/cpu*/cpuidle/state4/usage
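
To pair each state's name with its entry count (and time spent) in one go, a small shell loop over the same sysfs files works; shown here for cpu0 only, adjust the glob to cover all CPUs:

$ for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
>   printf '%s: %s entries, %s us\n' "$(cat $d/name)" "$(cat $d/usage)" "$(cat $d/time)"
> done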

Hence I am also now rather sure that I still face another kind of freeze-up, one that has nothing to do with C6 any more: the one that happens quite soon after I start operating the system console with the mouse. The mouse cursor freezes and the whole GUI freezes, but all VMs and SSH still work; I can SSH in and issue my favorite soft-restart command to get the desktop back to normal. For this aspect I still believe what SKULL posted: when GUI operations launch lots of threads, the CPU core supply voltage might drop and cause the freeze-up. I will email my supplier Gigabyte to verify this.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear All,

Today my new discovery indicates that we may be heading in the wrong direction with regard to CPU core voltage and power states. It has to be something else.

[Screenshot: Ksysguard1.png, ksysguard CPU load graphs]

I used the famous Linux top command and ksysguard (image above), and I set an AMBUSH for the problem, waiting to solidly catch a process in the act of freezing.

And my chance came today. I caught my virtual-machine backup crontab jobs frozen at VMware's vmrun suspend command. Info:

https://docs.vmware.com/en/VMware-Fusion/11/com.vmware.fusion.using.doc/GUID-24F54E24-EFB0-4E94-8A07...

My cron jobs put each virtual machine into suspend mode and back it up to a hard disk. I got a clue a few days ago when I checked through my backups: the folder date/time stamps suggested that the usual backup jobs, which should all be done within 30 minutes, had on 2 occasions taken several hours! There was nothing else wrong besides the long time spent backing up late at night; the data seemed completely backed up. That means the lockups or freezes could unfreeze themselves and proceed to a long-delayed completion.
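
For reference, a minimal sketch of the kind of backup job described above (paths and VM names are placeholders; note vmrun has no separate resume command, so starting a suspended VM resumes it):

#!/bin/sh
# Hypothetical cron backup job: suspend the VM, copy its folder, start it again.
VMX=/vm/server1/server1.vmx

vmrun -T ws suspend "$VMX"               # flush the VM's state to disk
rsync -a --delete /vm/server1/ /backup/server1/
vmrun -T ws start "$VMX" nogui           # resume from the suspended state, headless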

So today, at my crontab job hour, I SSHed into this Ryzen machine, forwarded X, and ran ksysguard and top on a remote desktop. Yes, the cron job was frozen and the backup was not happening. I also used ps aux | grep crontab and similar commands, and confirmed that crontab was hanging, waiting for vmrun to suspend the VM; that command had simply frozen. It stayed frozen for almost 2 hours, then completed after this long delay! My script went further ahead to back up another virtual machine, and after backing it up it was supposed to resume the VM, but again the resume froze and took more than 1 hour. After this, even my ssh -X session died, and I could not reconnect.

During these hours, the top command and ksysguard showed me that other processes and threads were running: ALL my 16 logical (8 physical) CPUs were RUNNING! None of the CPU cores was frozen in C6 or any other power state while the thread hung for hours. Because of hyperthreading, each pair of logical CPUs comes from a single physical core, so if any physical core had locked up in a deep sleep state during these hours, the graphs of its 2 logical CPUs would have to die (drop to ZERO % usage); if 2 physical cores had locked up, then the graphs of 4 logical CPUs would have to die.
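
For anyone repeating this check, which logical CPUs share a physical core (and so whose graphs should flatline together) can be read from lscpu; the pairing shown is illustrative, yours may differ:

$ lscpu --extended=CPU,CORE
CPU CORE
0   0
1   0
2   1
3   1
...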

I am very sure of my observations; they were repeated twice during my AMBUSH mission today. I am very sure of how my scripts work and how vmrun works: this same setup and script have worked for more than 10 years, on older AMD and Intel machines. This Ryzen is a recent replacement for the retired old server.

I am now not inclined to believe that CPU cores were frozen in deep sleep power states, nor that this is the Typical Current Idle issue. Not on my Ryzen machine, anyway. It has to be something else: threads RANDOMLY LOCKING UP and RANDOMLY UNLOCKING THEMSELVES, affecting processes/threads that also appear to be random. I checked the PIDs of these locked-up jobs; top said they were in the idle state.

While it was locked, I went into various /proc folders and files to sniff for clues. I did not get anything too useful, except to see that they were idle:

/proc/[PID]/status

/proc/[PID]/task/[PID]/status
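
Two more reads that can help pin down where a hung task is stuck (replace $PID with the PID of the frozen vmrun; the sample output is illustrative, and reading the kernel stack needs root):

$ grep State /proc/$PID/status
State:  S (sleeping)
$ sudo cat /proc/$PID/stack     # kernel-side call stack of the blocked task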

My favorite soft reset, systemctl restart sddm, has worked many times, nearly without fail. I think that is because it flushes out and kills the hanging threads: the command kills X and everything running on X, which is quite a large number of processes, and then restarts the KDE display manager.

I am hoping for a further breakthrough to find out what causes these threads to LOCK UP and UNLOCK themselves.

Cheers.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear all,

I am now inclined to believe that Linux kernel 4.18.0-11 causes most of my (remaining) problems, because most of the time my system no longer requires the motherboard reset button to get working again: as mentioned here, my favorite sudo systemctl restart sddm command via SSH gets the system going again without a total reboot. My ksysguard graphs showed all CPUs and cores running while the backup jobs hung for hours; the final score on 27-Dec-2018 was a delay of over 12 hours, after which the backup crontab script completed, resuming by itself. This is, I think, quite different from those of you here who are forced to hit the power or reset button.

I found a Kubuntu bug thread and posted there as well; the URL is below and may be useful for your reference. Intel users are having similar problems there.

Comment #24 : Bug #1798961 : Bugs : linux package : Ubuntu

bryan444
Journeyman III

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Thanks - "typical current idle" also worked for me - Ryzen 1800X - ASUS Prime X370-Pro MB

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I have used an ASRock 350M Pro4, a couple of 350-based ASUS boards, and a Gigabyte 350-based board.

When overclocked @ 1.4 V as I described, they all seemed stable.

However, when not overclocked, the ASRock and ASUS boards periodically hard locked.

I have not tried the Gigabyte in a non-overclocked configuration.

Following recent posts, it seems the lockups happen more with ASRock and ASUS. I will note that Gigabyte, at least, uses a higher-frequency VRM than ASRock or ASUS; a higher switching frequency lets the VRM respond faster to load changes, and maybe that is why more lockups are seen with ASRock especially.

The drawback of higher-frequency VRMs is that they are less efficient and run hotter at light loads (which for most desktops is 99% of the time).

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

All,

This thread may be discussing one problem that manifests itself in multiple ways, or multiple problems; I am not sure, but it is an interesting point of discussion. Here is my take on it:

I do a lot of electronics design and have witnessed first hand how power-rail problems can result in a range of strange issues, many of which would make one think it is a software issue.

The original start of this thread, Linux kernel bug 196683, can certainly be explained by a power-rail issue causing the CCX lockup, and the other complete hard locks could simply be the result of both CCXs locking up, leaving you with no active CPU.

Could these come out of lockup on their own, as uyuy seems to have observed? Possibly. Or maybe the scheduler eventually moves his process to the CCX that is not locked up?

In fact, with the whole CCX hard-lock issue, you would think AMD microcode could detect it and just reset (as most processors do), so at least you get a system that comes back up and running!

The idle situation is interesting, as is the solution that for many has seemed to fix it: the "Typical Current Idle" BIOS setting. Most VRMs and some PSUs drop into another mode when current draw falls far enough; they do this to run more efficiently and save power. However, it takes some time to go from this low-power mode back to normal mode, and during that time the CPU can be voltage-starved. Raising the idle current (through the BIOS setting) likely prevents the VRM from falling into this mode, so it can respond rapidly when a core exits a deep C-state. I will note that the problem does not occur while idle but when a process wakes, taking CPU core(s) out of C6 or the lowest allowed C-state.

It could also be why Windows users do not see this frequently: MS has optimized Windows for laptop use and avoids background processes that wake frequently. As this likely happens on only a tiny fraction of a percent of the CPU idle-to-normal transitions, it would stand to reason that the issue is much rarer on Windows. I have also heard that Windows manages C/P-state transitions differently from Linux, but I do not have any details on this.

Note this is a theory, but it seems to explain well why overclocking or this new BIOS setting creates a stable situation while doing nothing results in random instabilities.

The reason this may be daunting for AMD is that the fault does not lie with any one component: the PSU, motherboard VRM, and the CPU and its microcode all have to dance properly for things to work right. This likely also explains why replacing the motherboard solved or minimized the problem for some, and the PSU for others, and why even differences/tolerances in the analog part of the Ryzen chip (the onboard per-core regulator) can make a difference.

I plan to measure the current into and voltage out of the VRM when coming out of idle, both with the Typical Current Idle setting and with the BIOS defaults. I will post the results.

shinobi
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Thanks for the info! Will wait for the results!

shinobi
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Can you post some close-up pictures and, if possible, provide some links to the VRM specs of the ASUS TUF B450-PLUS board?

Maybe the hardware engineers here can help do a comparison!

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear All,

Some good news and discovery.

My crisis is greatly improved: the first 5 hours have now run without a lockup. Essentially all I did was change my Linux kernel from 4.18.0-11-generic to 4.15.0-43-generic.
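
For anyone on Kubuntu/Ubuntu wanting to try the same thing, this is roughly how I would do it (package names assume the stock Ubuntu 18.04 archive):

$ sudo apt install linux-image-4.15.0-43-generic linux-headers-4.15.0-43-generic
$ sudo reboot
(at the GRUB menu, choose "Advanced options" and pick the 4.15.0-43 entry)
$ uname -r
4.15.0-43-generic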

I had previously also tried 4.18.0-13-generic and found it equally bad.

My strongest suspicion is that the 4.18.0-X kernels' thread scheduler is buggy, with a bug that randomly freezes some threads, for up to 12 hours, and later randomly unfreezes them. I call it random because I cannot find any consistent pattern in how it freezes/unfreezes. These hangs hardly ever require a hard reset unless the system is left frozen for a very long time: if I discover it soon enough and issue a soft reset via the SSH command sudo systemctl restart sddm, it recovers. (It would be gdm instead of sddm if you are on Ubuntu instead of Kubuntu.)

My guess for this difference (between requiring the motherboard reset switch vs a soft reset command) is TOO MANY THREADS REPEATEDLY FROZEN OVER A LONG UNATTENDED TIME. It is only a guess, because I cannot afford the time to test and prove it. My reasoning is that this kernel thread-scheduler bug freezes more and more threads than it unfreezes over a long unattended period, until critical kernel or driver-module threads, or ssh or bash itself, have frozen, at which point you have no more chance to soft reset and recover.

I have verified that when only 1 or 2 threads are frozen, the servers, ssh, bash, and even ksysguard (CPU usage/load percentage graphs) keep running, and I have never found any single CPU core or logical CPU (hyperthread) stuck at ZERO % usage.

[Screenshot: Ksysguard1.png, ksysguard CPU load graphs]

When my X.org console freezes, the mouse freezes and the CPU usage graphs all freeze, but there is usually still a good chance that if I quickly SSH in and run my favorite reset command, sudo systemctl restart sddm, it recovers. When I left it frozen for a long time without checking, there was a high chance of it becoming completely unrecoverable via SSH, and the reset switch became the only way to get the system rebooted.

Today, when I checked my CPU idle states via the kernel, it is not using C6 at all, yet my BIOS settings neither DISABLE C6 nor use TYPICAL CURRENT IDLE, nor am I using the kernel boot parameter idle=nomwait. I think my Gigabyte X470's F4E BIOS has DISABLED the C6 power state and forced TYPICAL CURRENT IDLE:

~$ cat /sys/devices/system/cpu/cpu*/cpuidle/state*/name

POLL

C1

C2

(the same three states repeated for each logical CPU; output truncated)
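
Separately, to double-check which boot parameters are actually in effect, the kernel command line can be read back; idle=nomwait would appear here if it were set (output illustrative):

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-43-generic root=UUID=... ro quiet splash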

Given this state of stability, I am optimistic that no further debugging of my system will be needed for now.

My proposal for Kubuntu/Ubuntu users: check that your kernel version is something other than 4.18.0-X; try the older 4.15.x series first, and newer versions when they are released. If your stability improves with an alternate kernel, stay with it, and try the improved kernels when they become available.

Thanks & regards

uy

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Interestingly, amdmatt found your reply useful... he also found bthruxton's reply to you useful.

samx stated that the Family 17h revision guide <https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf> mentions three MWAIT errata:

  1057 MWAIT or MWAITX Instructions May Fail to Correctly Exit From the Monitor Event Pending State

  1059 In Real Mode or Virtual-8086 Mode MWAIT or MWAITX Instructions May Fail to Correctly Exit From the Monitor Event Pending State

  1109 MWAIT Instruction May Hang a Thread

And some people believe that one or more of these may be related to the "freeze-while-idle", and hence that the Linux kernel boot parameter idle=nomwait is the way forward. (I guess we can ignore the Real Mode and Virtual-8086 Mode issue!)
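
For anyone wanting to try it, idle=nomwait would typically be set on Ubuntu like this (standard GRUB2 layout assumed):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash idle=nomwait"

$ sudo update-grub && sudo reboot
$ cat /proc/cmdline    # confirm the parameter took effect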

So, for completeness, can AMD tell us (c) which, if any, of these MWAIT issues are addressed by

Chris
