Browsing around my ASRock X370 Taichi firmware settings, I found this one:
Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control.
I changed it from auto to low; let's see if it helps with the stock kernel.
I'm now also seeing a lot of these in dmesg:
[11225.078807] x86: Booting SMP configuration:
[11225.078808] smpboot: Booting Node 0 Processor 1 APIC 0x1
[11225.081035] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[11225.081063] cache: parent cpu1 should not be sleeping
[11225.081127] microcode: CPU1: patch_level=0x08001129
[11225.081213] CPU1 is up
[11225.081233] smpboot: Booting Node 0 Processor 2 APIC 0x2
And so on for all 16 virtual cores.
I received this from "AMD Support" on 01-Mar-2018 (at 11:50):
This is an automatically generated email please do not reply
Your service request SR# 8200794428 has been escalated to one of our
experts who can better address your questions. We appreciate your
patience, and thank you for your interest in AMD.
AMD Global Customer Care.
But I have not heard anything more yet.
On a brighter note, I have now gone ~12 days without a freeze. I am now going to turn the C6 state back on again.
The "C-state 0x0 not supported by HW" message now appears every time, so it's not related to my test above.
With Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control set to "Common current idle" (instead of auto) in my ASRock X370 Taichi firmware, I haven't had any freezes in a while, so I assume it's a valid workaround.
Using zenstates.py, I checked what changes once it's set in the firmware:
When set to auto:
C6 State - Package - Enabled
C6 State - Core - Enabled
When set to Common current idle:
C6 State - Package - Disabled
C6 State - Core - Enabled
So apparently it disables package C6 state (while keeping core C6 state enabled)! Hopefully it can shed some light on what the problem is. I wonder if Ryzen 2 will be free of this issue.
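For anyone who wants to check those two flags without running the whole script, here is a minimal sketch of how the state can be decoded. The MSR addresses and bit positions are my assumptions, written down as I recall them from the zenstates.py source, so verify them against the script before trusting the result. The helper names (decode_c6, read_msr, format_state) are made up for this example.

```python
import os
import struct

# ASSUMPTION: addresses and bits as recalled from zenstates.py, not verified
# against AMD documentation.
MSR_PMGT_MISC = 0xC0010292      # bit 32 is taken to be "package C6 enabled"
MSR_CSTATE_CONFIG = 0xC0010296  # bits 22, 14 and 6 are taken to be "core C6 enabled"

PKG_C6_BIT = 1 << 32
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def decode_c6(pmgt_misc, cstate_config):
    """Turn two raw 64-bit MSR values into package/core C6 flags."""
    return {
        "package": bool(pmgt_misc & PKG_C6_BIT),
        # Core C6 counts as enabled only if all three bits are set.
        "core": (cstate_config & CORE_C6_BITS) == CORE_C6_BITS,
    }


def read_msr(address, cpu=0):
    """Read one MSR through /dev/cpu/N/msr (needs root and the msr module)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        os.lseek(fd, address, os.SEEK_SET)
        return struct.unpack("<Q", os.read(fd, 8))[0]
    finally:
        os.close(fd)


def format_state(state):
    """Render the flags in the same style as the output quoted above."""
    def label(flag):
        return "Enabled" if flag else "Disabled"
    return "C6 State - Package - %s\nC6 State - Core - %s" % (
        label(state["package"]), label(state["core"]))
```

On a live system (as root, after "modprobe msr") the call would be print(format_state(decode_c6(read_msr(MSR_PMGT_MISC), read_msr(MSR_CSTATE_CONFIG)))). Keeping the decoding separate from the MSR read means the bit logic can be checked on any machine.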
What exactly is "package" in this context? Is it still part of the CPU, or is it something on the motherboard?
"AMD Support" said they had escalated the issue on 01-Mar-2018, it is now 12-Mar-2018... <sigh>.
Having run for ~12 days with C6 disabled, I re-enabled it, and my machine crashed twice in 3 days.
I have now replaced the power supply with something newer and more efficient, which claims:
"Zero-Load design that supports Intel’s Deep Power Down C6 and C7 modes".
With C6 enabled, I rather expect the machine to crash again any day now.
This thread is consistent with my experiences. I also note that this problem is more frequently triggered by having both Firefox and Steam open at the same time.
Given that syslog loses the last few minutes of logging leading into the soft lockup, I expect we're looking at some kind of runaway condition where something is getting stuck in a loop somewhere. My soft lockups are often, but not always, preceded by CPU/GPU panics.
I've been watching this post hoping more information would come to light, since I want to build a Ryzen-based Linux system (I already have a Windows one). Last Sunday, I decided to start a test. I built a bare-bones Ryzen 1300X system with 4 GB of RAM and an Asus X370 Pro. All the other components are from my old AMD 945: power supply, case, storage, fans, and NVIDIA GTX 660 graphics card.
I updated the BIOS to the latest version, installed Ubuntu Mate 16.04.3 and applied all the updates. I then installed the supported NVIDIA proprietary drivers. It's been running without problems so far. I want to let it run for two weeks without issue before I move forward and buy 64 GB of RAM.
I tried this last year and ran into trouble and I describe it here: Linux Reboots during idle time and MCE errors (Ubuntu 16.04.3.) | johnstechpages.com
What you describe is the "kernel magic", where:
1) the kernel is configured with: CONFIG_RCU_NOCB_CPU=y
2) the kernel command line includes: rcu_nocbs=0-15 (the range depends on the CPU)
This appears to help, but is not a guaranteed fix.
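To confirm that the command-line option actually reached the running kernel, here is a small sketch that looks for it in a command-line string and expands the CPU list. The function names are made up for this example, and it only handles simple lists such as "0-15" or "0-3,8", not the kernel's stride syntax or "all". The documented spelling is rcu_nocbs; the kernel treats hyphens and underscores in parameter names as equivalent, so the sketch accepts both.

```python
from typing import List, Optional


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a simple CPU list like '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def rcu_nocbs_cpus(cmdline: str) -> Optional[List[int]]:
    """Return the CPUs named by rcu_nocbs=, or None if the option is absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Hyphens and underscores are interchangeable in parameter names.
        if sep and key.replace("-", "_") == "rcu_nocbs":
            return parse_cpu_list(value)
    return None
```

On the machine itself, pass it the contents of /proc/cmdline, for example rcu_nocbs_cpus(open("/proc/cmdline").read()). A result of None means the option was not passed at boot, which is worth ruling out before concluding that the workaround failed.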
AMD have told me that a PSU which will deliver 12V at 0A is required.
I (now) have one of those, but my machine froze again overnight.
There is supposed to be a BIOS fix -- "Power Supply Idle Control" -- but my ASUS Prime X370-PRO does not support it (yet).
The BIOS fix appears to disable the C6 (deep sleep) state, either (a) for the "package" or (b) for all cores. I assume that by "package" we mean that C6 is enabled for all but one of the cores. Some people have reported success with these options.
FWIW: I am now trying with C6 disabled for the "package".
I have been hoping for more information from AMD. I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.
My test system has been running for a little more than a week now, mostly idle with no crashes. The main differences from last year when I conducted this test are:
1) Different CPU - a 1300X rather than an 1800X.
2) I've populated only 1 DIMM socket with 4 GB of RAM rather than 4 sockets with 64 GB.
3) I'm using a GTX 660 video card rather than an NVIDIA 8400 GS.
4) I'm using the NVIDIA proprietary driver this time around rather than the open-source driver.
5) Updated BIOS and Ubuntu 16.04.
6) I'm using a Cooler Master RS 850 Power Supply I purchased in December of 2009 with 8 years of continuous use rather than a new Seasonic Prime 750.
I have not done any of the kernel configurations that are supposed to address this problem; I'm running plain vanilla. We both have the same motherboard and I assume that we both have #5 covered. What video card do you have and what drivers are you using? If it's an NVIDIA card and you're using the open-source drivers, my only suggestion right now is to switch to the proprietary drivers and start the clock.
The crash at idle may be more prevalent with more cores. That's not something I can test until the new 2700Xs come out. I did order the memory yesterday and will likely install it sometime during the week or next weekend. I'll keep you posted.