Processors

imshalla
Adept II

Ryzen linux kernel bug 196683 - Random Soft Lockup

I have Ryzen 7 1800X, ASUS Prime X370-PRO, running Fedora 26 and 27.

The damn thing has not worked properly since I bought it.

My first CPU was RMA'd, and the replacement does not appear to suffer from the SEGV fault.

However, both the original and the replacement CPU crash regularly:

  * occasionally with streams of "watchdog: BUG: soft lockup" events being logged,

  * but mostly the system just stops and I can find no logging that tells me why.

At bugzilla.kernel.org I find bug 196683, where a "workaround" is suggested:

1) kernel configured:  CONFIG_RCU_NOCB_CPU=y

2) kernel command-line:  rcu_nocbs=0-15
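
For anyone who wants to confirm that both halves of the workaround are actually in effect on a running machine, here is a minimal sketch in Python. It assumes the kernel config is available at /boot/config-<release> (true on many distributions, but not guaranteed), and the helper names parse_cmdline and config_enabled are made up for this example. Note that the kernel documents the parameter as rcu_nocbs, with an underscore.

```python
import platform
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def config_enabled(config_text: str, option: str) -> bool:
    """True if the kernel config text contains a line 'OPTION=y'."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        params = parse_cmdline(cmdline.read_text())
        print("rcu_nocbs:", params.get("rcu_nocbs", "(not set)"))
    # Assumed location of the config file; distributions differ.
    config = Path(f"/boot/config-{platform.release()}")
    if config.exists():
        enabled = config_enabled(config.read_text(), "CONFIG_RCU_NOCB_CPU")
        print("CONFIG_RCU_NOCB_CPU=y:", enabled)
```

If rcu_nocbs shows as "(not set)", the boot parameter was not picked up, whatever the kernel config says.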

But with kernel 4.14.18-300.fc27 I find the machine has stopped overnight (when it is idle), every two or three days.

I have added kernel command-line "processor.max_cstate=5", which may help with the crashes, but (I assume) not with the electricity bill :-(

Does anybody understand what the real fault is ?  A "workaround" is all very well, but not entirely satisfactory.  It's not as if this is a new device any more.

0 Likes
122 Replies
shmerl
Elite

I have been bitten by the same issue on Linux ever since I bought my Ryzen CPU (Ryzen 7 1700X). At first I had both this one and the segfault bug. I RMA'd the chip and the segfault was gone, but the random freezes / reboots never really went away. I first blamed the motherboard, RAM and even power spikes, but nothing helped, and now I have found that kernel bug:

196683 – Random Soft Lockup on new Ryzen build

Some commented there, that AMD know about the issue and plan to fix it with microcode update. However it wasn't fixed in microcode 0x08001136 as some reported. And my motherboard (Asrock X370 Taichi) still ships even older one: 0x08001129.
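
To see which microcode revision a machine is actually running, the "microcode" field in /proc/cpuinfo can be read. The following is only a sketch: the function name microcode_revisions is invented here, and it assumes the field is printed as a hex value, as it is on x86 Linux.

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set:
    """Collect the distinct microcode revisions listed in /proc/cpuinfo text."""
    found = re.findall(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.M)
    return {int(value, 16) for value in found}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for revision in sorted(microcode_revisions(cpuinfo.read_text())):
            print(f"microcode revision: 0x{revision:08x}")
```

Normally every logical CPU reports the same revision, so the set should hold a single value that can be compared against the versions mentioned in the bug report.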

Are you still planning to fix it, or the only way to do it is to replace the CPU with Ryzen 2?

imshalla
Adept II

For completeness...

...one of the theories is that disabling cstate 6 contributes to preventing random freezes.  My experience is of finding the machine frozen in the morning, when it has been idle overnight.  So some problem with the "Deep Power Down" state seems probable.

...however, the kernel-command-line option "processor.max_cstate=5" does not appear to disable (cstate) C6 :-(  I am unable to determine what (if anything) it does do.

...but at ZenStates-Linux there is 'zenstates.py' which will disable C6.  And (with any luck) that may do the trick.
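
One way to see what the kernel itself thinks the idle states are is to list the cpuidle entries in sysfs. This is a sketch only: list_idle_states is a name made up for this example, and the states shown are the kernel's own numbering, which may well not correspond one-to-one with the hardware C6 states discussed here (possibly part of why max_cstate=5 seems to do nothing useful).

```python
from pathlib import Path

CPUIDLE = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_idle_states(base: Path = CPUIDLE) -> list:
    """Return (directory, name, disabled) for each idle state under base."""
    states = []
    dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in dirs:
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    if CPUIDLE.exists():
        for directory, name, disabled in list_idle_states():
            print(directory, name, "disabled" if disabled else "enabled")
```

Comparing the output with and without processor.max_cstate=5 on the command line should at least show whether the option changes the list at all.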

It is, of course, disappointing that my shiny Ryzen CPU is unreliable.  But it is *infuriating* that nobody seems to know what the problem is, or whether some or all of the suggested spells are required to avoid these apparently random freezes.

0 Likes
shmerl
Elite

Is there a point to open a support ticket for this?

0 Likes

I have sent a message via <http://support.amd.com/en-us/contact/email-form>, referring to this thread and the Kernel Bug.  [Service Request: {ticketno:[8200794428]}]

I assume AMD are already aware of the Kernel Bug... but it would be nice to hear whether a BIOS and/or microcode fix is coming, or whether the Linux Kernel folk should be persuaded to incorporate a permanent "do not use C6" for Ryzen CPUs (assuming that's really the answer !).

0 Likes

I most likely will just buy Ryzen 2 this spring, since I don't think the fix is coming any time soon. But time will tell.

I also might end up building the kernel with CONFIG_RCU_NOCB_CPU=y.

0 Likes

Just built the kernel with CONFIG_RCU_NOCB_CPU=y, and it is running OK so far (kernel boot parameter: rcu_nocbs=0-15). CPU temperature is slightly higher than before (I assume it does less core parking, or maybe something else is affected).

But that's what I'd expect if the freeze happens because of insufficient power delivery in some of the C6 states: rcu_nocbs=0-15 would somehow increase that power (while C6 states are still enabled), which should raise the temperature somewhat.

I have now heard back from TECH.SUPPORT@AMD.COM.  What they said, and what I have said in reply is shown below.

I can today say that my machine has not crashed in the last 7 days.  I cannot say how disappointed I am that this is cause for celebration :-(

                                                                                                                                         

On 28/02/18 10:51, TECH.SUPPORT@AMD.COM wrote:

Dear Chris,

Your service request : SR #{ticketno:[8200794428]} has been reviewed and updated.

Response and Service Request History:

Thank you for the response.

I understand you are experiencing an issue on your PC with Ryzen processor when C6 state is enabled on the BIOS.

Yes, my PC freezes at random when it is idle.  Typically it will freeze when left overnight, roughly every 2 or 3 days.

Having disabled C6 -- using the 'zenstates.py' script, from https://github.com/r4m0n/ZenStates-Linux -- my machine has not frozen for 7 days.

This issue has been fixed with the latest BIOS updates, but the option to fix it may not be available in all BIOS.

What is the root cause of the issue ?

In what way has it been fixed ?

I request you to update to the latest BIOS and see if you have the Power Supply Control option in the MB BIOS. Try toggling this option between the different settings to see if it fixes it. If the specific option is not available I would suggest you keep C6 off for now.

I have the latest available "PRIME X370-PRO BIOS 3803" from ASUS.  That apparently includes:

  2.Update to AGESA 1000a for new upcoming processors

I understand that means AGESA 1.0.0.0a (?) -- I have no idea what that means, since AMD seem to keep the release notes for AGESA as a deep, dark secret.  Previous BIOSes had (according to ASUS) "AGESA 1071", and before that were "AGESA 1.0.0.6B" and "AGESA 1.0.0.6a"... so I admit to being baffled.

What does this new BIOS "Power Supply Option" do ?

Are you telling me that this is a problem with my power supply ?

If so, does this mean I need a better power supply ?

Disabling C6 is not really a long term solution... since that disables both (a) the maximum single core performance, and (b) the minimum power consumption state.  While these are arguably marginal, I have wasted a lot of time and energy trying to get my machine to work reliably.

I am seriously disappointed that the only information available is buried in kernel bug report(s) and in various support forums.

Having (eventually) found Linux Kernel Bug 196683, I have been hoping that AMD would leap into action to: (a) inject some clarity into the discussion, and (b) provide a proper solution.

<sigh>

For completeness, let me repeat my questions:

  1) What is the root cause of the issue ?

  2) In what way has it been fixed ?

  3) What does this new BIOS "Power Supply Option" do ?

  4) Are you telling me that this is a problem with my power supply ?

  5) If so, does this mean I need a better power supply ?

Thanks,

Chris 

Best regards,

HK

AMD Global Customer Care

_____________________________________________________________________________________________

The contents of this message are provided for informational purposes only.  AMD makes no representation or warranties with respect to the accuracy of the contents of the information provided, and reserves the right to change such information at any time, with or without notice.

0 Likes

> Update to AGESA 1000a for new upcoming processors

That's what I see also for my ASRock X370 Taichi (except no version is specified): Update AGESA for future coming processors.

It didn't even update the microcode it seems. And where is that option for power supplies exactly? Let's hope some future update will actually fix it, but for now building the custom kernel and not disabling C6 looks like the best possible workaround.

0 Likes

Can you please also ask AMD, what exact AGESA version is providing the fix?

0 Likes

I received this from "AMD Support" on 01-Mar-2018 (at 11:50):

_______________________________________________________________________

  This is an automatically generated email please do not reply

  Dear Hall,

  Your service request SR# 8200794428 has been escalated to one of our

  experts who can better address your questions. We appreciate your

  patience, and thank you for your interest in AMD.

  Best Regards,

  AMD Global Customer Care.

____________________________________________________________________

But have not heard anything more, yet.

On a brighter note, I have now ~12 days without a freeze.  I am now going to turn C6 state back on again.

Chris

0 Likes

"AMD Support" said they had escalated the issue on 01-Mar-2018, it is now 12-Mar-2018... <sigh>.

Having run for ~12 days with C6 disabled, I re-enabled it, and my machine crashed twice in 3 days.

I have now replaced the power-supply, by something newer, more efficient and which claims:

   "Zero-Load design that supports Intel’s Deep Power Down C6 and C7 modes".

With C6 enabled, I rather expect the machine to crash again any day now.

shmerl
Elite

Browsing around my ASRock X370 Taichi firmware settings, I found this one:

Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control.

I changed it from auto to low, let's see if it will help with stock kernel.

0 Likes

Still freezing with "low current idle". Testing now with "common current idle".

0 Likes
shmerl
Elite

I'm now also seeing a lot of these in dmesg:

[11225.078807] x86: Booting SMP configuration:

[11225.078808] smpboot: Booting Node 0 Processor 1 APIC 0x1

[11225.081035] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

[11225.081063]  cache: parent cpu1 should not be sleeping

[11225.081127] microcode: CPU1: patch_level=0x08001129

[11225.081213] CPU1 is up

[11225.081233] smpboot: Booting Node 0 Processor 2 APIC 0x2

And so on for all 16 virtual cores.

0 Likes
shmerl
Elite

That C-state 0x0 not supported by HW happens now always, so it's not related to my test above.

With Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control set to "Common current idle" (instead of auto) in my ASRock X370 Taichi firmware, I didn't get any freezes in a while, so I assume it's a valid workaround.

I noticed what changes after it's set in the firmware, using zenstates.py:

When set to auto:

C6 State - Package - Enabled

C6 State - Core - Enabled

when set to Common current idle:

C6 State - Package - Disabled

C6 State - Core - Enabled

So apparently it disables package C6 state (while keeping core C6 state enabled)! Hopefully it can shed some light on what the problem is. I wonder if Ryzen 2 will be free of this issue.
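
To compare the two firmware settings without eyeballing the output, the two "C6 State" lines shown above can be parsed with a few lines of Python. This is a sketch that assumes the output has exactly the "C6 State - <scope> - <Enabled|Disabled>" shape quoted in this post; parse_c6_states is a name invented for the example.

```python
def parse_c6_states(listing: str) -> dict:
    """Map each C6 scope ('Package', 'Core') to True if it is Enabled."""
    states = {}
    for line in listing.splitlines():
        parts = [part.strip() for part in line.split(" - ")]
        if len(parts) == 3 and parts[0] == "C6 State":
            states[parts[1]] = parts[2] == "Enabled"
    return states


if __name__ == "__main__":
    # The two listings quoted above.
    auto = parse_c6_states("C6 State - Package - Enabled\nC6 State - Core - Enabled")
    common = parse_c6_states("C6 State - Package - Disabled\nC6 State - Core - Enabled")
    changed = [scope for scope in auto if auto[scope] != common.get(scope)]
    print("scopes changed by the firmware setting:", changed)
```

Fed the two listings above, the only scope that differs is "Package".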

What exactly is "package" in this context? Is it still part of CPU, or it's something on the motherboard?

0 Likes
daclown
Adept I

This thread is consistent with my experiences. I also note that this problem is more frequently triggered by having both Firefox and Steam open at the same time.
Given that syslog loses the last few minutes of logging leading into the soft lockup, I expect we're looking at some kind of runaway condition where something is getting stuck in a loop somewhere. My Softlocks are often preceded by CPU/GPU panics, but they are not always.

john1000
Adept II

I've been watching this post hoping more information would come to light since I want to build a Ryzen based linux system (I already have a windows one.)  Last Sunday, I decided to start a test.  I built a bare bones Ryzen 1300x system with 4 gigs and Asus X370 Pro.  All the other components are from my old Amd 945 - power supply, case, storage, fans, and nvidia GTX 660 graphics card.

I updated the BIOS to the latest, installed Ubuntu Mate 16.04.3 and did all the updates.  I then installed the supported nvidia proprietary drivers.  It's been running without problems so far.  I want to let it run for two weeks without issue before I move forward and buy 64gb of RAM.

I tried this last year and ran into trouble and I describe it here:  Linux Reboots during idle time and MCE errors (Ubuntu 16.04.3.) | johnstechpages.com

0 Likes

What you describe is the Kernel Magic where the:

1) kernel is configured:  CONFIG_RCU_NOCB_CPU=y

2) kernel command-line is used:  rcu_nocbs=0-15 (depending on the CPU)

This appears to help, but is not a guaranteed fix.

AMD have told me that a PSU which will deliver 12V at 0A is required.

I (now) have one of those, but my machine froze again overnight.

There is supposed to be a BIOS fix -- "Power Supply Idle Control" -- but my ASUS Prime X370-PRO does not support it (yet).

The BIOS fix appears to disable the C6 (deep sleep) state, either (a) for the "package" or (b) for all cores.  I assume that by "package" we mean that C6 is enabled for all but one of the cores.  Some people have reported success with these options.

FWIW: I am now trying with C6 disabled for the "package".

I have been hoping for more information from AMD.  I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.

0 Likes

My test system has been running for a little more than a week now, mostly idle with no crashes.  The main differences from last year when I conducted this test are:

1)  Different CPU - an 1300x rather than 1800x.

2)  I've populated only 1 dimm socket with 4 GB of ram rather than 4 sockets with 64 GB.

3)  I'm using a GTX-660 video card rather than a Nvidia 8400 GS.

4)  I'm using the nvidia proprietary driver this time around rather than the open source driver.

5)  Updated BIOS and Ubuntu 16.04.

6)  I'm using a Cooler Master RS 850 Power Supply I purchased in December of 2009 with 8 years of continuous use rather than a new Seasonic Prime 750.

I have not done any of the kernel configurations that are supposed to address this problem, I'm running plain vanilla. We both have the same motherboard and I assume that we both have #5 covered.  What video card do you have and what drivers are you using?  If it's a Nvidia card and you're using the opensource drivers, my only suggestion right now is to switch to the proprietary drivers and start the clock.

It could be that the crash at idle is more prevalent with more cores.  That's not something I can test until the new 2700X's come out.  I did order the memory yesterday and will likely install it sometime during the week or next weekend.  I'll keep you posted.

0 Likes

Update:  I let the machine run 9 days without a restart.  I ran into no issues.  I installed 64GB of ram and ran a memtest for 27 hours with 4 complete passes - no memory errors.  I've been using it ever since.  If I run into any trouble, I'll post here.  I'm likely many weeks away from purchasing a 2700X - I'd like to wait for the official reviews and any BIOS updates first.

0 Likes

Happy to hear your (effectively) new machine is running OK -- without any of the "Kernel Magic".

I have AMD Firepro W2100.  There are rumours of issues with amdgpu, but mostly discounted in favour of C6 issues.

AMD advice notwithstanding, changing to a 0A at 12V PSU did not fix the issue for me.

Using zenstates.py, I have been running with C6 disabled for the "package" for about 14 days now, without a freeze.

So far, ASUS have not shipped the "Power Supply Idle Control" (at least for the Prime X370-PRO).  I asked on 26-Mar, and on 4-Apr "servicecenter_emea@asus.com" were able to tell me:

     No anoucement in regards to this has been made, as of yet

      So we are not able to advise if/when this wil lbe coming

<sigh>

It is rumoured that the "Power Supply Idle Control" options have the effect of disabling C6 either entirely or for the "package", which is what the zenstates.py allows one to do.  However, it is also suggested that tweaking some Overclocking options can also eliminate the freezes.  It may be that somebody clever at AMD has figured out a better way of fixing the problem.

Amongst other things, I would dearly like to know:

  • what, exactly, do the "Power Supply Idle Control" options do ?
  • is it true that disabling C6 altogether will disable the "Max Boost Clock" ?
  • does disabling C6 for the "package" allow "Max Boost Clock" to be used ?
  • what difference do these options make to power consumption ?

I have been hoping for more information from AMD.  I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.

Before sending my original Ryzen away to go through the (two week) RMA process, I bought myself an i7 8700K.  Sadly, only 6 cores... but it does at least work.  Also, it has a built in GPU which does as much as I need (I am neither a gamer nor a Bitcoin miner).  I look forward to the 10nm 8 core version.

0 Likes

Last week, I had two failures during idle, neither left any hints in the logs.  After the first, I replaced the power supply with a Seasonic Prime Titanium 650 and the very next day, it happened again.  I completely agree with imshalla that AMD's explanation that this is related to the power supply not being "haswell" compatible is false.  I updated the BIOS of my Asus Prime X370 to v4008, installed a 2700x yesterday and set the Power Supply Idle Control to get around this issue.  Tonight, I'm going to set the Power Supply Idle control back to default to see if there is an impact to power consumption and frequency range the processor runs at.  I moved one of my VMs to it last night, (elastic search/logstash/kibana stack) and I wonder if that is enough load to prevent this from happening in the future.  Will update this once I answer the power consumption/frequency range question.

0 Likes
aslon
Journeyman III

Could an AMD representative confirm if the 2000 series Ryzens are affected by the random soft lockup bug?

196683 – Random Soft Lockup on new Ryzen build

0 Likes
imshalla
Adept II

I lost interest in my Ryzen 7 1800X machine... but this morning I got round to upgrading the BIOS on the ASUS Prime X370-Pro (to 4011, hot off the press: "Update AGESA 1.0.0.2a + SMU 43.18" whatever that means).

I was running with rcu_nocbs and zenstates --c6-package-disable. That ran for 11 days and then froze.

I last heard from "TECH.SUPPORT@AMD.COM":

  Thank you for the update and confirming that your BeQuite Straight Power 11

  supports 0A minimum load.

  Because your system still freezes, it could be due to cross loading problems

  which can result in the power supply turning off when a load changes or

  result in voltages becoming out of specification causing system crashes and

  hangs. entire CCX or Core complex is taken down.

  There are many levels of power states that a core can be in from C1 to C6,

  CC6 and finally PC6. The Power Supply Idle Control option is designed to

  keep enough current on the rail so that power supply does not go out of

  regulation.

  The Power Supply Idle Control option is part of an AGESA update from AMD

  provided to the motherboard vendors for validation and implementation in

  their BIOS updates. However, it is motherboard vendors decision as this

  which BIOS version will contain the Power Supply Idle Control option.

So now I have options: "low current idle", "typical current idle" and "auto". Neither AMD nor ASUS seem to think it necessary to document what those mean.

I have set "typical current idle". I note that zenstates shows that both C6 States Package and Core are Enabled.

I guess I am back to waiting and seeing.

<sigh>

0 Likes

Hello Everyone,

I was also having random lockups with Linux, but I appear to have fixed it.

CPU: AMD Ryzen 7 1700

Motherboard: ASUS Prime X370-PRO

With the 4008 BIOS update I made the following change in BIOS: Advanced Mode -> Advanced -> AMD CBS -> Power Supply Idle Control: Typical Current Idle

After that the computer has stayed running for about a month.

Thanks - "typical current idle" also worked for me - Ryzen 1800X - ASUS Prime X370-Pro MB

0 Likes

Hi imshalla,

I'm also suffering from this bug. I've heard on another bug tracker (I don't have a link, sorry) that overclocking some can help prevent the crashing.

I've overclocked by 200 MHz and changed to typical current idle, as well as disabling both C6 states (package and core - I've heard that package only isn't enough).

I'm optimistic that this all will fix it, have you had any more crashes with typical current idle?    

0 Likes

So far, uptime 11 days with "typical current idle".   So far, so good.

0 Likes

Today 30 days uptime with the magic "typical" BIOS setting. This is more than twice the previous record.

FWIW: I am so fed up with this machine that I haven't used it since I updated the BIOS and applied the setting. It is running 4.16.5 (Fedora 27), with CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15. I don't know if the rcu_nocbs=0-15 is still required.

Also FWIW: zenstates.py -l tells me that C6 Package is Disabled, but C6 Core is Enabled. Before the BIOS update I used zenstates.py to set C6 the same way, but the machine froze after some 12 days. After the BIOS update I no longer use zenstates.py to set anything. So I guess the BIOS "typical" option disables C6 Package, but also does some other magic.

Mr BeQuiet! are adamant that the Straight Power 11 I have is perfectly happy to supply 0A at all voltages.

Of course, there's a lot of stuff between the PSU and the CPU... so it could be a motherboard issue. Who can tell ?

Possibly, some day, I will go back to using my AMD machine, but I doubt I shall come to be fond of it :-( Certainly I am livid with AMD's abject failure to address the issue promptly, and their continuing inability to discuss or document the issue. Bugs happen. It's how they are dealt with that separates the sheep from the goats. <sigh>

0 Likes

"Typical Current Idle" setting seems to have fixed it for me. Only changed this setting in UEFI, everything else default. Uptime 4 days without an incident, running the latest bios (4012) and kernel 4.18.0-rc7-mainline on asus prime x370-pro with 1700. Had to put up with random freezes ever since I built this system. Upgraded ram to faster ones thinking they could be the culprit. Been updating to the latest bios as soon as they came out and running the latest kernel hoping something would fix it. Finally my system will be stable, crossing my fingers.

0 Likes
spiffy
Journeyman III

Is there any info about if this affects Ryzen 2000 series?

I'm hoping that the Ryzen 5 1600 I bought can be made stable using the ZenStates /  Power Supply Idle Control before I shell out on a bunch of 2600s.

It's been locking up about twice a week when idle for the last 5-6 weeks but the user's only just reported it. User can turn to talk to a colleague and when he turns back the PC has locked up.

Thanks to the useful info on this thread it's currently on day 2 of test with "zenstates.py --c6-disable" so I'll not know for a few more days yet.

As can be seen from the below spec it should be a fairly low power draw on the PSU when idle

Ryzen 5 1600 (stock, no overclock)

Gigabyte A320MA-M.2 (BIOS: F23d 2018-04-17 (latest))

16GB Corsair Vengeance 2400

250 GB SSD

Nvidia GT 710

Corsair CX550 PSU (Supposedly Haswell C6/C7 compliant)

Linux Mint 18.3 MATE / Kernel 4.13.0-43-generic (latest)

Edit: Typos

0 Likes

Hello all,

These problems seem to be the same we noticed for EPYC CPUs

In general you guys need newer kernels and boot with idle=nomwait

Also look there for more info on why MWAIT didn't work:

epyc 7551 spontaneously resets after 10mins rendering

BR

0 Likes
imbezol
Journeyman III

I have been having these exact same issues since I got my system. Typically I come to the computer in the morning and it'll be locked up and I have to reset it. On occasion though, and when the system is fairly idle, it will freeze in the middle of using it. Mouse pointer will freeze on the screen.. system doesn't respond to pings any longer.. and I have to reset it. This happens every couple days usually but have had it happen 3 times in a day before. I have been through every BIOS version, custom kernels, and swapped almost every piece of hardware, to no avail.

Asus PRIME X370-PRO

AMD Ryzen 1700

Corsair Vengeance LED DDR4 3200Mhz 2 x 8GB (x2 for a total of 32 GB)

MSI Radeon RX 580 Gaming X 8G

-> upgraded from Asus R9 280x in attempt to fix lockups

Corsair HX1000i PSU

-> upgraded from Corsair TX750W in attempt to fix lockups

Samsung 960 PRO NVMe M.2 PCIe x4 SSD 1TB

-> got rid of all spinning disks in attempt to fix lockups

Also swapped keyboard, mouse, and monitors, got rid of USB devices, etc.

I started with Ubuntu 16.04 and then moved to Arch and now back to Ubuntu and am on 18.04 currently.

0 Likes

Did you try the BIOS setting everyone has been talking about, or disable the C6 power state?

0 Likes

Today my Ryzen 7 1800X, on Asus Prime X370-Pro, has 67 days uptime, and has been idle for all that time.

So, I have not suffered a lockup since upgrading the BIOS to 4011 (with "AGESA 1.0.0.2a + SMU 43.18"), and setting the "Advanced Mode -> Advanced -> AMD CBS -> Power Supply Idle Control" option to "Typical Current Idle".

0 Likes

I did run the BIOS 4011 for a while and still had crashes. I am currently running 4012 and came to a crashed system this morning.

I just set the "Power Supply Idle Control" option earlier this morning after seeing this thread so I will update in a couple days if it has helped.

0 Likes

Prior to changing that setting....

> last | grep boot

reboot   system boot  4.15.0-23-generi Mon Jul 16 08:47   still running

reboot   system boot  4.15.0-23-generi Mon Jul 16 05:21 - 08:44  (03:22)

reboot   system boot  4.15.0-23-generi Sat Jul 14 12:51 - 08:44 (1+19:52)

reboot   system boot  4.15.0-23-generi Sat Jul 14 12:08 - 12:32  (00:23)

reboot   system boot  4.15.0-23-generi Fri Jul 13 07:33 - 10:49 (1+03:16)

reboot   system boot  4.15.0-23-generi Wed Jul 11 12:32 - 10:49 (2+22:16)

And now...

09:46:13 up 4 days, 59 min,  2 users,  load average: 0.32, 0.50, 0.55

It's looking promising.
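
A rough way to put numbers on the "before" listing is to measure the gaps between consecutive boots. After a hard freeze the durations that `last` prints in parentheses are not trustworthy (no shutdown is ever recorded), so the sketch below only uses the boot timestamps. It is an illustration under assumptions: the function names are made up, the year is not present in the output and has to be supplied, and a listing that spans New Year is not handled.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "reboot   system boot  4.15.0-23-generi Mon Jul 16 08:47 ..."
BOOT_LINE = re.compile(
    r"^reboot\s+system boot\s+\S+\s+\w{3}\s+(\w{3}\s+\d{1,2}\s+\d{2}:\d{2})"
)


def boot_times(last_output: str, year: int) -> list:
    """Extract boot timestamps from `last` output, oldest first."""
    times = []
    for line in last_output.splitlines():
        match = BOOT_LINE.match(line.strip())
        if match:
            stamp = " ".join(match.group(1).split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M"))
    return sorted(times)


def gaps_between_boots(times: list) -> list:
    """Time from each boot to the next one: an upper bound on that uptime."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    sample = (
        "reboot   system boot  4.15.0-23-generi Mon Jul 16 08:47   still running\n"
        "reboot   system boot  4.15.0-23-generi Mon Jul 16 05:21 - 08:44  (03:22)\n"
    )
    for gap in gaps_between_boots(boot_times(sample, 2018)):  # year assumed
        print(gap)
```

Each gap includes however long the machine sat frozen before somebody reset it, so the real uptimes were shorter still.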

0 Likes

07:13:53 up 11 days, 22:27,  2 users,  load average: 0.73, 0.81, 0.77

Still no crashes! I think the "Power Supply Idle Control" may have solved the issue.

0 Likes
shinobi
Adept I

@jesse_amd, would you help us folks who have been suffering from the idle lockup bug with Ryzen for so, so long ???

We are asking you to comment because, we find that you seem to have helped the customers with EPYC CPUs, to solve a similar problem.

The bugzilla ticket URL is 196683 – Random Soft Lockup on new Ryzen build

With everything new & stock, we always have to go into the BIOS and disable the "Global C state control" setting to make the system stable.

Otherwise, usually when the system is idle, it locks up. Other findings include, but are not limited to, setting the Power Supply Idle Control to "common current idle".

We would be glad to have a BIOS firmware fix, so that the system would remain stable with the default BIOS settings.

Even the recent BIOS updates, for instance, the one that updates the firmware to AGESA 1.0.0.2a + SMU 43.18, has not helped.

Kindly help!

0 Likes