imshalla
Adept II

Ryzen linux kernel bug 196683 - Random Soft Lockup

I have a Ryzen 7 1800X on an ASUS Prime X370-PRO, running Fedora 26 and 27.

The damn thing has not worked properly since I bought it.

My first CPU was RMA'd, and the replacement does not appear to suffer from the SEGV fault.

However, both the original and the replacement CPU crash regularly:

  * occasionally with streams of "watchdog: BUG: soft lockup" events being logged,

  * but mostly the system just stops and I can find no logging that tells me why.
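For anyone trying to tell the two failure modes apart, a rough sketch of how the logged case could be tallied from a saved kernel log. The function name is made up for illustration, and the "- CPU#<n>" suffix in the pattern is an assumption about how the watchdog message continues; only the "watchdog: BUG: soft lockup" prefix comes from the logs described above.

```python
import re
from collections import Counter

# Assumed message shape: "watchdog: BUG: soft lockup - CPU#<n> stuck for ...".
# Only the "watchdog: BUG: soft lockup" prefix is taken from this thread.
SOFT_LOCKUP = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+)")


def count_soft_lockups(log_lines):
    """Return a Counter mapping CPU number -> number of soft-lockup reports."""
    counts = Counter()
    for line in log_lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines of the previous boot's kernel log (for example the output of `journalctl -k -b -1` saved to a file) would show whether the lockups cluster on particular cores. A silent freeze leaves nothing to count, which is itself a useful data point.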

At bugzilla.kernel.org I find bug 196683, where a "workaround" is suggested:

1) kernel configured:  CONFIG_RCU_NOCB_CPU=y

2) kernel command-line:  rcu_nocbs=0-15
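Since both halves of the workaround have to be in place, here is a minimal sketch for checking them. Note that the kernel parameter is spelled rcu_nocbs, with an underscore. The helper names are hypothetical, and where the two input texts come from is an assumption: on Fedora the kernel config is normally at /boot/config-<release>, and the running command line is in /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def workaround_status(config_text, cmdline):
    """Report whether both parts of the bug 196683 workaround appear to be set."""
    config_lines = {line.strip() for line in config_text.splitlines()}
    params = parse_cmdline(cmdline)
    return {
        # Part 1: the kernel must be built with callback offloading support.
        "config_ok": "CONFIG_RCU_NOCB_CPU=y" in config_lines,
        # Part 2: the CPU list passed at boot, or None if the option is absent.
        "rcu_nocbs": params.get("rcu_nocbs"),
    }
```

A misspelt option such as rcu-nocbs shows up here as "rcu_nocbs": None, which is worth ruling out before concluding that the workaround does not help.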

But with kernel 4.14.18-300.fc27 I find the machine has stopped overnight (when it is idle), every two or three days.

I have added kernel command-line "processor.max_cstate=5", which may help with the crashes, but (I assume) not with the electricity bill 😞

Does anybody understand what the real fault is ?  A "workaround" is all very well, but not entirely satisfactory.  It's not as if this is a new device any more.

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I have been bitten by the same issue on Linux ever since I bought my Ryzen CPU (a Ryzen 7 1700X). At first I had both this one and the segfault bug. I RMA'd the chip and the segfault was gone, but the random freezes / reboots never really went away. I first blamed the motherboard, the RAM and even power spikes, but nothing helped, and now I have found this open kernel bug:

196683 – Random Soft Lockup on new Ryzen build

Some commented there that AMD knows about the issue and plans to fix it with a microcode update. However, some reported that it wasn't fixed in microcode 0x08001136. And my motherboard (ASRock X370 Taichi) still ships an even older one: 0x08001129.
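To compare notes on microcode, a small sketch that pulls the revision out of /proc/cpuinfo text and compares it with the 0x08001136 mentioned above. The function names are made up for illustration, and comparing revisions numerically is an assumption that only makes sense within the same CPU family.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) listed in /proc/cpuinfo text."""
    found = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in found}


def is_older_than(revision, reference=0x08001136):
    """True if `revision` is numerically lower than the reference revision."""
    return revision < reference
```

On a running system the input would be the contents of /proc/cpuinfo. More than one distinct value in the returned set would mean the cores are not all on the same revision.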

Are you still planning to fix it, or is the only way to replace the CPU with Ryzen 2?

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

For completeness...

...one of the theories is that disabling cstate 6 contributes to preventing random freezes.  My experience is of finding the machine frozen in the morning, when it has been idle overnight.  So some problem with the "Deep Power Down" state seems probable.

...however, the kernel command-line option "processor.max_cstate=5" does not appear to disable (cstate) C6 😞  I am unable to determine what (if anything) it does do.
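One way to see what the kernel itself thinks is available is to list the cpuidle states it exposes under sysfs. This is only a sketch: the function name is hypothetical, the default path assumes the usual cpuidle layout for cpu0, and the state names reported there are the kernel's view (typically ACPI C-states), which may not map one-to-one onto the hardware's C6.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(directory, state name, disabled?)] for each cpuidle state found."""
    states = []
    base = Path(cpuidle_dir)
    # Directories are named state0, state1, ...; sort them numerically.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states
```

Comparing the output with and without processor.max_cstate=5 on the command line would at least show whether the option changes the list of states the kernel is prepared to use.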

...but at ZenStates-Linux there is 'zenstates.py' which will disable C6.  And (with any luck) that may do the trick.

It is, of course, disappointing that my shiny Ryzen CPU is unreliable.  But it is *infuriating* that nobody seems to know what the problem is, or whether some or all of the suggested spells are required to avoid these apparently random freezes.

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Is there any point in opening a support ticket for this?

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I have sent a message via <http://support.amd.com/en-us/contact/email-form>, referring to this thread and the Kernel Bug.  [Service Request: 8200794428]

I assume AMD are already aware of the Kernel Bug... but it would be nice to hear whether a BIOS and/or microcode fix is coming, or whether the Linux Kernel folk should be persuaded to incorporate a permanent "do not use C6" for Ryzen CPUs (assuming that's really the answer !).

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I most likely will just buy Ryzen 2 this spring, since I don't think the fix is coming any time soon. But time will tell.

I also might end up building the kernel with CONFIG_RCU_NOCB_CPU=y.

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Just built the kernel with CONFIG_RCU_NOCB_CPU=y, and it has been running OK so far (kernel boot parameter: rcu_nocbs=0-15). CPU temperature is slightly higher than before (I assume it does less core parking, or maybe something else is affected).

But that's what I'd expect: the freeze happens because of insufficient power delivery in some of the C6 states, so somehow rcu_nocbs=0-15 must increase that power (while C6 states are still enabled), which should raise the temperature somewhat.

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

I have now heard back from TECH.SUPPORT@AMD.COM.  What they said, and what I have said in reply, are shown below.

I can today say that my machine has not crashed in the last 7 days.  I cannot say how disappointed I am that this is cause for celebration 😞

                                                                                                                                         

On 28/02/18 10:51, TECH.SUPPORT@AMD.COM wrote:

> Dear Chris,

> Your service request : SR #8200794428 has been reviewed and updated.

> Response and Service Request History:

> Thank you for the response.

> I understand you are experiencing an issue on your PC with Ryzen processor when C6 state is enabled on the BIOS.

Yes, my PC freezes at random when it is idle.  Typically it will freeze when left overnight, roughly every 2 or 3 days.

Having disabled C6 -- using the 'zenstates.py' script, from https://github.com/r4m0n/ZenStates-Linux -- my machine has not frozen for 7 days.

> This issue has been fixed with the latest BIOS updates, but the option to fix it may not be available in all BIOS.

What is the root cause of the issue ?

In what way has it been fixed ?

> I request you to update to the latest BIOS and see if you have the Power Supply Control option in the MB BIOS. Try toggling this option between the different settings to see if it fixes it. If the specific option is not available I would suggest you keep C6 off for now.

I have the latest available "PRIME X370-PRO BIOS 3803" from ASUS.  That apparently includes:

  2.Update to AGESA 1000a for new upcoming processors

I understand that means AGESA 1.0.0.0a (?) -- I have no idea what that means, since AMD seem to keep the release notes for AGESA as a deep, dark secret.  Previous BIOSes had (according to ASUS) "AGESA 1071", and before that "AGESA 1.0.0.6B" and "AGESA 1.0.0.6a"... so I admit to being baffled.

What does this new BIOS "Power Supply Option" do ?

Are you telling me that this is a problem with my power supply ?

If so, does this mean I need a better power supply ?

Disabling C6 is not really a long term solution... since that disables both (a) the maximum single core performance, and (b) the minimum power consumption state.  While these are arguably marginal, I have wasted a lot of time and energy trying to get my machine to work reliably.

I am seriously disappointed that the only information available is buried in kernel bug report(s) and in various support forums.

Having (eventually) found Linux Kernel Bug 196683, I have been hoping that AMD would leap into action to: (a) inject some clarity into the discussion, and (b) provide a proper solution.

<sigh>

For completeness, let me repeat my questions:

  1) What is the root cause of the issue ?

  2) In what way has it been fixed ?

  3) What does this new BIOS "Power Supply Option" do ?

  4) Are you telling me that this is a problem with my power supply ?

  5) If so, does this mean I need a better power supply ?

Thanks,

Chris 

> Best regards,

> HK

> AMD Global Customer Care

> The contents of this message are provided for informational purposes only.  AMD makes no representation or warranties with respect to the accuracy of the contents of the information provided, and reserves the right to change such information at any time, with or without notice.

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

> Update to AGESA 1000a for new upcoming processors

That's what I see also for my ASRock X370 Taichi (except no version is specified): Update AGESA for future coming processors.

It didn't even update the microcode, it seems. And where exactly is that option for power supplies? Let's hope some future update will actually fix it, but for now building a custom kernel and not disabling C6 looks like the best possible workaround.

shmerl
Elite

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Can you please also ask AMD exactly which AGESA version provides the fix?
