

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

To do that, I need a day or two when the company is on vacation or holiday and I can shut off their servers. With a spare disk in that system, I can do it.

Or I can wait until I set up a spare server to swap in for this Ryzen system. There is already a plan to have this spare server next month or in April, after a trip overseas.

errsta
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Your docker script locked up my system.

What did you want to look at (syslog/messages/docker logs/??)?

this is from syslog:

Mar  1 16:43:50  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=2158  
Mar  1 16:43:50  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=2159  
Mar  1 16:44:54  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:44:54  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=6553  
Mar  1 16:44:54  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=6554  
Mar  1 16:45:57  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:45:57  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=10907  
Mar  1 16:45:57  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=10907  
Mar  1 16:47:00  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:47:00  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=15299  
Mar  1 16:47:00  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=15300  
Mar  1 16:48:03  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:48:03  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=19557  
Mar  1 16:48:03  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=19557  
Mar  1 16:49:06  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:49:06  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=23766  
Mar  1 16:49:06  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=23766  
Mar  1 16:49:53  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:49:53  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=28155  
Mar  1 16:49:53  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=28155  
Mar  1 16:49:53  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:50:21  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:50:49  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:51:12  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:51:12  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=33029  
Mar  1 16:51:12  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=33029  
Mar  1 16:51:13  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:30442]
Mar  1 16:51:17  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:51:41  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:30442]
Mar  1 16:51:45  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:52:09  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:52:09  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=37651  
Mar  1 16:52:09  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=37651  
Mar  1 16:52:09  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:30442]
Mar  1 16:52:13  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]
Mar  1 16:52:37  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:30442]
Mar  1 16:52:41  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [atop:15084]
Mar  1 16:53:05  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:53:05  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=42517  
Mar  1 16:53:05  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=42517  
Mar  1 16:53:05  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:30442]
Mar  1 16:53:09  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [atop:15084]
Mar  1 16:53:33  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:30442]
Mar  1 16:53:37  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [atop:15084]
Mar  1 16:54:01  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:30442]
Mar  1 16:54:05  kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  1 16:54:05  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=47508  
Mar  1 16:54:05  kernel: rcu:   15-...0: (1 GPs behind) idle=ee2/1/0x4000000000000000 softirq=6372118/6372121 fqs=47508  
Mar  1 16:54:05  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [atop:15084]


Let me know if there's something specific you'd like to see...
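A log like the one above is easier to compare between systems once it is reduced to counts per CPU. The following is only a minimal sketch, assuming syslog lines in the same format as the excerpt; the function name `summarize` and the regular expressions are illustrative and are not part of any existing tool.

```python
import re
from collections import Counter

# Matches watchdog lines such as:
#   "... watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]"
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")
# Matches the per-CPU RCU stall detail lines, e.g. "rcu:   14-...0: (1 GPs behind) ..."
RCU_STALL = re.compile(r"rcu:\s+(\d+)-\S+: \(")


def summarize(lines):
    """Count soft lockups per (CPU, task name) and RCU stall reports per CPU."""
    lockups = Counter()
    stalls = Counter()
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            lockups[(int(m.group(1)), m.group(3))] += 1
            continue
        m = RCU_STALL.search(line)
        if m:
            stalls[int(m.group(1))] += 1
    return lockups, stalls


if __name__ == "__main__":
    # A few lines copied from the excerpt above; replace with the real log file.
    sample = [
        "Mar  1 16:49:53  kernel: rcu:   14-...0: (1 GPs behind) idle=f9e/1/0x4000000000000000 softirq=13830753/13830757 fqs=28155",
        "Mar  1 16:49:53  kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [atop:15084]",
        "Mar  1 16:51:13  kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:30442]",
    ]
    lockups, stalls = summarize(sample)
    for (cpu, task), count in sorted(lockups.items()):
        print(f"soft lockup: CPU#{cpu} task={task} reports={count}")
    for cpu, count in sorted(stalls.items()):
        print(f"rcu stall:   CPU {cpu} reports={count}")
```

To run it over a whole log, pass `open("/var/log/syslog")` (the path varies by distribution) to `summarize` instead of the sample list.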

ruspartisan
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Just the fact that your system froze was enough. I see a lot of people saying that some option or setting helped them, but so far the only thing that definitely helps is disabling SMT. This docker script just recreates the problem on any system (including Windows!), and I don't believe the problem is with Docker, because native Xubuntu 18.04 freezes in exactly the same way.
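When comparing results with SMT disabled, it is worth confirming that the kernel really sees SMT as off and not only the BIOS menu. A minimal sketch, assuming a kernel that exposes the `/sys/devices/system/cpu/smt` directory (older kernels may not have it); the function name `smt_status` is hypothetical.

```python
from pathlib import Path

# Sysfs directory used by kernels that support runtime SMT control.
SMT_DIR = Path("/sys/devices/system/cpu/smt")


def smt_status(smt_dir=SMT_DIR):
    """Return (control, active) as strings; a missing file yields None."""
    def read(name):
        try:
            return (Path(smt_dir) / name).read_text().strip()
        except OSError:
            return None
    return read("control"), read("active")


if __name__ == "__main__":
    control, active = smt_status()
    if control is None:
        print("SMT sysfs interface not found on this kernel")
    else:
        # control is typically on/off/forceoff/notsupported; active is 0 or 1.
        print(f"SMT control={control} active={active}")
```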

errsta
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

For my specific use case, the workaround in my previous post seems to be "good enough". I've done two multithreaded kernel compiles with no problems, and my uptime is far better than anything I had experienced prior to the workarounds.

Hopefully it is legitimately fixed at some point, though, as there is clearly an issue that needs to be fixed (not just worked around). Thanks, ruspartisan!

shinobi
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Just chiming in to say that there is a newer BIOS for ASUS X370-PRO motherboard users:

Version 4406
2019/03/11
10.24 MBytes
However, they recommend updating to AMD chipset driver 18.50.16 or later before updating the BIOS.
I'm not really sure how that applies in a Linux environment.
It is also interesting to note that the BIOS size has nearly doubled, from 5.87 MBytes in 2017 to 10.24 MBytes in 2019!
skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

News on the Ryzen stability front.

I recently updated three systems in the field that were the most problematic of over a dozen systems we have deployed.    See my previous posts.

In the first case, the system that was freezing about every two weeks, I replaced the motherboard and power supply while keeping the same CPU. The new motherboard is a newer model with the X470 chipset. I updated the BIOS and turned on "Typical Idle Current". Running the same workload it has been running for months, it has frozen twice in the past 6 weeks, so essentially the same as before. One might think that this particular CPU is the issue; again, see my previous post.

The second case is another, almost identical Ryzen system at the same site. It has only rarely locked up (2 times in the past 6 months). I just updated the BIOS to the latest version and turned on "Typical Idle Current". It has, however, locked up once again since this update.

The third case is a system that would freeze every couple of months (typical, and shocking, for all the non-overclocked, moderately loaded Ryzen systems I have experience with). I updated the BIOS and enabled modest overclocking, as this has seemed to stabilize Ryzen systems (again, see my prior posts). It has not locked up since I enabled the overclocking a few weeks ago, but it is a bit too early to draw conclusions.

I will also note that over the past 3 months I have had 3 Ryzen systems here and have given them various workloads, from doing nothing to CPU mining. I have had only one freeze, and that was on one of the systems running a test that repeatedly compiled the Linux kernel using multiple threads. So there is not enough data to draw conclusions about workload effects. But since 2 of the 3 systems were powered on but doing nothing most of the time for many months prior with no lockups, it seems that the lockup problem I am noting here is not the idle problem but something else.

    

All the systems I noted in this post are running at nominal CPU temperatures even when heavily loaded.

Lastly, I will note that of the 3 Threadripper systems we have deployed running similar workloads, none has locked up in about 30 months of operation. So something must be different about Threadripper or about the VRM on typical Threadripper motherboards.

It would be nice if someone from AMD would weigh in on this, as I have seen many posts, not so much here but elsewhere, with similar reports of random, infrequent lockups of Ryzen systems. I assume most users have not reported this because most desktop systems have light or no load most of the time, so for most users the lockups are probably many months apart.

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

To the handful of others that have been RECENTLY posting on this thread.

It appears that most of the posts have morphed from the original idle problem, which seems solved by the "Typical Idle Current" BIOS setting, to the same thing I am noting: random crashes under various "server type" workloads.

I tend to think we are all seeing the same thing, just with varying frequency. It sounds like some of you can reproduce it within days, which is not something I have achieved.

For those who can frequently reproduce complete system freezes (not software crashes), I would be very curious to know whether you could set modest overclocking in the BIOS (just a few percent above rated speed) with a constant core voltage of 1.40 V. This setting should ensure that the VRM stays at this constant voltage.

If done right, your CPU will run warm, about 50 °C at idle.

I am curious to see whether this stabilizes your system under the workloads you run.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Dear All,

I am back here today to provide another update.

During the past 40 days or so my system has been 100% stable, without any lockup.

It is thus conclusive that the following are the crucial fixes in my case:

  • BIOS update to disallow the C6 CPU power state.
  • Linux kernel boot parameter idle=nomwait. This MUST BE set in every virtual machine without exception!
  • A Linux kernel version other than Ubuntu's 4.18.xx. This is NOT a Ryzen-specific issue; it happens on my Intel system as well!
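Since the boot parameter has to be present on the host and in every virtual machine, a quick check on each running system can catch one that was missed. This is a minimal sketch, assuming a Linux system where `/proc/cmdline` holds the running kernel's command line; the helper names `parse_cmdline` and `missing_params` are hypothetical, and the simple whitespace split does not handle quoted values containing spaces.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline_text, wanted):
    """Return the wanted 'name=value' settings that are absent or set differently."""
    have = parse_cmdline(cmdline_text)
    return [f"{name}={value}" for name, value in wanted.items()
            if have.get(name) != value]


if __name__ == "__main__":
    wanted = {"idle": "nomwait"}  # the parameter named in the list above
    path = Path("/proc/cmdline")
    if path.exists():
        missing = missing_params(path.read_text(), wanted)
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print("/proc/cmdline not available on this system")
```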

I also tried things like changing the power supply unit and the cooling fan, and found these irrelevant.

The irony now is that the local AMD distributor has offered me a TIME-LIMITED replacement within the next few days, and I am quite unsure about it. When I asked whether the replacement unit is one in which AMD has solved the mwait & C6 lockup issue, they said they have NO IDEA, only that the replacement unit belongs to their NEWEST BATCH and is a much newer version than mine (I gave them the serial number from the packing box for the RMA earlier).

There are 3 possible outcomes to expect when I change it:

Optimistic Case = solved C6 & mwait lockup perfectly

Waste-Time Case = nothing changed - still exactly the same old way

Nightmare Case = opening a whole new can of worms, and bugs for me to waste time debugging!

However, before changing, I would like to TEST SKULL's program.

I searched my Gmail for a zip file which I believe I had previously received, but did not find it. Can someone please point me to where to download it, with some basic instructions on how to compile and test it, before I exchange the CPU with the AMD distributor?

Thanks

uyuy

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Screen shot Docker NOT FOUND

I tried to run the Docker script from the ZIP file whose URL was posted here, but it failed to run.

Today I exchanged the Ryzen CPU under RMA with the local AMD distributor. I have not yet tested whether it will hang with C6 enabled or without the idle=nomwait boot parameter. I will try within several days when I have time, and will update this forum. So far the replacement CPU has booted and is running everything as before the exchange.

bash64
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

The rcu_nocbs=0-23 and processor.max_cstate=5 boot line options completely stopped all crashing in Linux Mint 19.2 on my new second-generation Threadripper build. I was going nuts.
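The range in rcu_nocbs=0-23 names logical CPUs 0 through 23, so anyone copying the option to a machine with a different thread count would presumably want to adjust it. A minimal sketch, assuming only plain comma-separated numbers and `a-b` ranges (the kernel accepts richer CPU list syntax that is not handled here); `parse_cpu_list` and `rcu_nocbs_for` are hypothetical helper names.

```python
import os


def parse_cpu_list(spec):
    """Expand a simple CPU list such as '0-23' or '0,2-3' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def rcu_nocbs_for(n_cpus):
    """Build an rcu_nocbs option covering logical CPUs 0..n_cpus-1."""
    return f"rcu_nocbs=0-{n_cpus - 1}"


if __name__ == "__main__":
    n = os.cpu_count() or 1
    print("logical CPUs:", n)
    print("covering option:", rcu_nocbs_for(n))
    print("0-23 covers all CPUs here:", parse_cpu_list("0-23") >= set(range(n)))
```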
