

samx
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Hi. You can find all the details about my Asus motherboard here: https://www.asus.com/Motherboards/TUF-B450-PLUS-GAMING/overview/

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

About a week after my last update, my troubled system is now 99% stable.

It is disappointing, and frankly annoying, that I have to say 99% instead of 100%. What annoys me is that this fault seems to show an ANALOG nature in a digital computer system. That is a rare experience these days; my own days with analog electronics were more than a decade ago.

I call it analog because each of the factors below seems to contribute 20%~30% of the soft lockup problem, and adding or removing any one of them increases or decreases the random lockups. Even now I still see one or two odd things that I cannot explain or trace to a cause, that never happened on my previous server, and that I can only clear by RESTARTING my virtual machine. One of them is the VPN connection: it works OK when freshly started and remote clients connect fine, but once a remote client manually disconnects and later tries to reconnect, it never succeeds unless I reboot the VPN server.

The elements for me so far are:

  • BIOS setting "Typical Current Idle"
  • Kernel boot parameter idle=nomwait
  • Kernel version
  • CPU power state C6

My suspicion now is that things could still improve, and stop bugging me, if I replaced the system's 2-year-old power supply unit, which was carried over from the Intel Asus board that was retired when I upgraded to Ryzen.

My disappointment is that I expected this bug to have a DIGITAL nature. Say, for example, CPU power state C6 definitely caused the issue: then disabling it MUST remove 100% of the problems and make the system 100% stable, and re-enabling C6 should bring all the old troubles back. That element would show a digital ZERO or ONE nature, CAUSING or FIXING the bug. It is quite annoying that this is NOT the case.

I have found that everything in the list above affects the soft lockup issue, yet none of the theories is 100% right! No theory so far fully explains the bug or convinces me that it is solved.

To make it even murkier, my last remaining issue, the VPN re-connection, can sometimes be recovered with an init 6 reboot. At other times, however, I need to power the virtual machine off and on again! But it is only a VM, with snapshots to protect against data loss and backup copies as well, so I did these virtual power cycles without fear of breaking anything. I won't do this to the actual computer.

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

All,

I have commented before that I strongly believe "most" of the strange behavior noted in this thread is the result of power issues at the CPU core(s). I have connected a scope and have some plots, but it appears there is no way to post them here? I will put them on another web site soon and provide a link.

I am not sure they tell much, as they were taken from a system that to date I have not been able to get to freeze, regardless of the Idle Current setting.

What I have done is put together a small test program that cycles between being idle (sleeping for 5-10 sec) and quickly waking N (4-64) threads that run a complex workload (a silly calculation) using AVX, which should get each core up to near max power consumption. When running this I see on the scope the Vcore voltage going from very low (<0.4V, I think just charge left in the caps) to as high as 1.5V on the test system. But I have not had a freeze on this system or on another Ryzen system I tried.

I am curious whether anyone out there who has had system freezes can also get them while running this, and whether they are more or less frequent. Also, do they only happen with the Idle Current setting at its default, or also when Typical Current Idle is set?

Below is the source code for this program. Just copy and paste it into an editor, save it as waketest.c and compile with:

gcc -g -mtune=native -mavx -ftree-vectorize -O3 -fopt-info-vec  waketest.c -o waketest -l pthread

To run: ./waketest N, where N is the number of threads on your Ryzen CPU (for a 1700 or above this should be 16).

--------------------------------- Start of waketest.c -----------------------------

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>  /* Semaphores are not part of Pthreads */
#include <unistd.h>

#define MAX_THREADS 256

/* Global variables:  accessible to all threads */
int thread_count;
sem_t semaphores[MAX_THREADS];
unsigned int sum[MAX_THREADS];
unsigned int total;

void Usage(char* prog_name);
void *ProcFull(void* id);  /* Thread function */

/*--------------------------------------------------------------------*/
int main(int argc, char* argv[]) {
   long       thread;
   pthread_t thread_handles[MAX_THREADS];
   int testnum;

   if (argc != 2) Usage(argv[0]);
   thread_count = strtol(argv[1], NULL, 10);
   if (thread_count <= 0 || thread_count > MAX_THREADS) Usage(argv[0]);

   for (testnum=0;testnum<1000;testnum++) {
      printf("\nWake up test iteration %d\n",testnum);

      for (thread = 0; thread < thread_count; thread++) {
         /* Initialize all semaphores to 0 -- i.e., locked */
         sem_init(&semaphores[thread], 0, 0);
      }

      // start the workers; each fills its arrays, then blocks on its semaphore
      for (thread = 0; thread < thread_count; thread++)
         pthread_create(&thread_handles[thread], (pthread_attr_t*) NULL,
             ProcFull, (void*) thread);

      int sleeptime=5+rand()%5;
      printf("Sleeping for %d\n",sleeptime);
      sleep(sleeptime);

      // wake up N threads very quickly going from idle to near max in less than 1us
      for (thread = 0; thread < thread_count; thread++) {
        sem_post(&semaphores[thread]);  /* let thread go */
      }
      printf("Woke up all threads!\n");

      // wait for all threads to complete
      for (thread = 0; thread < thread_count; thread++) {
        pthread_join(thread_handles[thread], NULL);
      }

      total=0;
      for (thread = 0; thread < thread_count; thread++) {
         sem_destroy(&semaphores[thread]);
         total+=sum[thread];
      }

      // display silly total
      printf("All Threads Have Ended Total=%u\n",total);
   }

   return 0;
}  /* main */

/*--------------------------------------------------------------------
* Function:    Usage
* Purpose:     Print command line for function and terminate
* In arg:      prog_name
*/
void Usage(char* prog_name) {
   fprintf(stderr, "usage: %s <number of threads>\n", prog_name);
   exit(EXIT_FAILURE);
}  /* Usage */

/*-------------------------------------------------------------------
* Function:       ProcFull
* Purpose:        Fill some small arrays, wait on this thread's
*                 semaphore, then run a silly calculation that keeps
*                 the core busy and store a checksum in sum[].
* In arg:         id (thread index, passed as a long)
* Global in:      semaphores
* Global out:     sum
* Return val:     Ignored
*/
// function with silly loop that will get vectorized SSE2 to Max out CPU core
void *ProcFull(void* id) {
   long tid = (long) id;
   int i,n;

   // these arrays should fit in L1$
   unsigned int a[256] __attribute__((aligned(16)));
   unsigned int b[256] __attribute__((aligned(16)));
   unsigned int c[256] __attribute__((aligned(16)));
   unsigned int d[256] __attribute__((aligned(16)));
   unsigned int res[256] __attribute__((aligned(16)));

   // rand() is called from several threads here; the values only seed
   // the arrays, so any values will do
   for (i=0;i<256;i++) {
      a[i]=rand()%1024;
      b[i]=i*3+7;
      c[i]=i*7+819234;
      d[i]=i*62134+66;
      res[i]=rand()%256;
   }

   sem_wait(&semaphores[tid]);  /* Wait for our semaphore to be unlocked */
   // printf("Thread %ld\n", tid);

   for (i=0;i<0x1000000;i++) {
      for (n=0;n<256;n++) {
         res[n]+=(a[n]*c[n]+b[n]*d[n]);
         a[n]=res[n]+res[n];
      }
   }

   sum[tid]=0;
   for (i=0;i<256;i++) {
      sum[tid]+=res[i];
   }

   return NULL;
}  /* ProcFull */

--------------------------------------  End of waketest.c -------------------------------

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Well Done!

I may want to test that once I am free of the burden of several projects. I have not used my scope for about a decade; it was shipped to a warehouse and I doubt it still works. I may use a digital meter instead.

Coincidentally, I replaced the troubled system's power supply unit, a Corsair RM650 (approx. 3 yrs old) previously used with the Intel board, with a 750W Seasonic Focus FM-750W. I am curious whether this will make any difference. So far it has given me no issues in the last 10 hours. In several days I will report whether my last chunk of instability is gone or not. If the system becomes perfect, I will begin reverting the other measures one by one for testing, i.e. the Linux kernel boot parameter (idle=nomwait), kernel version, BIOS version, BIOS setting (Typical Current Idle) and C6 power state enable/disable.

As mentioned in my last post, I am annoyed that the issue exhibits the ANALOG NATURE of a FAULT in a digital computer, where each factor contributes a certain percentage of the instability. And an ANALOG NATURE points somehow towards POWER SUPPLY voltage stability. That is why I upgraded the power supply today.

The guy at the shop who sold me the power supply unit today was shocked to learn about this Ryzen stability issue. He has been selling Ryzens, mostly running Microsoft OS, ever since AMD brought Ryzen to market, and he had not come across it. I emailed him the URL of this forum thread.

As for stability after the new power supply, I notice that one notorious and quite consistent problem is gone tonight: the VPN re-connection trouble. This VPN worked fine with Intel and started to have re-connection issues after the change to Ryzen. It would connect after resetting the VPN server, but after a deliberate disconnection it was very hard to re-connect successfully. After the power supply change it seems to re-connect more easily, almost the same as when it was on Intel.

skull
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Uyuy,

It sounds like you have experienced just about every problem I have heard reported.

Your point about problems being deterministic is well taken, but remember that although the flow of software is mostly deterministic, the timing of when things happen is not, and many hidden software problems come from timing issues, especially in modern multi-threaded programs.

The only Ryzen-specific issue I have had is these damn random hard locks (in my case) on every Ryzen system to date that is not running with the CPU voltage (VRM) set to a fixed 1.40V in the BIOS and overclocking enabled (which should disable the on-board voltage regulation in the Ryzen chip). On most systems, though, these hard locks have been rare, occurring less than once a month on average.

The problems you note seem to be occurring quite often, whereas I have not seen any frequent issues.

I will say, though, that on the PSU subject it does seem to matter. I put together a Threadripper (16-core) system about 6 months ago. Originally I had a lame 650W supply in it and was seeing hard locks about twice a week, and not just while idle. After about 2 months I switched to a much better 850W supply and have not had any hard locks since. I did have two desktop freezes due to AMDGPU, but I was able to ssh in and reboot, so I know they were not hard locks.

Power issues can cause other problems too; for instance, many years ago a flaky 3.3V rail caused strange networking problems. So with power problems, different people may see different things happen, usually appearing as random instabilities.

imshalla
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

The waketest is interesting, but I have my doubts about it...

My experience of the "freeze-while-idle" is on a machine which overnight is doing next to nothing, and when it did freeze I would see absolutely nothing in the log after some essentially random time in the morning.  The machine did not freeze every night.  Apart from cron jobs, the machine would be woken up fairly regularly by spurious ssh login attempts from the outside world (background level attempts to break into the system).  The longest period when nothing was logged was generally 7 to 10 minutes.  I assume that in those periods the machine was, indeed, completely idle.

Note that in my experience the machine was not heavily loaded when it did wake up.  Which is not what your test sets out to do.

Further, your test is only going to sleep for 5 seconds. I don't know how long it takes for a given Linux kernel to send the CPU into the deepest of deep sleeps, but I wonder if 5 seconds is long enough? I note that you find that's long enough for the voltage to drop right down to 0.4V, which sounds like deep sleep, though I note also that you have not managed to trigger a freeze while running the test.
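
One way to check whether 5 seconds is long enough would be to read the kernel's cpuidle counters before and after the sleep and see whether the deepest state was entered. The following is only a minimal sketch, assuming the usual Linux cpuidle sysfs layout (stateN/name, stateN/usage, stateN/time under /sys/devices/system/cpu/cpu0/cpuidle). The function names are made up for illustration and are not part of waketest.c, and which state corresponds to C6 on a given board is something you would have to confirm yourself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base>/state<idx>/<field>", e.g. ".../cpuidle/state2/usage". */
static void state_path(char *out, size_t len, const char *base,
                       int idx, const char *field) {
    assert(out && base && field && len > 0);
    snprintf(out, len, "%s/state%d/%s", base, idx, field);
}

/* Read the first line of a small text file, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_line(const char *path, char *buf, size_t len) {
    assert(path && buf && len > 0);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok) return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print name, entry count and total residency (microseconds) of every idle
 * state found under `base`. Returns the number of states printed. */
int print_idle_states(const char *base, FILE *out) {
    int n = 0;
    for (;; n++) {
        char path[512], name[64], usage[64], time_us[64];
        state_path(path, sizeof path, base, n, "name");
        if (read_line(path, name, sizeof name) != 0) break;  /* no more states */
        state_path(path, sizeof path, base, n, "usage");
        if (read_line(path, usage, sizeof usage) != 0) strcpy(usage, "?");
        state_path(path, sizeof path, base, n, "time");
        if (read_line(path, time_us, sizeof time_us) != 0) strcpy(time_us, "?");
        fprintf(out, "state%d %-8s entries=%s time_us=%s\n",
                n, name, usage, time_us);
    }
    return n;
}
```

Calling print_idle_states("/sys/devices/system/cpu/cpu0/cpuidle", stdout) from a small main, once before and once after an idle period, and comparing the entry counts of the last state listed would show whether the core got that deep in the time available.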

-------------------------------------------------------------

Of course, I assume Mr AMD knows what the problem is, so we are all wasting our time trying to work it out 😞

My feeling is that most of the issues that uyuy is reporting are *not* related to the problem which "Typical Current Idle" addresses.  Indeed, I suspect that a variety of "lockups", nothing to do with the CPU, are being wrongly attributed to a suspected bug in Ryzen CPUs.

amdmatt, are you there?  I suggest that a little openness about the bug addressed by "Typical Current Idle" would be better PR!  Silence is not only alienating your customers, but also causing the Ryzen CPU to be blamed for "lockups" which are not its fault!

rbkreckel
Journeyman III

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

For the record: I have been suffering from the "lock while idle" problem for a while, and setting "power supply idle control" to "typical current idle" on my ASRock X470 solved it. Out of curiosity, I updated to the latest BIOS (AGESA 1.0.0.6) and Linux kernel 4.19.12 and tried it without "typical current idle": it locked up again.

My conclusion: this problem is still unsolved on the BIOS/kernel side, and "typical current idle" is still the only workaround, with no insight or support from AMD whatsoever.

This may sound harsh, but, please, do not hesitate to correct me if I'm wrong. (I still hope I am.)

ruspartisan
Adept I

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

We decided to get 3 Ryzen 2700X machines for testing at work, and we are bitterly disappointed that they randomly freeze in Linux.

The only message I get in journalctl is "rcu_sched detected stalls blablabla", and sometimes I don't even get that. I test the system by compiling QEMU with -j16 in a loop (so it has some time under load and some time idle).

Things I've tried:

- Typical Current Idle
- idle=nomwait
- recompiling the kernel and adding rcu_nocbs=0-15, on kernel versions 4.15.0-43 and 4.20
- zenstates --disable-c6
- processor.max_cstate=3 (not even 5, just to be sure)
- setting SoC voltage to 1.1V instead of the default 0.8-0.9V
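
When several boot parameters are being tried in combination, it is easy to lose track of which ones the running kernel was actually booted with. A minimal sketch of a check against /proc/cmdline follows; the helper names cmdline_has and report_cmdline are hypothetical, and the parameter strings are simply the ones listed above. It only matches whole space-separated tokens and says nothing about whether a parameter had any effect.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `param` appears as a whole space-separated token in `cmdline`,
 * so that "idle=nomwait" does not match inside "noidle=nomwait". */
int cmdline_has(const char *cmdline, const char *param) {
    assert(cmdline && param);
    size_t plen = strlen(param);
    if (plen == 0) return 0;
    const char *p = cmdline;
    while ((p = strstr(p, param)) != NULL) {
        int starts = (p == cmdline) || p[-1] == ' ';
        char end = p[plen];
        int ends = (end == '\0' || end == ' ' || end == '\n');
        if (starts && ends) return 1;
        p += 1;  /* keep searching after this partial match */
    }
    return 0;
}

/* Read the first line of `path` (normally /proc/cmdline) and report each
 * parameter as present or absent. Returns the number of parameters found,
 * or -1 if the file cannot be read. */
int report_cmdline(const char *path, const char *const params[],
                   int nparams, FILE *out) {
    char buf[4096];
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    if (!ok) return -1;
    int found = 0;
    for (int i = 0; i < nparams; i++) {
        int present = cmdline_has(buf, params[i]);
        found += present;
        fprintf(out, "%-26s %s\n", params[i], present ? "present" : "absent");
    }
    return found;
}
```

A small main could pass an array such as {"idle=nomwait", "processor.max_cstate=3", "rcu_nocbs=0-15"} to report_cmdline("/proc/cmdline", ...) and log the result next to each test run, so that a freeze can be tied to the exact combination in use.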

Configuration:

- 2700X
- 4 x 16GB RAM, rated at 3000MHz, but I set it to 2133 just to be sure (lockups occur much faster at 3000MHz, literally within minutes)
- ASRock B450 Pro4 in one PC and ASUS X470 Pro in the two others
- Samsung 860 EVO 250GB
- Aerocool KCAS 650 80+ Gold PSU
- GT1030

I've tried the B450 Pro4 under Windows, and it's already at 48 hours of uptime, while Linux freezes after 1-5 hours. I'm currently trying another PSU (800W platinum), but I don't expect it to help either. I have no idea what to try next.

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Hi Skull,

I have a variety of problems but have almost never had the WORST one that I have read about here and elsewhere, the DEAD LOCK-UP that requires the motherboard reset button or power button to clear. About 95% of my freezes are cleared by my favorite Kubuntu SSH root command, systemctl restart sddm. Doing this disrupts my server's VMs, so I hate it. In some rare cases I have lost even SSH.

The main pattern of my lockups is shown by Kubuntu's ksysguard (see picture URL below) and the Linux top command: none of my logical or physical CPU cores ever stayed at 0% usage. They are all still alive! But some threads just got FROZEN!

[Image: ksysguard.jpg, hosted on imgbb.com]

[Image: Top-command1.png]

On very rare occasions, when the system is left unattended for long hours, I even lose SSH; I think this is because more and more idle tasks get frozen. Then I need to hit the reset button on the case.

I suspect some bad kernel scheduler issue froze some of my NON-IDLE threads, specifically the nightly backup, which is I/O, memory and CPU intensive. That was on a 4.18.0-X kernel; after changing the kernel the nightly backup threads no longer freeze, so I concluded that this part at least cannot be blamed on the Ryzen CPU.

An update on my system's stability, days after changing the power supply unit to 750 watt: there is no very significant improvement. The one remaining issue, the re-connection difficulty with the VPN server, improved only slightly and is not totally gone. My newest findings suggest it is not so likely to be caused by Ryzen; I need another week to investigate further.

My question to SKULL is: HOW do I find the right measurement points on the motherboard for the VRM output? My motherboard has no markings, so my best chance is to probe around the surface-mounted capacitors with a multimeter and GUESS. Further, I have read somewhere that Ryzen (at least some models) has its own Vcore regulators ON-CHIP??

What do you all think when threads are frozen (not responsive) and yet the CPUs are busy working on other threads? That cannot be considered a CPU core lockup, right? Not if the ksysguard CPU usage graphs show that NONE of the CPUs is stuck at ZERO%...

Consider this too: if some CPU cores were frozen in, e.g., the C6 power state, wouldn't the threads get assigned to other CPUs and continue to work? I mean, in this scenario the system would just slow down, from say an 8-core (16 threads with SMT) CPU first to 7 cores (14 threads), then maybe worsening to 6/5/4/3/2/1 cores, and finally DEAD at ZERO cores?

In my own case I am not seeing that scenario at all! I see all 16 of my logical (8 physical) cores alive. Just some threads randomly freeze. And some of the freezes unfreeze themselves; this was the case with my nightly backup crontab job, which was delayed by more than 12 hours and still COMPLETED.

I have not seen any feedback since posting here on whether systemctl restart gdm (for Ubuntu) or systemctl restart sddm (for Kubuntu) helps other users unlock a frozen system. Please post if you found this useful. Otherwise, I will assume that your systems all locked up so completely that even the SSH console was dead??
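
The "are all cores still alive?" check that ksysguard gives graphically can also be done from an SSH session by sampling /proc/stat twice and comparing the per-CPU counters. What follows is a rough sketch under the assumption of the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); the type and function names are invented for illustration. A core that shows activity is clearly not locked up, but a core at 0% may simply have nothing to do, so this can only rule a core lockup out, not prove one.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int cpu;                    /* logical CPU number */
    unsigned long long idle;    /* idle + iowait ticks */
    unsigned long long total;   /* all ticks */
} cpu_sample;

/* Parse one per-CPU line of /proc/stat ("cpuN user nice system idle ...").
 * Returns 1 on success, 0 for the aggregate "cpu" line or any other line. */
int parse_cpu_line(const char *line, cpu_sample *s) {
    assert(line && s);
    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return 0;
    unsigned long long v[8] = {0};
    int cpu;
    int n = sscanf(line, "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu",
                   &cpu, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    if (n < 5) return 0;        /* need at least user, nice, system, idle */
    s->cpu = cpu;
    s->idle = v[3] + v[4];
    s->total = 0;
    for (int i = 0; i < 8; i++) s->total += v[i];
    return 1;
}

/* Busy percentage of one CPU between an earlier and a later sample. */
double busy_percent(const cpu_sample *before, const cpu_sample *after) {
    assert(before && after);
    unsigned long long dt = after->total - before->total;
    unsigned long long di = after->idle - before->idle;
    if (dt == 0) return 0.0;
    return 100.0 * (double)(dt - di) / (double)dt;
}

/* Read up to `max` per-CPU samples from `path` (normally /proc/stat).
 * Returns the number of samples read, or -1 if the file cannot be opened. */
int read_samples(const char *path, cpu_sample *out, int max) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char line[512];
    int n = 0;
    while (n < max && fgets(line, sizeof line, f))
        if (parse_cpu_line(line, &out[n])) n++;
    fclose(f);
    return n;
}
```

Calling read_samples("/proc/stat", ...) twice a few seconds apart and printing busy_percent for each CPU gives roughly the numbers ksysguard plots. Frozen threads on cores that still report activity would fit the description above of a scheduling or I/O stall rather than a dead core.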

uyuy
Adept II

Re: Ryzen linux kernel bug 196683 - Random Soft Lockup

Hi imshalla,

If most other users are getting 100% TOTAL DEAD LOCKUPS that only the motherboard reset button or power button can clear, then yes, my issues most likely have some other, additional causes.

In my case, setting the BIOS to prevent C6, setting the kernel boot parameter idle=nomwait and setting the BIOS to Typical Current Idle each gave some percentage of improvement in system stability, and I am amazed by this ANALOG BEHAVIOR! I expected digital behavior but am not getting it. What I mean is, I expected to see a drastic effect from one particular factor, e.g. the C6 power state: the system freezes if it is enabled and is completely OK if it is disabled. But this is not the case on my system! On my system that change only moves it between bad and worse, worse being with C6 enabled.

The analog characteristic made me suspect power supply voltage stability, which is itself analog. The theory then follows logically that these other factors contribute to drastic fluctuations in the CPU core voltage, which destabilize the system, and thus eliminating each factor slightly helped my stability. This theory led me to replace my power supply unit.
