1,895 Replies Latest reply on Jun 14, 2018 1:36 PM by constantinx
      • 90. Re: gcc segmentation faults on Ryzen / Linux
        raydude

        I achieved relative stability. I found out that the BIOS settings were not being applied reliably: I had disabled SMT but couldn't turn it back on. So I did a CMOS clear, and that fixed SMT.

         

        Then I thought, well, if the BIOS is unreliable I need a new benchmark. So I cleared the CMOS again, booted, and ran a new benchmark.

        After 10 runs, the average time between failures was 341 seconds.
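For reference, the figure above is a plain arithmetic mean of the per-run times to first failure. A minimal sketch of that calculation follows. The sample durations are made up for illustration (the post gives only the 341-second average, not the individual runs), and `mean_time_between_failures` is a hypothetical helper name.

```python
from statistics import mean


def mean_time_between_failures(durations_s):
    """Average how long each run survived before its first failure (seconds)."""
    if not durations_s:
        raise ValueError("need at least one run")
    return mean(durations_s)


# Hypothetical per-run times in seconds; NOT the actual measurements.
example_runs = [120.0, 560.0, 343.0]
print(f"average time between failures: {mean_time_between_failures(example_runs):.0f} s")
```

With only about ten runs the average is a rough indicator, so a single long or short run can move it a lot.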

         

        Then I rebooted, cleared the CMOS again, and the only change I made in the BIOS was to disable SMT.

         

        It has been running for 12 hours without fail.

         

        I think there really is an SMT problem. But I suspect there are other problems as well.

         

        I'm going to crank my memory clock next to see if it can run as fast as it had been but with stability. If it fails, I'll clear CMOS, disable SMT and see if stability returns.

         

        Update: This is horrific. I cleared the CMOS and now I can't get into BIOS. Apparently as soon as I press delete to enter BIOS, USB crashes and my keyboard and mouse stop working. Straight USB keyboard doesn't work either. Front or back ports. It will boot, but I can't change any settings...

         

        Man... Here I thought I was onto something.

         

        Update: I think I damaged the mobo trying to clear the CMOS... I'll see if I can find the damage and repair it today. That's what I get for using a giant screwdriver in a tight spot, without enough light, and in a hurry.

         

        Update: The mobo looks fine. When the system is cold it won't POST. It resets over and over and will not stop until I hold the power button down and force it off. To get it to boot I have to clear the CMOS, and then it boots. I still can't get into the BIOS settings though...

        • 91. Re: gcc segmentation faults on Ryzen / Linux
          alfonsor

          That is pretty much what happened to me: it is totally random and surprising. The moment you think you have found some stability is the moment the "bug" reappears in all its glory. I even bought a brand new HD because I thought my SSDs and HDs had failed. I installed a fresh Gentoo stage 3 on it and recompiled it for 3 days without any problem. All of a sudden the bug said "hello" and everything started again. You might think it is a temperature problem. Nope: the segfaults started in the morning, right after a boot.

           

          So, after 4 motherboards, 1 PSU, 3 sets of RAM and 1 HD, I decided to RMA the CPU and send the last motherboard back, and as a last hope I am waiting for brand new components to arrive. I have been fighting "the bug" since the middle of May, and this was supposed to be a production machine (one that pays my rent with the work I do on it).

           

          This situation is really strange.

          • 92. Re: gcc segmentation faults on Ryzen / Linux
            ichuev

            Yes, but that's probably another problem. (my mobo/cpu is taichi x370/ryzen 1800x)

            I've been working on debugging this and it seems to be related to kernel, not to CPU.

            Haven't yet been able to reproduce this on vanilla kernel 4.12-rc5.

            There is also (probably?) a third problem, with MCEs (machine check exceptions).

            Crashes persist, though, regardless of kernel version or whether SMT is disabled or enabled.

            • 93. Re: gcc segmentation faults on Ryzen / Linux
              sat

              > * What I will do the next

              >

              > Make a reproducer which is more simple and can cause this problem more quick

               

              Unfortunately, I have not managed to create a better reproducer than the current one (building a heavy program).

              However, I have made some progress as a result of a few experiments.

               

              * Summary

               

              This problem can happen with just one core, so it is not caused

              solely by interaction between multiple cores.

               

              * Experiment

               

              Build a Linux kernel (v4.11.5, defconfig) with the following script.

               

              ryzen-problem-repro3 · GitHub

               

              For these runs, I reduced the number of logical CPUs (LCPUs) from 16 (my Ryzen 1800X's maximum) to

              8 (one CCX), 2 (one core, two threads), and 1 (one core, one thread).
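One way to restrict a build to a subset of logical CPUs, as in the experiment above, is to set the CPU affinity of the build process. The sketch below is an assumption about how this could be done on Linux, not necessarily how the linked script does it. `run_pinned` is a hypothetical helper, and which logical CPU IDs share a core or a CCX depends on the kernel's numbering, so check `lscpu -e` before choosing the set.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cpus):
    """Run cmd with its CPU affinity limited to the given logical CPU IDs.

    Linux only: relies on os.sched_setaffinity. Child processes spawned
    by cmd (for example compiler jobs started by make) inherit the mask.
    """
    def pin():
        # Executed in the child process just before cmd is exec'ed.
        os.sched_setaffinity(0, cpus)

    return subprocess.run(cmd, preexec_fn=pin)


if __name__ == "__main__":
    # Demonstration: pin a trivial child to one CPU this process may use.
    one_cpu = {min(os.sched_getaffinity(0))}
    show = [sys.executable, "-c",
            "import os; print(sorted(os.sched_getaffinity(0)))"]
    run_pinned(show, one_cpu)
```

For a real test, `cmd` would be the kernel build command and `cpus` the IDs of the logical CPUs under test. Note that affinity only limits where the build runs; unlike taking CPUs offline, it does not stop other processes from using the remaining CPUs.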

               

              * Result

               

              SEGV happened in all cases. As the number of LCPUs gets smaller,

              the probability of SEGV gets lower too.

               

              * What I'll do next

               

              I'll keep trying to make a better reproducer. Since this problem happens with one core,

              a simpler reproducer is more likely to be achievable. In theory, the workload

              needed to hit SEGV can be obtained from the core dump taken at the SEGV with the current reproducer.

              • 94. Re: gcc segmentation faults on Ryzen / Linux
                sat

                amdmatt bridgman

                 

                Again, AMD, please say something. Why have you ignored us for so long?

                 

                I'm a long-time AMD fan and I'm looking forward to you making the x86 CPU market

                more interesting. However, I'm really disappointed with your support on this problem.

                • 95. Re: gcc segmentation faults on Ryzen / Linux
                  amdmatt

                  Checking in from AMD. The vast majority of users using Ryzen for Linux code and development have reported very positive results.

                   

                  A small number of users have reported some isolated issues and conflicting observations.  AMD is working with users individually to understand and resolve these issues.

                   

                  If you are having a specific issue, please raise a service request on: http://support.amd.com/en-us/contact/email-form.

                   

                  Use “Ryzen Linux Forum Discussion” as the subject and an AMD support person will contact you.

                  • 96. Re: gcc segmentation faults on Ryzen / Linux
                    alfonsor

                    Many of us opened a service request; I did. I received an email in which someone asked me to confirm the components of my system, and then nothing more. Anyway... Are you saying that the many people having the very same problem (not conflicting observations, but the very same bug: massively parallel compilation fails) are just unlucky and there is no problem in the CPU? Can you confirm that?

                    • 97. Re: gcc segmentation faults on Ryzen / Linux
                      tosilva

                      Question: does this problem also occur with AMD graphics cards, or is it restricted to NVidia cards?

                      • 98. Re: gcc segmentation faults on Ryzen / Linux
                        ryzenmaster2017

                        alfonsor I believe what AMD Matt said was many AMD Ryzen users have not had this issue.

                        • 99. Re: gcc segmentation faults on Ryzen / Linux
                          sh0n

                          This issue has occurred on my system with an AMD Rx 460 so I don't believe the issue is GPU-related.

                          • 100. Re: gcc segmentation faults on Ryzen / Linux
                            ahartmetz

                            But can you reproduce it at all at AMD? If not, you should maybe get some consumer grade test setups or so...

                            • 101. Re: gcc segmentation faults on Ryzen / Linux
                              sat

                              Thank you for your reply. I submitted a new support request.

                               

                              > A small number of users have reported some isolated issues and conflicting observations.  AMD is working with users individually to understand and resolve these issues.

                              >

                              > If you are having a specific issue, please raise a service request on: http://support.amd.com/en-us/contact/email-form.

                               

                              Please keep in mind that you said the following in the past.

                               

                              > There is no need to open new tickets on this issue. We are investigating and as soon as there is any updates, i will let you all know in this thread.

                               

                              I had already opened a service request about this problem earlier. After finding this thread, I moved here

                              because of what you said above.

                              • 102. Re: gcc segmentation faults on Ryzen / Linux
                                sh0n

                                Update on my issue:

                                I was able to (hopefully?) resolve my issue by setting the SOC voltage to 1.185v* on my 1800X. I left my computer on overnight recompiling the Linux kernel and it hasn't failed yet. Before this change it would fail within the first few compilations. I am also overclocked to 3.9GHz@1.385vcore, but the issue occurred on stock clocks/voltages too so I'm not sure how relevant that is.

                                 

                                EDIT: fix voltage

                                • 103. Re: gcc segmentation faults on Ryzen / Linux
                                  sven1999

                                  Do you remember what the original value for "SOC voltage" was? What is the exact name in the BIOS (because there is no such value in the latest beta BIOS for ASUS Prime B350-Plus)?

                                  • 104. Re: gcc segmentation faults on Ryzen / Linux
                                    mcl00

                                    I appreciate the reply. I have opened a support ticket as requested.

                                     

                                    I do find this issue truly maddening since any time I think I've made progress towards resolving it, it pops back up. With my new motherboard, I could no longer trigger the segfaults in a timely fashion with my looping compile script. However, I can do so if I loop through compiling mesa while simultaneously running stress-ng with the --aggressive option. In general, the compiler/libc/sh will segfault within 15 minutes or so in that scenario. But, it will occasionally run for hours before segfault'ing which makes it truly frustrating to try to test any changes. You think you've hit on the solution, then bam, it dies again.
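A looping test of the kind described above can be sketched as follows. This is a minimal illustration, not the poster's actual script: `run_until_failure` and `died_from_segv` are hypothetical helper names, and the command to loop (for example a mesa or kernel build) is left to the caller.

```python
import signal
import subprocess
import sys
import time


def run_until_failure(cmd, max_runs):
    """Repeat cmd until it exits non-zero or max_runs is reached.

    Returns (runs_started, elapsed_seconds, last_returncode).
    """
    start = time.monotonic()
    for run in range(1, max_runs + 1):
        rc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL).returncode
        if rc != 0:
            return run, time.monotonic() - start, rc
    return max_runs, time.monotonic() - start, 0


def died_from_segv(returncode):
    """True if the directly launched process was killed by SIGSEGV.

    On POSIX, subprocess reports death by signal N as returncode -N.
    A build tool such as make usually hides this: when a compiler job
    crashes, make itself just exits with a non-zero status.
    """
    return returncode == -signal.SIGSEGV


if __name__ == "__main__":
    # Demonstration with a child that deliberately kills itself with SIGSEGV.
    crash = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    runs, elapsed, rc = run_until_failure(crash, max_runs=3)
    print(runs, died_from_segv(rc))
```

Recording the elapsed time per failure makes it easier to compare settings, given that a single run can last anywhere from minutes to hours before the first segfault.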

                                     

                                    Recently I have tried:

                                    Upgrading to kernel 4.11.6.

                                    Enabling or disabling NUMA in the kernel

                                    Enabling or disabling ASLR

                                    Using the performance cpu scaling governor (normally I use ondemand)

                                    Recompiling the system (gcc 6.3) with -O2 -pipe -march=znver1 -mtune=haswell compiler flags

                                    Recompiling the system (gcc 6.3) with just -O2 -pipe

                                    "Overclocking" the RAM to 2933 MHz

                                    Overclocking the CPU to 3.7 GHz, VCore 1.25V (N.B. can run prime95 overnight without failure and without the CPU ever exceeding 50 degrees C)

                                     

                                    Interestingly (though perhaps coincidentally), it has always been the compilation task that has segfaulted when I'm doing it with stress-ng running simultaneously. stress-ng always exits with "successful run in XYZ seconds...". For example, here is the most recent output from stress-ng after I had yet again thought I had found a solution but then the dreaded 'internal compiler error' happened after nearly 2 hours of compiling mesa on a loop.

                                     

                                    vladimir ~ # stress-ng --cpu 16 --aggressive --perf
                                    stress-ng: info:  [5653] defaulting to a 86400 second run per stressor
                                    stress-ng: info:  [5653] dispatching hogs: 16 cpu
                                    stress-ng: info:  [5653] cache allocate: default cache size: 8192K
                                    ^Cstress-ng: info:  [5653] successful run completed in 6413.21s (1 hour, 46 mins, 53.21 secs)
                                    stress-ng: info:  [5653] cpu:
                                    stress-ng: info:  [5653]        185,561,835,725,520 CPU Cycles                    28.93 B/sec
                                    stress-ng: info:  [5653]        206,017,728,042,416 Instructions                  32.12 B/sec (1.110 instr. per cycle)
                                    stress-ng: info:  [5653]                          0 Cache References               0.00 sec
                                    stress-ng: info:  [5653]                          0 Cache Misses                   0.00 sec
                                    stress-ng: info:  [5653]         14,861,412,201,136 Stalled Cycles Frontend        2.32 B/sec
                                    stress-ng: info:  [5653]         27,660,103,488,112 Stalled Cycles Backend         4.31 B/sec
                                    stress-ng: info:  [5653]         40,164,506,861,184 Branch Instructions            6.26 B/sec
                                    stress-ng: info:  [5653]            284,832,819,648 Branch Misses                 44.41 M/sec ( 0.71%)
                                    stress-ng: info:  [5653]                     11,616 Page Faults Minor              1.81 sec
                                    stress-ng: info:  [5653]                          0 Page Faults Major              0.00 sec
                                    stress-ng: info:  [5653]                 26,505,696 Context Switches               4.13 K/sec
                                    stress-ng: info:  [5653]                 19,112,608 CPU Migrations                 2.98 K/sec
                                    stress-ng: info:  [5653]                          0 Alignment Faults               0.00 sec
                                    stress-ng: info:  [5653]                     11,616 Page Faults User               1.81 sec
                                    stress-ng: info:  [5653]                          0 Page Faults Kernel             0.00 sec
                                    stress-ng: info:  [5653]                        976 System Call Enter              0.15 sec
                                    stress-ng: info:  [5653]                        960 System Call Exit               0.15 sec
                                    stress-ng: info:  [5653]                          0 TLB Flushes                    0.00 sec
                                    stress-ng: info:  [5653]                    387,792 Kmalloc                       60.47 sec
                                    stress-ng: info:  [5653]                     14,768 Kmalloc Node                   2.30 sec
                                    stress-ng: info:  [5653]                 10,343,472 Kfree                          1.61 K/sec
                                    stress-ng: info:  [5653]                     70,048 Kmem Cache Alloc              10.92 sec
                                    stress-ng: info:  [5653]                     14,784 Kmem Cache Alloc Node          2.31 sec
                                    stress-ng: info:  [5653]                 78,082,688 Kmem Cache Free               12.18 K/sec
                                    stress-ng: info:  [5653]                     11,312 MM Page Alloc                  1.76 sec
                                    stress-ng: info:  [5653]                  1,319,952 MM Page Free                 205.82 sec
                                    stress-ng: info:  [5653]                248,618,560 RCU Utilization               38.77 K/sec
                                    stress-ng: info:  [5653]                    614,288 Sched Migrate Task            95.78 sec
                                    stress-ng: info:  [5653]                          0 Sched Move NUMA                0.00 sec
                                    stress-ng: info:  [5653]                 18,220,864 Sched Wakeup                   2.84 K/sec
                                    stress-ng: info:  [5653]                     19,744 Signal Generate                3.08 sec
                                    stress-ng: info:  [5653]                         16 Signal Deliver                 0.00 sec
                                    stress-ng: info:  [5653]                    961,600 IRQ Entry                    149.94 sec
                                    stress-ng: info:  [5653]                    961,600 IRQ Exit                     149.94 sec
                                    stress-ng: info:  [5653]                 83,290,240 Soft IRQ Entry                12.99 K/sec
                                    stress-ng: info:  [5653]                 83,290,240 Soft IRQ Exit                 12.99 K/sec
                                    stress-ng: info:  [5653]                          0 Writeback Dirty Inode          0.00 sec
                                    stress-ng: info:  [5653]                          0 Writeback Dirty Page           0.00 sec
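The derived figures in this output (instructions per cycle, branch miss rate) can be cross-checked from the raw counters. The sketch below parses a few lines copied from the output above; the regular expression and the helper name `parse_perf_counters` are assumptions made for this illustration, not part of stress-ng.

```python
import re

# Matches "[pid]   <count> <counter name>   <rate> ..." lines from --perf:
# a comma-grouped count, one space, the name, then 2+ spaces before the rate.
PERF_LINE = re.compile(r"\]\s+([\d,]+)\s(.+?)\s{2,}\d")


def parse_perf_counters(text):
    """Return {counter name: integer count} for every perf counter line."""
    counters = {}
    for line in text.splitlines():
        match = PERF_LINE.search(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


SAMPLE = """\
stress-ng: info:  [5653] dispatching hogs: 16 cpu
stress-ng: info:  [5653]        185,561,835,725,520 CPU Cycles                    28.93 B/sec
stress-ng: info:  [5653]        206,017,728,042,416 Instructions                  32.12 B/sec (1.110 instr. per cycle)
stress-ng: info:  [5653]         40,164,506,861,184 Branch Instructions            6.26 B/sec
stress-ng: info:  [5653]            284,832,819,648 Branch Misses                 44.41 M/sec ( 0.71%)
"""

counters = parse_perf_counters(SAMPLE)
ipc = counters["Instructions"] / counters["CPU Cycles"]
miss_pct = 100 * counters["Branch Misses"] / counters["Branch Instructions"]
# stress-ng itself reports 1.110 instr. per cycle and 0.71% for these lines.
print(f"IPC {ipc:.3f}, branch miss rate {miss_pct:.2f}%")
```

The recomputed values agree with what stress-ng printed, which is consistent with the stressor itself having run cleanly while the compile job failed.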