1,895 Replies. Latest reply on Jun 14, 2018 1:36 PM by constantinx
      • 45. Re: gcc segmentation faults on Ryzen / Linux
        skuto

        I'm seeing the same. Gigabyte B350 Gaming 3, Ryzen 1700, Linux kernel 4.10.11. They occur about once per week.

         

        May 29 15:15:23 beast kernel: [1193216.141676] mce: [Hardware Error]: Machine check events logged

        May 29 15:15:23 beast kernel: [1193216.141684] [Hardware Error]: Corrected error, no action required.

        May 29 15:15:23 beast kernel: [1193216.141689] [Hardware Error]: CPU:9 (17:1:1) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000b0151

        May 29 15:15:23 beast kernel: [1193216.141693] , Syndrome: 0x000000004a000000, IPID: 0x000100b000000000

        May 29 15:15:23 beast kernel: [1193216.141695] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 11

        May 29 15:15:23 beast kernel: [1193216.141696] [Hardware Error]: Instruction Fetch Unit Error: L2 BTB multi-match error.

        May 29 15:15:23 beast kernel: [1193216.141698] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
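
For anyone decoding a log like the one above by hand, here is a minimal Python sketch that unpacks the MC1_STATUS value. The bit positions are assumptions based on the generic x86 machine-check status layout plus AMD's newer extensions (SyndV at bit 53, extended error code in bits 21:16); please check them against AMD's processor documentation before relying on them. The function and dictionary names are made up for illustration.

```python
# Hypothetical helper: decode a few fields of a machine-check status value.
# Bit positions are assumed (generic x86 MCA layout plus AMD extensions).
FLAG_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # overflow: an earlier error was overwritten
    "UC": 61,     # uncorrected error (0 means corrected, "CE" in the log)
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # address register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
}


def decode_mca_status(status: int) -> dict:
    """Return the flag bits and error codes packed into a status value."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["ext_error_code"] = (status >> 16) & 0x3F  # assumed bits 21:16
    fields["error_code"] = status & 0xFFFF            # low 16 bits
    return fields


if __name__ == "__main__":
    # Status value copied from the kernel log above.
    for name, value in decode_mca_status(0xD8200000000B0151).items():
        print(f"{name}: {value}")
```

With the value from the log, the extended error code comes out as 11 and UC is clear, which agrees with the kernel's own "Extended Error Code: 11" and "Corrected error" lines.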

        • 46. Re: gcc segmentation faults on Ryzen / Linux
          sat

          I ran my reproducer (building the Linux kernel with make -j16) on WSL,
          and it failed at random. In addition, I heard from one person that NetBSD
          hit SEGVs and kernel panics at random under a heavy compilation workload.
          He also said that the problem disappeared after disabling ASLR. To me this
          makes a hardware problem, presumably in Ryzen itself, more likely.

           

          * Detailed Information

           

          ** reproduce

          ryzen-problem-repro · GitHub

           

          ** log

          ryzen1800x-linux-kernel-make-j16-on-WSL-agesa1.0.0.6 · GitHub

           

          Although it didn't die with SEGV, I think that's because of the
          difference in the underlying kernel.

           

          * Additional information:

          a. I'll go back to linux and will gather tracing information on SEGV,

              like accessed addresses (both virtual one and physical one), and

              which instruction was executing and so on.

          b. My motherboard is ASUS PRIME X370-pro and my BIOS is the newest,
              0612, but unfortunately it doesn't seem to contain the newest
              AGESA 1.0.0.6. My BIOS settings are the defaults with no OC, and
              memtest shows no errors.
          c. While running the same reproducer with the old BIOS
              (AGESA 1.0.0.4a), there was a different error message than
              "fork: Invalid argument", as follows:

           

          ryzen1800x-linux-kernel-make-j16-on-WSL-agesa1.0.0.4a · GitHub

          • 47. Re: gcc segmentation faults on Ryzen / Linux
            atipro

            I have the same problem. Hardware: R5 1600, Asus B350M-A with the latest UEFI and 2x8 GB Samsung memory at 2400 MHz, 17-17-17-39. All UEFI settings are at their default values; I've only switched on SVM support. OS: Ubuntu 16.04 with kernel 4.11.3 from the mainline kernel PPA (Index of /~kernel-ppa/mainline).

             

            The system is very stable except for random build failures with gcc. For testing I compiled the Linux kernel in a loop:

            1. All stock and default: 2 build fails in 59 compilations.

            2. ASLR disabled: 0 build fails in 65 compilations. I'll test this more in the future.

            3. SMT disabled: 3 build fails in 13 compilations.

            4. ASLR and SMT disabled: 0 build fails in 33 compilations.

             

            So disabling ASLR does seem to help and disabling SMT doesn't.
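
As a rough sanity check on these counts (not a rigorous analysis), one can ask how likely a failure-free run would be if disabling ASLR changed nothing. The sketch below assumes each build fails independently at the stock rate of 2 in 59; that independence assumption and the function name are illustrative, not part of the measurements.

```python
# Chance of seeing zero failures in n builds if each build fails
# independently with probability p (an assumed, simplified model).
def prob_zero_failures(p: float, n: int) -> float:
    return (1.0 - p) ** n


stock_rate = 2 / 59                                   # case 1: all stock and default
p_aslr_off = prob_zero_failures(stock_rate, 65)       # case 2: 0 fails in 65
p_both_off = prob_zero_failures(stock_rate, 33)       # case 4: 0 fails in 33
p_combined = prob_zero_failures(stock_rate, 65 + 33)  # cases 2 and 4 together

print(f"stock failure rate:          {stock_rate:.3f}")
print(f"P(0 fails in 65 builds):     {p_aslr_off:.3f}")
print(f"P(0 fails in 33 builds):     {p_both_off:.3f}")
print(f"P(0 fails in 98 builds):     {p_combined:.3f}")
```

Under this model the 65-build run alone would come out clean roughly one time in ten even with no real effect, while the combined 98 clean builds would do so only about 3% of the time, so the two ASLR-off runs together are the more persuasive figure.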

            • 48. Re: gcc segmentation faults on Ryzen / Linux
              sat

              Random failures of my reproducer still happen on WSL with ASLR disabled.

               

              log.txt.bios0612-aslr-off · GitHub

               

              The error messages vary widely. Since WSL is a black box to me,
              I can't tell why these kinds of messages appear on WSL instead of
              the SEGV message seen on Linux, or why it still failed with ASLR disabled.

               

              NOTE: By checking /proc/<pid>/maps, I confirmed that the memory map
              changes on each process creation with ASLR enabled (the default), and
              that it does not change on each process creation with ASLR disabled.
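
The check described in the note can be sketched roughly as follows. This is a minimal Python example, assuming a Linux-style /proc (which WSL also emulates) and a `cat` binary on the PATH; the function names are hypothetical. It launches two short-lived processes and compares where their heap and stack were mapped.

```python
import subprocess
import sys


def mapping_starts(maps_text: str, labels=("[heap]", "[stack]")) -> dict:
    """Return the start address of each labelled mapping in /proc/<pid>/maps text."""
    starts = {}
    for line in maps_text.splitlines():
        parts = line.split()
        # A maps line looks like: start-end perms offset dev inode [pathname]
        if len(parts) >= 6 and parts[5] in labels:
            starts[parts[5]] = int(parts[0].split("-")[0], 16)
    return starts


def sample_child_maps() -> dict:
    # Each `cat` is a fresh process, so it reports its own address space.
    out = subprocess.run(["cat", "/proc/self/maps"], capture_output=True,
                         text=True, check=True).stdout
    return mapping_starts(out)


if __name__ == "__main__" and sys.platform.startswith("linux"):
    # 0 in this sysctl means address space randomization is switched off.
    with open("/proc/sys/kernel/randomize_va_space") as f:
        print("randomize_va_space =", f.read().strip())
    first, second = sample_child_maps(), sample_child_maps()
    for label in first:
        same = first[label] == second.get(label)
        print(f"{label}: {first[label]:#x} vs {second.get(label, 0):#x} "
              f"({'same' if same else 'different'})")
```

With ASLR enabled the two runs should normally report different addresses; with it disabled they should match.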

              • 49. Re: gcc segmentation faults on Ryzen / Linux
                psiedler

                I'm running a Ryzen R7 1800X on an Asus X370 Pro with 32GB 2400 MHz RAM, BIOS 0612, no overclocking at all.

                My system is a "silent" one, completely passively cooled - see "Leise PC - PC Silentium! AMD - Hauptkomponenten " for reference. Under constant compile load, sensors indicate that the CPU temperature never exceeds 75°C.

                I can compile kernel sources with gcc-7.1 "make -j 16" for many hours in a loop, without ever encountering core dumps.

                However, I do encounter segfaults rarely (about once every few days) at the very beginning of any highly parallel process that suddenly drives all 16 threads to their maximum, but that only seems to happen when the CPU was idle beforehand; it never happens once the CPU has been under load for at least 10 seconds or so. It is as if the sudden ramp-up in power use or temperature were the culprit.

                • 50. Re: gcc segmentation faults on Ryzen / Linux
                  alfonsor

                  Days are passing and still no answer. The discovery that disabling ASLR alleviates the segfaults, and almost entirely removes them for the majority of users, should be both evidence that the problem is real and a starting point for solving it; it is not a fix, just a temporary workaround. AMD, what is the real problem? Can we hope for a fix in Ryzen (1)?

                  • 51. Re: gcc segmentation faults on Ryzen / Linux
                    space

                    @AMD: Please check the gentoo forums where a technical discussion is going on about this issue:

                     

                    https://forums.gentoo.org/viewtopic-p-8075980.html#8075980

                     

                    Edit: Quote from that post: All this points to a possible bug in Ryzen's micro-op cache perhaps triggered by "CMP/TEST conditional jump" instruction fusion μops.

                     

                    Best regards,

                     

                       Space

                    • 53. Re: gcc segmentation faults on Ryzen / Linux
                      mcl00

                      I may have isolated my issue to the motherboard rather than the CPU itself. I got my hands on a Ryzen 5 1600X and an ASRock Taichi x370 MB (BIOS P1.60 - pre-AGESA-1.0.0.4a), so I was able to do some mixing and matching of components.

                       

                      I swapped out my R7-1700 for the R5-1600X and was able to reproduce the compiler segfaults at default settings very rapidly.

                       

                      I then swapped out my motherboard (MSI x370 Gaming Pro Carbon) for the ASRock x370 Taichi and installed the R5-1600X in that board. I loaded the BIOS defaults and ran my tests. The system was significantly more stable while compiling, but it did 'hard' lockup on one overnight run (after 65 loops of compiling mesa-17.0) - i.e. the entire system froze and was unresponsive to the point that I had to power it off.

                       

                      Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 1.0.0.4a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup.

                       

                      So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups.

                       

                      As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 1.0.0.4a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes

                      • 54. Re: gcc segmentation faults on Ryzen / Linux
                        foppe

                        Would suggest that you wait for agesa 1006-based bios before returning your GPC. I have a suspicion that this is at least partly caused by power/voltage issues (controlled by the bios, and thus amenable to fixing via bios updates), and agesa 1006 final bioses have been released for most motherboards in the past 48h.

                        • 55. Re: gcc segmentation faults on Ryzen / Linux
                          cl0p3z

                          Microcode updates only last as long as the CPU is powered on.

                          So, they have to be applied each time you boot the system.

                          Usually the BIOS, or the OS in its early boot stages, is responsible for doing that.

                           

                          Check: Microcode - Debian Wiki
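
Because the update has to be re-applied on every boot, the revision that is actually active can be read back from the running system. Here is a minimal Python sketch, assuming a Linux x86 system whose /proc/cpuinfo exposes a `microcode` field for each logical CPU; the function name is hypothetical.

```python
from collections import Counter


def microcode_revisions(cpuinfo_text: str) -> Counter:
    """Count how many logical CPUs report each microcode revision."""
    revisions = Counter()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions[value.strip()] += 1
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            found = microcode_revisions(f.read())
    except OSError:
        found = Counter()  # not a Linux system, or /proc is unavailable
    for revision, count in found.items():
        print(f"{count} logical CPU(s) report microcode {revision}")
```

Comparing the reported revision before and after a BIOS update is one way to see whether the new firmware really loaded different microcode.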

                          • 56. Re: gcc segmentation faults on Ryzen / Linux
                            whiskey-foxtrot

                            I'm glad you were able to figure it out - or at least come up with a more stable solution. Keep us posted if anything else happens.

                             

                            "As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 1.0.0.4a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes"

                             

                            As for the OS-side microcode updates, those are generally stored in /etc/firmware; besides that, support for the various processors in the same line is grouped in that .dat file.

                            • 57. Re: gcc segmentation faults on Ryzen / Linux
                              mrwwhitney

                              The reasoning behind the DragonFly BSD patch is on page 7; see the post by Matt Dillon:

                              Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

                              • 58. Re: gcc segmentation faults on Ryzen / Linux
                                sat

                                > a. I'll go back to linux and will gather tracing information on SEGV,

                                >     like accessed addresses (both virtual one and physical one), and

                                >     which instruction was executing and so on.

                                 

                                I did the investigation mentioned above and got some more information from other Ryzen users.
                                Here is the summary (details are below).

                                 

                                - The prime suspect is still Ryzen.
                                - The SEGVs that gcc (cc1) gets are caused by General Protection Faults of unknown cause (the error code is 0).
                                - The GPFs did not happen on one specific CPU, but on many random CPUs.
                                - In one case, successive GPFs happened on the same CPU within a very short period.
                                - The IPs at SEGV time point to a variety of instructions such as move, test, jmp, and so on, and they
                                  fall within several narrow memory regions.
                                - The following two known theories can't explain all of the SEGV cases.
                                  a. Small regions of dense test/jmp instructions hitting the uop cache
                                     => Someone said the SEGVs still happened after disabling the uop cache:
                                     https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_random/
                                  b. iretq under SMT destabilizing the CPU
                                     => Disabling SMT didn't fix this problem, at least in my case:
                                     Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

                                 

                                Again, I ask AMD to tell us the current status of this problem, taking into account
                                the information that users, including me, have provided here. Is it a Ryzen
                                problem or not? Can you provide a proper workaround and/or a new AGESA
                                that fixes this problem? If you need help, I can provide whatever information you need.

                                 

                                I've investigated this issue in detail because the problem is critical for me and AMD
                                has shown almost nothing about its internal progress on it. If you said "Yes,
                                it's the CPU's problem and we know the root cause," my work would be unnecessary and
                                I could stop, which would be best for me. Unfortunately, that hasn't happened.

                                 

                                * Detail

                                 

                                Here is what I've done after the last post.

                                 

                                1. Which event caused SEGV?

                                 

                                All of them are General Protection Faults. This can be seen with the
                                kernel's official tracer, as follows.

                                 

                                Reproducer:

                                ryzen-problem-repro2 · GitHub

                                 

                                A shell script that takes the kernel's backtrace on SEGV:

                                trace-signal.sh · GitHub

                                 

                                Result:

                                ryzen_segv_ftrace_signal_generate · GitHub

                                 

                                2. What kind of GPFs?

                                 

                                - All of them have an unknown cause (the error code is 0).
                                - They happened not just on one specific CPU, but on many CPUs.
                                - At "3289.263026" and "3289.455664", there are two successive GPFs on the same CPU but
                                  in different processes. They appear as two successive SEGVs in the reproducer's log.
                                  It might be that the first GPF, or some other event, destabilizes the CPU for a short period.
                                - The IPs at the time of the GPFs fall within several narrow memory regions. Some of them
                                  had exactly the same address.
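
The "successive GPFs on the same CPU" check can be sketched as below. This is only an illustration of the idea in Python, not the method used by the kernel patch: the record type, the function name, and the one-second window are assumptions, and apart from the two timestamps quoted above the sample records are placeholders.

```python
from typing import List, NamedTuple, Tuple


class Gpf(NamedTuple):
    timestamp: float  # kernel log timestamp in seconds
    cpu: int
    pid: int


def successive_on_same_cpu(events: List[Gpf], window: float) -> List[Tuple[Gpf, Gpf]]:
    """Return pairs of consecutive GPFs on one CPU at most `window` seconds apart."""
    last_by_cpu = {}
    pairs = []
    for event in sorted(events, key=lambda e: e.timestamp):
        previous = last_by_cpu.get(event.cpu)
        if previous is not None and event.timestamp - previous.timestamp <= window:
            pairs.append((previous, event))
        last_by_cpu[event.cpu] = event
    return pairs


if __name__ == "__main__":
    # The first two timestamps come from the post; the CPU numbers, PIDs and
    # the third record are placeholders, not values from the real kernel log.
    events = [
        Gpf(3289.263026, cpu=0, pid=1001),
        Gpf(3289.455664, cpu=0, pid=1002),
        Gpf(3400.000000, cpu=5, pid=1003),
    ]
    for first, second in successive_on_same_cpu(events, window=1.0):
        gap = second.timestamp - first.timestamp
        print(f"CPU {first.cpu}: pids {first.pid} and {second.pid}, {gap:.6f} s apart")
```

The two quoted timestamps are about 0.19 s apart, so any window wider than that flags them as one burst.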

                                 

                                A kernel patch for linux v4.11.4 which gathers GPFs' information at the kernel level:

                                ryzen-segv-tracer-to-v4.11.4.patch · GitHub

                                 

                                Reproducer's log:

                                ryzen-segv-build-log.txt · GitHub

                                 

                                Kernel log:

                                https://gist.github.com/satoru-takeuchi/f473f7eba08331387032654ad6f3e4dc

                                 

                                3. What kinds of instructions did the IPs point to at GPF time, and is there
                                a pattern in the instructions near those IPs?

                                - There are many kinds of instructions. If I had to pick, test and jmp appeared most often.
                                - It also happened while executing instructions that don't touch any memory, for example jmp.
                                - There are no clear patterns between those IPs and their neighbours.

                                 

                                The instructions pointed to by the IPs at GPF time:

                                ips-on-segv.txt · GitHub

                                 

                                Those instructions (marked with '*' at the beginning of the line) and the neighbouring instructions:

                                ryzen-segv-tracer-log.txt · GitHub

                                 

                                * What I will do next

                                Make a reproducer that is simpler and triggers this problem more quickly.

                                • 59. Re: gcc segmentation faults on Ryzen / Linux
                                  whiskey-foxtrot

                                  Can you post raw output (not using the script) on Git to see POF please?
