992 Replies Latest reply on Aug 22, 2017 1:51 AM by apache14
      • 58. Re: gcc segmentation faults on Ryzen / Linux
        sat

        > a. I'll go back to linux and will gather tracing information on SEGV,

        >     like accessed addresses (both virtual one and physical one), and

        >     which instruction was executing and so on.

         

        I did the investigation mentioned above and got some more information from other Ryzen users.

        Here is the summary (details are below).

         

        - The prime suspect is still Ryzen.
        - The SEGVs that gcc (cc1) gets are caused by General Protection Faults with an unknown cause (error code 0).
        - The GPFs did not happen on one specific CPU, but on many CPUs at random.
        - In one case, GPFs happened in succession on the same CPU within a very short period.
        - The IPs at the time of the SEGVs point to a variety of instructions, such as mov, test, and jmp, and they fall within several narrow memory regions.
        - Neither of the following two known theories can explain all of the SEGV cases.
          a. Small regions of dense test/jmp instructions hit the uop cache
             => Someone said the SEGVs still happened after disabling the uop cache
                https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_random/
          b. iretq under SMT destabilizes the CPU
             => Disabling SMT didn't fix this problem, at least in my case
                Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

         

        Again, I ask AMD to tell us the current status of this problem, taking into account
        the information that users, including me, have provided here. Is it a Ryzen problem
        or not? Can you provide a proper workaround and/or a new AGESA that fixes it?
        If you need help, I can provide whatever information you need.

         

        I've investigated this issue in detail because it is critical for me and AMD has
        shown almost nothing about its internal progress on it. If you say "Yes, it's the
        CPU's problem and we know the root cause," my work becomes unnecessary and I can
        stop, which would be the best outcome for me. Unfortunately, that has not happened.

         

        * Detail

         

        Here is what I've done after the last post.

         

        1. Which event caused SEGV?

         

        All of them are General Protection Faults. This can be confirmed with the
        kernel's built-in tracer, as follows.

         

        Reproducer:

        ryzen-problem-repro2 · GitHub

         

        A shell script that takes the kernel's backtrace on SEGV:

        trace-signal.sh · GitHub

         

        Result:

        ryzen_segv_ftrace_signal_generate · GitHub

         

        2. What kind of GPFs?

         

        - All of them have an unknown cause (error code 0).
        - They happened not on one specific CPU, but on many CPUs.
        - At "3289.263026" and "3289.455664", there are two successive GPFs on the same CPU but in different processes. They appeared as two successive SEGVs in the reproducer's log. It might be that the first GPF, or some other event, destabilizes the CPU for a short period.
        - The IPs at which the GPFs occurred fall within several narrow memory regions. Some of them were exactly the same address.

         

        A kernel patch for Linux v4.11.4 that gathers information about GPFs at the kernel level:

        ryzen-segv-tracer-to-v4.11.4.patch · GitHub

         

        Reproducer's log:

        ryzen-segv-build-log.txt · GitHub

         

        Kernel log:

        https://gist.github.com/satoru-takeuchi/f473f7eba08331387032654ad6f3e4dc
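One way to check the point above about repeated addresses is to tally the faulting IPs from kernel log text. The following is a minimal sketch, assuming the log contains the stock x86 `traps: <comm>[<pid>] general protection ip:<hex> sp:<hex> error:<n> ...` lines rather than the output of the custom tracer patch, whose format is not shown here. The function name `gpf_ip_histogram` is hypothetical.

```shell
# Count how often each faulting instruction pointer appears in kernel log text.
# Assumption: input lines look like the stock x86 message, for example
#   traps: cc1[1234] general protection ip:7f12ab sp:7ffd00 error:0 in cc1[400000+100000]
# Reads the log on stdin and prints "<count> <ip>" lines, most frequent first.
gpf_ip_histogram() {
    grep 'general protection' \
        | sed -n 's/.*[[:space:]]ip:\([0-9a-fA-F]*\).*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

It could be fed with something like `dmesg | gpf_ip_histogram`. An IP with a count above 1 is a candidate for the "exactly the same address" case.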

         

        3. What kinds of instructions did the IPs point to at the time of the GPFs, and
        is there a pattern in the instructions near those IPs?

        - There are many kinds of instructions. If I had to pick, test and jmp appeared most often.
        - It also happened while executing instructions that don't touch memory, for example jmp.
        - There is no clear pattern between those IPs and their neighbours.

         

        The instructions pointed to by the IPs at the time of the GPFs:

        ips-on-segv.txt · GitHub

         

        Those instructions (marked with '*' at the beginning of the line) and their neighbouring instructions:

        ryzen-segv-tracer-log.txt · GitHub

         

        * What I will do next

        Make a reproducer that is simpler and triggers this problem more quickly.

        • 59. Re: gcc segmentation faults on Ryzen / Linux
          whiskey-foxtrot

          Can you post raw output (not using the script) on Git to see POF please?

          • 60. Re: gcc segmentation faults on Ryzen / Linux
            sat

            > Can you post raw output (not using the script) on Git to see POF please?

             

            Let me clarify what you want. My understanding is that you want the log.txt from a
            failed build produced by the following commands under WSL. Is that correct?

             

            $ cd src/linux

            $ make defconfig

            $ make -j16 &>log.txt

             

            And if so, how many of these logs do you want? Is just one OK?

            • 61. Re: gcc segmentation faults on Ryzen / Linux
              yiyihu

              Hi, my configuration is

              Asrock B350 Pro4 With latest BIOS(AGESA 1.0.0.6)

              r7 1700x

              Gskill Ripjaws V 32G (16G x 2) running at 2133

               

              I tried to build Gentoo and hit this bug too. With OpCache enabled, there are sporadic segfaults (this always happens).
              After I disabled OpCache, I compiled the whole of Gentoo 3 times, and there were no more segfaults.
              So I believe this is caused by a bug in OpCache.
              I'm just curious whether this hardware bug can be fixed via a BIOS update. I don't mean a workaround.
              And if OpCache is disabled entirely, how much of a performance impact will there be?

               

              Thanks!

              • 62. Re: gcc segmentation faults on Ryzen / Linux
                whiskey-foxtrot

                Correct - I'm trying to isolate this and run a comparison as I can't replicate it - it's driving me nuts.

                • 63. Re: gcc segmentation faults on Ryzen / Linux
                  whiskey-foxtrot

                  and to clarify - I'm using gcc 7 - haven't tried anything older which most distros still use.

                  • 64. Re: gcc segmentation faults on Ryzen / Linux
                    whiskey-foxtrot

                    which GCC are you building it against?

                     

                    I'm not saying there aren't any CPU issues - every release (both Intel and AMD) has them, every time. It generally takes some time for compilers to build work-arounds as they catch up. The most subtle bugs in compilers can trigger errors not experienced with previous CPUs.

                     

                    I'm still trying to replicate this and so far building kernel 4.11.4 on a loop overnight hasn't generated anything sadly. Next step is to downgrade my build system to use anything older than gcc-7; I might as well since I'm not keeping this installation.

                    • 65. Re: gcc segmentation faults on Ryzen / Linux
                      alfonsor

                      Here I have never had problems with the kernel; I can let the kernel compile in a loop all day long. An easy way to trigger the problem is to start a mesa compilation with -j16 in a loop, in parallel with a gcc compilation (whatever version) with -j16. Sooner or later mesa fails with a segfault in bash (sometimes gcc itself segfaults).

                      This happens with the whole system compiled with gcc 5, 6 or 7, with or without CFLAGS optimizations.

                      • 66. Re: gcc segmentation faults on Ryzen / Linux
                        yiyihu

                        The kernel version I use is

                        Linux localhost 4.11.4-gentoo #2 SMP Thu Jun 8 20:59:54 -00 2017 x86_64 AMD Ryzen 7 1700X Eight-Core Processor AuthenticAMD GNU/Linux

                         

                        The gcc versions I tried are gcc 5.4.0 and gcc 6.3.0. I'm not building the kernel; I just run something like

                        while :; do if ! emerge media-libs/mesa;  then break; fi; done

                        Then I leave the test PC running overnight. With OpCache enabled, the error shows up after a while.

                        This is how I test whether the system is OK.
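A loop like this is easier to compare across machines if it keeps one log per run and reports which run failed. The following is a minimal sketch, not the exact loop used in this post; the function name `build_until_fail`, the log file naming, and the log directory in the usage line are hypothetical.

```shell
# Run a build command repeatedly until it fails, keeping one log per run.
# Usage: build_until_fail MAX_RUNS LOG_DIR COMMAND [ARGS...]
# On the first failing run, prints that run's number and returns 1.
# Returns 0 if all MAX_RUNS runs succeed.
build_until_fail() {
    max_runs=$1
    log_dir=$2
    shift 2
    mkdir -p "$log_dir" || return 2
    run=1
    while [ "$run" -le "$max_runs" ]; do
        # Capture stdout and stderr of this run in its own log file.
        if ! "$@" >"$log_dir/run-$run.log" 2>&1; then
            echo "$run"
            return 1
        fi
        run=$((run + 1))
    done
    return 0
}
```

For example, `build_until_fail 1000 /var/tmp/mesa-logs emerge media-libs/mesa` would stop at the first failed emerge and leave its output in the matching run-N.log file.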

                        After the mesa test loop had run long enough with OpCache disabled, I felt the system might be stable,
                        so I then tried 'emerge -e system && emerge -e world'. The command has finished successfully 2 times so far.

                         

                        With OpCache enabled, I have never seen it pass both 'emerge -e system' and 'emerge -e world', though 'emerge -e system' sometimes finishes.

                         

                        And I have MAKEOPTS='-j 16' in my /etc/portage/make.conf

                        If you need the make.conf, I'll paste it somewhere.

                         

                        Thanks!

                        • 67. Re: gcc segmentation faults on Ryzen / Linux
                          raydude

                          I want to take a different approach.

                          I'm running Gentoo on a Gigabyte mATX mobo, with a Ryzen 5 1600 and Galax DDR4-3600 DRAM. I'm running 1.4 V core, stock cooler with a 750 watt EVGA PSU. I'm running at 3.8 GHz and RAM is running with BIOS default DDR timing at 2933 MHz. I'm running gcc 4.9.4 with no march option.

                           

                          cat /proc/sys/kernel/randomize_va_space
                          2
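For readers comparing configurations, the value printed above can be translated into words. This is a small sketch based on the meanings given in the kernel's sysctl documentation (0 = off, 1 = stack, mmap base and VDSO randomized, 2 = heap randomized as well); the function name `describe_aslr` is hypothetical.

```shell
# Translate a kernel.randomize_va_space value into a short description.
# Returns 1 for values other than 0, 1 or 2.
describe_aslr() {
    case "$1" in
        0) echo "ASLR disabled" ;;
        1) echo "ASLR enabled (stack, mmap base, VDSO)" ;;
        2) echo "ASLR fully enabled (also heap/brk)" ;;
        *) echo "unknown value: $1"; return 1 ;;
    esac
}
```

Usage: `describe_aslr "$(cat /proc/sys/kernel/randomize_va_space)"`.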

                           

                          I emerged gcc-6.3 this week and world last week without issue at -j12. I haven't had any problems.

                          Can someone (with a Gigabyte mobo, preferably) give me a BIOS / CPU / DRAM / Voltage configuration that is known to have problems, and an emerge that will cause the failure? I want to see if I can reproduce it with my rig.

                           

                          Thanks.

                          • 68. Re: gcc segmentation faults on Ryzen / Linux
                            whiskey-foxtrot

                            I used Mesa v11.2.0 (- default source on Xenial/Ubuntu) since 17.1.2 required way too many dependencies I didn't feel like hunting down.

                             

                            I've run 14+ compiles (make clean ; make -j16) using gcc-7 and -j16 without any issues - except for me getting pissed at sloppy errors shown in Mesa itself. I stopped counting, but somewhere around the 18th time I did end up with a segfault. This was after I also started running "stress -c 16"! Without running "stress", nothing happens on this system.

                            Screenshot from 2017-06-10 14-19-26.png

                            with strace:

                            Screenshot from 2017-06-10 14-55-50.png

                            I'm not worried about the temps as the fans barely spun up which only happens around the 50c range. I'll have to find some other way to test this as compiling Mesa with all its errors isn't quite reliable as a test.

                            • 69. Re: gcc segmentation faults on Ryzen / Linux
                              alfonsor

                              So you are probably among the lucky ones without the bug. There are many users with the bug and many users without the bug. The weight of those "manies" I don't know.

                               

                              And that is the real problem: not everybody has the bug. How is it possible? Are only some cpus affected?

                               

                              Arg.

                              • 70. Re: gcc segmentation faults on Ryzen / Linux
                                foppe

                                What would interest me more is if there are people with the motherboards identified thus far who aren't running into issues, because I keep thinking this is more a mobo/bios issue (perhaps related to SoC voltage? Voltage regulation ability of the mosfets? etc.) than a 'CPU' issue as such.

                                • 71. Re: gcc segmentation faults on Ryzen / Linux
                                  raydude

                                  I have a Gigabyte AB350M-D3H-CF. I'm using the BIOS F1 2/20/2017, I believe it's a 1.0.0.4a BIOS.

                                   

                                   

                                  Does anyone with this motherboard have the issue?

                                  • 72. Re: gcc segmentation faults on Ryzen / Linux
                                    whiskey-foxtrot

                                    I don't know if the issue is just isolated to a few or if it's indeed a general issue with the CPU. Like I said, I got mine to crash, but that's only by running the "stress" program at full blast as well. I would like to know as well, but there's so little centralized information available - and what I would like to see is a reporting form on AMD's site with the variables (cpu, mobo, OS, crashes - broken down per error, etc etc) so we can watch for patterns. Right now we're just trying to piece it all together from spread out sources without a baseline/test or avenue for reporting.

                                     

                                    All my other Ryzen systems are pretty much the same except I also have some 1700X floating around; motherboards are all Asus Crosshair VI Hero, all G.Skill RAM and EVGA PSU's.

                                     

                                    To AMD: Please provide a standardized form just for the new CPUs to help narrow these problems down; limit text input and provide as many options as possible that pertain to Ryzen specifically.
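Until such a form exists, reports in this thread could at least share a fixed set of fields. The following is a rough sketch of a collector; the function name `ryzen_report` and the field names are hypothetical, and the motherboard and BIOS strings are passed in by hand because reading them automatically usually needs root-only tools.

```shell
# Print a small key=value report of the variables discussed in this thread.
# Usage: ryzen_report [BOARD] [BIOS]
# Fields that cannot be detected are reported as "unknown".
ryzen_report() {
    board=${1:-unknown}
    bios=${2:-unknown}
    # First "model name" entry from /proc/cpuinfo, if the file exists.
    cpu=$(sed -n 's/^model name[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo 2>/dev/null | head -n 1)
    kernel=$(uname -r 2>/dev/null) || kernel=unknown
    gcc_ver=$(gcc -dumpversion 2>/dev/null) || gcc_ver=unknown
    aslr=$(cat /proc/sys/kernel/randomize_va_space 2>/dev/null) || aslr=unknown
    echo "cpu=${cpu:-unknown}"
    echo "board=$board"
    echo "bios=$bios"
    echo "kernel=${kernel:-unknown}"
    echo "gcc=${gcc_ver:-unknown}"
    echo "aslr=${aslr:-unknown}"
}
```

For example, `ryzen_report "Asus Crosshair VI Hero" "AGESA 1.0.0.6"` prints six lines that can be pasted into a post together with the failing command and the number of runs before the first segfault.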
