1,862 Replies. Latest reply on Dec 12, 2017 9:29 AM by jc_yang
      • 1,125. Re: gcc segmentation faults on Ryzen / Linux
        chithanh

        runningman wrote:

         

        In case some can't see this: This is a hardware defect. It won't be fixed by a software update.

         

        These CPUs are broken. They have flaws which are subtle enough to allow the CPUs to pass QA, but they're not reliable. AMD doesn't have a clue if they're still manufacturing such broken CPUs.

        This is pure FUD which you are spewing here. Hardware defects are worked around in microcode and operating systems all the time (just recently Intel provided a microcode update for the Skylake HT bug). And AMD can test for this problem now, so will not ship any more broken CPUs.

         

        The CPUs which are currently in the sales channel though may be affected, so beware of buying Ryzens unless you are prepared to deal with AMD's RMA department.

        • 1,126. Re: gcc segmentation faults on Ryzen / Linux
          chithanh

          If it is a HW bug, AMD will have to recall all the CPUs, because once this bug is determined, it will be exploitable. Any machine using an affected Ryzen chip will be vulnerable and insecure, because this bug causes any software to behave like it has many buffer/stack overflows, it dereferences and/or executes data. No matter how good your software - on a Ryzen it gets all the worst security bugs.

          In order to trigger such a bug, you will have to perform a very specific set of computations. In order to exploit it, you will have to get control over the execution of a process. An added difficulty is that usually it is not the gcc process itself that segfaults, but one of the callers/callees (bash, as, etc.). I think it is not at all comparable to a buffer overflow.

           

          AMD may have to recall the CPUs anyway (and they should have recalled at least the potentially affected ones in the sales channel), but calling this a security bug and likening it to a buffer overflow is nonsense.

          • 1,127. Re: gcc segmentation faults on Ryzen / Linux
            runningman

            chithanh wrote:

             

            If it is a HW bug, AMD will have to recall all the CPUs, because once this bug is determined, it will be exploitable. Any machine using an affected Ryzen chip will be vulnerable and insecure, because this bug causes any software to behave like it has many buffer/stack overflows, it dereferences and/or executes data. No matter how good your software - on a Ryzen it gets all the worst security bugs.

            In order to trigger such a bug, you will have to perform a very specific set of computations. In order to exploit it, you will have to get control over the execution of a process. An added difficulty is that usually it is not the gcc process itself that segfaults, but one of the callers/callees (bash, as, etc.). I think it is not at all comparable to a buffer overflow.

             

            AMD may have to recall the CPUs anyway (and they should have recalled at least the potentially affected ones in the sales channel), but calling this a security bug and likening it to a buffer overflow is nonsense.

            Some people have unstable computers which freeze or hang. Are you an AMD employee?

            • 1,128. Re: gcc segmentation faults on Ryzen / Linux
              constantinx

              runningman

              Stop spewing hate. It's obvious you're clueless about computers.

              I won't go through the amount of nonsense that exists in your messages. Just stop!

              • 1,129. Re: gcc segmentation faults on Ryzen / Linux
                xtronom

                chithanh It's not really that difficult. The trigger will be known eventually, so it will be very easy to reproduce this problem. But even with the software we currently have, it is probably doable. It is sufficient for many exploits that a bug is triggered eventually.

                 

                A workable exploit could be constructed with libxml. It experiences crashes when processing XML documents. The attacker only needs to construct an XML file which gets processed by the victim. A typical error which happens in libxml is that a raw pointer value (not the memory it points to) gets a value from the XML document (e.g. node name), which is then dereferenced. This is not a libxml bug.

                 

                I even created a program which crashes and doesn't do much: allocates some memory and sorts a list. This problem is not bound to GCC, it affects everything.
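A minimal sketch of the kind of test described above (allocate some memory, sort a list), written in Python. This is not the program mentioned in the post; the names `workload` and `count_mismatches`, the seed and the list size are all assumptions made for illustration. The input is fixed by the seed, so on healthy hardware every run should produce the same digest; a differing digest or a crash would point at the machine rather than the code.

```python
import hashlib
import random


def workload(seed: int, n: int = 50_000) -> str:
    """Allocate a list of pseudo-random integers, sort it, and return a digest."""
    rng = random.Random(seed)  # fixed seed, so every run sees the same input
    data = [rng.getrandbits(64) for _ in range(n)]
    data.sort()
    h = hashlib.sha256()
    for value in data:
        h.update(value.to_bytes(8, "little"))
    return h.hexdigest()


def count_mismatches(runs: int = 20, seed: int = 1234, n: int = 50_000) -> int:
    """Repeat the workload and count runs whose digest differs from the first."""
    reference = workload(seed, n)
    return sum(1 for _ in range(runs - 1) if workload(seed, n) != reference)


if __name__ == "__main__":
    print(f"{count_mismatches()} mismatching runs")
```

A single process like this only checks for silently corrupted results; it does not by itself create the heavy multi-process load under which the segfaults were reported.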

                • 1,130. Re: gcc segmentation faults on Ryzen / Linux
                  alonzotg

                  Okay, here's my theory in more formal terms. First, you must understand that there are both "full fledged processes" (FFPs) and Light Weight Processes (aka threads...). To switch between threads the only thing you need to do is swap between different register files. This is the preferred method for high performance because it saves a lot of overhead.

                   

                  FFPs, on the other hand, are basically different virtual machines. That means that the operating system must not only swap out the registers, but must also swap out the address translation tables. Today that means page tables; historically it would also have meant segment registers, but those have been deprecated in the current architecture. It also means that all caching that occurs subsequent to the address translation tables must be invalidated, because it refers to values stored in a different virtual machine.

                   

                  From this and the problems that have been reported, we can assume that the steps to reproduce are as follows:

                  -> Create a large number of processes (i.e. #hyperthreads + 1) such that there is contention for the number of active register files (threads) on the processor.

                  -> Make system calls, such as loading header files, or do something else that causes the operating system to suspend these processes frequently.

                  -> Due to contention, the operating system will frequently try to replace the suspended process with other processes in the pool that have different address space mappings.

                  -> The newly awakened process must then hit a code path, or place a certain instruction in a certain pipeline slot, that was not correctly flushed during task switching. The processor's core logic then sends an invalid memory request to the address translation logic/memory management unit, which triggers a fault. It is also possible that it happens the other way around: the core works correctly, but the address translation logic is stale, so that even a valid request triggers a fault.
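The first and third steps above (more processes than hardware threads, so the scheduler keeps switching between address spaces) can be sketched roughly as follows. This is only an illustration of the theory, not a known reproducer: `WORKER`, `run_oversubscribed` and `summarize` are hypothetical names, and the workload (sorting a seeded list) is an arbitrary choice. Each child is a separate process with its own address space, and all children compute the same checksum, so a crash or a differing checksum would stand out.

```python
import os
import subprocess
import sys

# Source run in each child process: sort a seeded list and print a checksum.
WORKER = r"""
import hashlib, random, sys
rng = random.Random(int(sys.argv[1]))
data = sorted(rng.getrandbits(64) for _ in range(int(sys.argv[2])))
print(hashlib.sha256(repr(data).encode()).hexdigest())
"""


def run_oversubscribed(extra: int = 1, seed: int = 42, n: int = 20_000):
    """Start (logical CPUs + extra) identical workers and collect their results."""
    count = (os.cpu_count() or 1) + extra
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(seed), str(n)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        # On POSIX a negative return code means the child was killed by a
        # signal, e.g. -11 for SIGSEGV.
        results.append((proc.returncode, out.strip()))
    return results


def summarize(results):
    """Return (abnormal exits, distinct checksums among clean exits)."""
    crashed = sum(1 for code, _ in results if code != 0)
    digests = {digest for code, digest in results if code == 0}
    return crashed, len(digests)
```

On a machine without the problem, `summarize(run_oversubscribed())` should give `(0, 1)`: no abnormal exits and a single checksum shared by all workers. The second step (frequent system calls) is only loosely covered here by process start-up and output.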

                   

                  Oh well, nobody's paid me to program in many years so what do I know?

                  • 1,131. Re: gcc segmentation faults on Ryzen / Linux
                    chithanh

                    runningman wrote:

                     

                    AMD may have to recall the CPUs anyway (and they should have recalled at least the potentially affected ones in the sales channel), but calling this a security bug and likening it to a buffer overflow is nonsense.

                    Some people have unstable computers which freeze or hang.

                    The freeze or hang problems are by all indications unrelated to the segfault bug. For the former, it is not even clear that it is a hardware issue. And even if it were, it is still possible that PSU or other hardware is at fault instead of the CPU.

                    runningman wrote:

                     

                    Are you an AMD employee?

                    I am a Gentoo developer.

                    I do not work for AMD.

                    • 1,132. Re: gcc segmentation faults on Ryzen / Linux
                      chithanh

                      xtronom wrote:

                       

                      chithanh It's not really that difficult. The trigger will be known eventually, so it will be very easy to reproduce this problem. But even with the software we currently have, it is probably doable. It is sufficient for many exploits that a bug is triggered eventually.

                      If you look at actual exploits of this kind of CPU bug, I'd contest the statements "not really that difficult" and "probably".

                      A workable exploit could be constructed with libxml. It experiences crashes when processing XML documents.

                      How? Where? And is that confirmed to be the same problem as the gcc segfaults?

                      • 1,133. Re: gcc segmentation faults on Ryzen / Linux
                        xtronom

                        chithanh

                        Your "this kind of CPU bug" is unfounded too. The bug behaves like a SW bug (read this thread from the start), therefore I'm making assumptions based on SW exploits, which is completely valid, because it is the SW that gets exploited not the CPU. CPU only helps.

                         

                        Libxml was reported here, around page 20-30 I think. It is the same problem. You don't see gcc actually crash many times during a build. Mostly you see bash crash, usually while invoking libtool. If you analyse core dumps from these crashes and compare them to libxml (or any other) crashes, you see it's the same bug. And it's not random nor garbage nor a simple one-bit-off error.

                         

                        Moreover if you analyse core dumps from crashes on a different HW, you see identical things. And this behaviour is reliable and very stable.

                        • 1,134. Re: gcc segmentation faults on Ryzen / Linux
                          raydude

                          There seems to be a lot of confusion in this thread about the nature of the issue with the Ryzen CPUs.

                           

                          I'm a hardware engineer with 25 years of experience. I've designed everything from boards to chips to power supplies.

                           

                          There are two primary issues:

                          A. Under very heavy load, addresses get fouled up and bad data is read from cache. Sometimes the data is actual data, sometimes it is instructions. We know this because our crashes are sometimes bad data (I had a bash counter change from 4 to 0 without a crash) and sometimes we get illegal instructions, as shown in dmesg quite often. These failures are often off the 64-bit boundary that instructions are supposed to be aligned to.

                           

                          B. Under light load the system will sometimes spontaneously reboot. I have never seen this issue and my RMA does not have it, at least with my Gigabyte motherboard. This CStates issue is still open and I hope AMD is reproducing it.
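For issue A, the two kinds of dmesg entries (segfaults and illegal instructions) can be tallied per process with a short script. The sketch below is an assumption-laden illustration: the regular expressions follow the general shape of x86-64 Linux kernel messages, whose exact wording varies between kernel versions, and the sample lines are made up rather than taken from anyone's log. `PATTERNS`, `classify` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Patterns for two kinds of kernel log entries. The wording can differ
# between kernel versions, so treat these as assumptions.
PATTERNS = {
    "segfault": re.compile(r"(?P<proc>\S+)\[\d+\]: segfault at "),
    "invalid_opcode": re.compile(r"traps: (?P<proc>\S+)\[\d+\] trap invalid opcode"),
}


def classify(log_text: str) -> Counter:
    """Count (kind, process name) pairs found in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group("proc"))] += 1
                break
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = """\
[ 1234.567890] bash[4242]: segfault at 0 ip 00007f0000001000 sp 00007ffc00000000 error 4 in libc-2.25.so[7f0000000000+1000]
[ 1240.000001] traps: cc1[4300] trap invalid opcode ip:55d000000000 sp:7ffc00000010 error:0 in cc1[55d000000000+1000]
[ 1250.000002] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

if __name__ == "__main__":
    for (kind, proc), n in classify(SAMPLE).items():
        print(kind, proc, n)
```

Feeding it the saved output of `dmesg` from a build run gives a quick view of which processes crashed and how, which makes it easier to compare reports between machines.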

                           

                          Here is the summary of facts, followed by my speculation:

                           

                          1. I personally have never heard of a Ryzen that doesn't have the issue before the 1725 SUS parts. Many people who'd never had the issue came here, downloaded the script, booted Linux and reproduced the issue. The issue has been seen with BSD as well.

                           

                          Speculation: Even if there are parts that don't fail, the failure rate is much higher than AMD will admit. They have been very careful to pin the issue on Linux under heavy load, but I believe that eventually Windows applications that hit all threads hard will show the issue.

                           

                          2. The 1725 SUS and later parts do not have a different microcode or stepping. So the chips haven't changed.

                           

                          Speculation: The only thing that could have changed in the parts that work is the package and the binning. I don't know what they've done to create working silicon, but they have clearly identified the issue and IMO resolved it.

                           

                          3. There is only one person who's had a problem with 1725 SUS or later (it's a 1730 SUS).

                           

                          Speculation: we don't know the conditions for his testing, so we can't rule out that he is overclocking, using a cheap power supply, has installed his heat sink incorrectly, or any other of the innumerable reasons for the CPU to fail. He seriously needs to work with AMD so they can attempt to understand the nature of the failure, and RMA it to get to the bottom of the failure yet again. Us arguing about it is absolutely meaningless.

                           

                          Further opinion: It is a waste of time at this point to try to make your system stable by increasing voltage, disabling SMT or ASLR, etc. If you have the issue, open a case, follow their instructions and then get an RMA. You cannot fix this problem with BIOS changes. IMO AMD cannot fix this problem with microcode.

                           

                          I understand why people are upset. I was too when I found out my $220.00 CPU was borked. But being upset and slinging it around is not productive. Work with support. Understand they are going to be really busy for a few months, and try to be patient.

                           

                          As I said before, we are just seeing the beginning of this fiasco. It's only going to get more complicated from here. If AMD is smart (and I think they are) they are building parts as fast as they can so they can fulfill RMAs as quickly as possible.

                           

                          I think they will eventually satisfy all of us, it's just a matter of time.

                          • 1,135. Re: gcc segmentation faults on Ryzen / Linux
                            oldamdfan

                            I still don't think we have any confirmation that ALL week 25+ CPUs are good.  We only know that the RMA CPUs happen to be week 25+, and that AMD customer support is testing them for >24 hours on an internal testing setup before providing them as RMA replacements.

                             

                            The fact that people's RMAs are taking longer to process as more people become aware of this issue would indicate to me that they are still doing this testing, vs just pulling from a pile of known-good CPUs.

                             

                            Eventually they would get tens of RMAs down the line and decide "this batch is good, no testing is needed"; it does not appear that this has occurred.
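For anyone collecting batch codes, here is a small sketch of the "week 25+" reading used in this thread. It assumes the four digits in a code such as 1725SUS are a two-digit year followed by a production week, which is how the thread interprets them but is not confirmed by AMD here. `BatchCode`, `parse_batch` and `is_week_25_or_later` are hypothetical names.

```python
import re
from typing import NamedTuple


class BatchCode(NamedTuple):
    year: int    # two-digit year, e.g. 17
    week: int    # production week, 1..53
    suffix: str  # trailing letters, e.g. "SUS"


def parse_batch(code: str) -> BatchCode:
    """Parse strings such as '1725SUS' or '1725 SUS' into year, week and suffix."""
    match = re.fullmatch(r"\s*(\d{2})(\d{2})\s*([A-Za-z]+)\s*", code)
    if not match:
        raise ValueError(f"unrecognised batch code: {code!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in {code!r}")
    return BatchCode(year, week, match.group(3).upper())


def is_week_25_or_later(code: str) -> bool:
    """True if the code is dated year 17, week 25 or later."""
    batch = parse_batch(code)
    return (batch.year, batch.week) >= (17, 25)
```

As argued above, a code that passes this check only says when the part was made; it does not show that the part is free of the problem.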

                            • 1,136. Re: gcc segmentation faults on Ryzen / Linux
                              mrs

                              oldamdfan wrote:

                               

                              I still don't think we have any confirmation that ALL week 25+ CPUs are good. We only know that the RMA CPUs happen to be week 25+, and that AMD customer support is testing them for >24 hours on an internal testing setup before providing them as RMA replacements.

                               

                              The fact that people's RMAs are taking longer to process as more people become aware of this issue would indicate to me that they are still doing this testing, vs just pulling from a pile of known-good CPUs.

                               

                              Eventually they would get tens of RMAs down the line and decide "this batch is good, no testing is needed"; it does not appear that this has occurred.

                              Finally, today my replacement arrived: A 1725SUS
                              I will report later how it went. Still have to find time to test it.

                              It came in the original box with a hand-written note attached which said: "Passed"
                              It seems that they opened the box from below...

                              • 1,137. Re: gcc segmentation faults on Ryzen / Linux
                                udamanfunks

                                @mrs

                                 

                                What type is your chip (R7 1700?), and what was the ua code on the one you sent back?

                                 

                                Trying to make sure I keep the following up to date (if something is missing, LMK).

                                 

                                RYZEN SEGV DATA - Google Sheets

                                • 1,138. Re: gcc segmentation faults on Ryzen / Linux
                                  nop

                                  My replacement R7 1700X from the 1725SUS batch has also just arrived.

                                  The retail box also bears a handwritten note carrying the word "Passed" plus my name and "1 2 3 NRD" (in a vertical column) on it.

                                  I'll start testing tomorrow and report back when it makes sense.

                                  • 1,139. Re: gcc segmentation faults on Ryzen / Linux
                                    raydude

                                    oldamdfan wrote:

                                     

                                    I still don't think we have any confirmation that ALL week 25+ CPUs are good. We only know that the RMA CPUs happen to be week 25+, and that AMD customer support is testing them for >24 hours on an internal testing setup before providing them as RMA replacements.

                                     

                                    The fact that people's RMAs are taking longer to process as more people become aware of this issue would indicate to me that they are still doing this testing, vs just pulling from a pile of known-good CPUs.

                                     

                                    Eventually they would get tens of RMAs down the line and decide "this batch is good, no testing is needed"; it does not appear that this has occurred.

                                    Thanks for the reply. Your post and mrs' post make me think that you are right!

                                     

                                    Tech Support might actually be testing parts themselves to find parts that work. I knew this was a possibility, but when someone on Reddit mentioned he got a 1727 SUS from retail that didn't have the problem, I jumped to the conclusion that they had put a fix in at SUS.

                                     

                                    This means that design engineering / process engineering may not yet be involved!

                                     

                                    This means that date code doesn't mean anything.

                                     

                                    Although it does show that there are parts that don't suffer from the problem which is something I strongly doubted up until now.

                                     

                                    You know: I wish they would come clean about this.

                                     

                                    Thanks for reminding me and I'm sorry for my over optimism... Ugh.
