alfonsor
Adept II

Re: gcc segmentation faults on Ryzen / Linux

Days are passing and there is still no answer. The discovery that disabling ASLR alleviates the segfaults, and for the majority of users almost entirely removes them, should be both evidence that the problem is real and a starting point for solving it. It is not a fix, just a temporary workaround. AMD, what is the real problem? Can we hope to have a fix in ryzen (1)?
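
For anyone who wants to confirm which ASLR mode a test machine is actually running in before comparing results, here is a minimal sketch. It assumes a Linux system that exposes the setting at /proc/sys/kernel/randomize_va_space; the helper names are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl file for ASLR (assumed to exist on the test machine).
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the values as described in the kernel sysctl documentation.
MODES = {
    0: "disabled",
    1: "mmap base, stack and vDSO randomized",
    2: "full randomization (heap included)",
}


def describe_aslr(raw):
    """Translate the raw sysctl text (e.g. '2\n') into a readable description."""
    value = int(raw.strip())
    return MODES.get(value, f"unknown mode {value}")


def read_aslr(path=ASLR_SYSCTL):
    """Read the current setting from the running system (hypothetical helper)."""
    return describe_aslr(path.read_text())
```

Calling `read_aslr()` on the machine under test reports the current mode. The workaround itself is normally applied as root with `sysctl kernel.randomize_va_space=0`, which lasts only until the next reboot.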

space
Adept I

Re: gcc segmentation faults on Ryzen / Linux

@AMD: Please check the gentoo forums where a technical discussion is going on about this issue:

https://forums.gentoo.org/viewtopic-p-8075980.html#8075980

Edit: Quote from that post: All this points to a possible bug in Ryzen's micro-op cache perhaps triggered by "CMP/TEST conditional jump" instruction fusion μops.

Best regards,

   Space

cl0p3z
Adept I

Re: gcc segmentation faults on Ryzen / Linux

The message linked below contains very interesting pointers:

Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

It seems DragonFly BSD has a patch to work around what looks like a hardware bug on Ryzen (which in theory AMD is already aware of).

mcl00
Adept III

Re: gcc segmentation faults on Ryzen / Linux

I may have isolated my issue to the motherboard rather than the CPU itself. I got my hands on a Ryzen 5 1600X and ASRock Taichi x370 MB (BIOS P1.60 - pre-AGESA-1.0.0.4a) so I was able to do some mixing and matching of components.

I swapped out my R7-1700 for the R5-1600X and was able to reproduce the compiler segfaults at default settings very rapidly.

I then swapped out my motherboard (MSI x370 Gaming Pro Carbon) for the ASRock x370 Taichi and installed the R5-1600X in that board. I loaded the BIOS defaults and ran my tests. The system was significantly more stable while compiling, but it did lock up 'hard' on one overnight run (after 65 loops of compiling mesa-17.0), i.e. the entire system froze and was unresponsive to the point that I had to power it off.

Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 1.0.0.4a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup.
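
For anyone wanting to repeat this kind of overnight test, a loop like the following is one way to count how many builds complete before the first failure. It is only a sketch and not the script used above: `loop_until_failure` is a made-up name, and the build command is whatever you normally use to compile your test project.

```python
import subprocess
import sys


def loop_until_failure(cmd, max_loops):
    """Run cmd repeatedly.

    Returns (completed_loops, returncode_of_failure), or (max_loops, None)
    if every run succeeded.
    """
    for completed in range(max_loops):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            # A compiler that segfaulted normally makes the build exit non-zero.
            return completed, result.returncode
    return max_loops, None


# Illustrative call with a trivial command standing in for a real build:
# loop_until_failure([sys.executable, "-c", "pass"], max_loops=100)
```

A hard lockup will of course never be reported by the script itself, so it helps to also write the loop counter to a file on every iteration.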

So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups.

As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 1.0.0.4a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes.

foppe
Adept I

Re: gcc segmentation faults on Ryzen / Linux

Would suggest that you wait for agesa 1006-based bios before returning your GPC. I have a suspicion that this is at least partly caused by power/voltage issues (controlled by the bios, and thus amenable to fixing via bios updates), and agesa 1006 final bioses have been released for most motherboards in the past 48h.

cl0p3z
Adept I

Re: gcc segmentation faults on Ryzen / Linux

Microcode updates only last as long as the CPU is powered on.

So, they have to be applied each time you boot the system.

Usually the BIOS, or the OS in its early boot stages, is responsible for doing that.

Check: Microcode - Debian Wiki
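
Since the update is reapplied on every boot, it can be useful to check which revision is actually loaded. A minimal sketch, assuming an x86 Linux system where /proc/cpuinfo carries a `microcode` field for each logical CPU; the function name is made up for illustration.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct values of the 'microcode' field in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


# On a real system (assumption: the file exists and is readable):
# with open("/proc/cpuinfo") as f:
#     print(microcode_revisions(f.read()))
```

More than one value in the result would mean that not all cores are running the same revision.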

whiskey-foxtrot
Forerunner

Re: gcc segmentation faults on Ryzen / Linux

I'm glad you were able to figure it out - or at least come up with a more stable solution. Keep us posted if anything else happens.

"As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 1.0.0.4a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes"

As for the OS-side microcode updates, those are generally stored in /etc/firmware; besides that, support for various processors in the same line is grouped in the same .dat file.

mrwwhitney
Adept I

Re: gcc segmentation faults on Ryzen / Linux

The reasoning behind the DragonFly BSD patch is explained on page 7; see the post by Matt Dillon:

Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

sat
Adept III

Re: gcc segmentation faults on Ryzen / Linux

> a. I'll go back to linux and will gather tracing information on SEGV,

>     like accessed addresses (both virtual one and physical one), and

>     which instruction was executing and so on.

I did the above mentioned investigation and got some more information from other Ryzen users.

Here is the summary (details are below).

- The prime suspect is still Ryzen.
- The SEGVs that gcc (cc1) gets are caused by General Protection Faults of unknown cause (the error code is 0).
- The GPFs did not happen on one specific CPU, but on many random CPUs.
- There is a case of successive GPFs on the same CPU within a very short period.
- The IPs at the time of the SEGVs point to a variety of instructions such as mov, test, jmp, and so on, and they happen to reside in several narrow memory regions.
- The following two known theories can't explain all of the SEGV cases.
  a. Small regions of dense test/jmp instructions hit the uop cache
     => Someone said the SEGVs still happened after disabling the uop cache
     https://www.reddit.com/r/programming/comments/6f08mb/compiling_with_ryzen_cpus_on_linux_causing_rand...
  b. iretq under SMT destabilizes the CPU
     => Disabling SMT didn't fix this problem, at least in my case
     Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads - Phoronix Forums

Again, I ask AMD to tell us the current status of this problem, taking into account the information that users, including me, have provided here. Is it a Ryzen problem or not? Can you provide a proper workaround and/or a new AGESA that fixes this problem? If you need some help, I can provide whatever information you need.

I've investigated this issue in detail because this problem is critical for me and AMD has shown almost nothing about the progress of this issue inside AMD. If you said "Yes, it's the CPU's problem and we know the root cause," my work would be unnecessary and I could stop, which would be the best outcome for me. Unfortunately, that is not the case.

* Detail

Here is what I've done after the last post.

1. Which event caused SEGV?

All of them are General Protection Faults. This can be seen with the kernel's official tracer, as follows.

Reproducer:

ryzen-problem-repro2 · GitHub

A shell script that takes the kernel's backtrace on SEGV:

trace-signal.sh · GitHub

Result:

ryzen_segv_ftrace_signal_generate · GitHub
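
To pick the SEGV events out of a trace like this, something along the following lines could be used. It is a sketch only: it assumes the usual text layout of the `signal_generate` trace event (`sig=... errno=... code=... comm=... pid=... grp=... res=...`), and the sample line in the comment is made up, not taken from the result above.

```python
import re

# key=value pairs as printed after "signal_generate:" (assumed layout).
FIELD = re.compile(r"(\w+)=(\S+)")
SIGSEGV = 11  # signal number of SIGSEGV on x86 Linux


def parse_signal_generate(line):
    """Return the event fields as a dict, or None if the line is another event."""
    if "signal_generate:" not in line:
        return None
    _, _, payload = line.partition("signal_generate:")
    fields = dict(FIELD.findall(payload))
    for key in ("sig", "errno", "code", "pid"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


def segv_events(lines):
    """Keep only the events that delivered SIGSEGV."""
    events = (parse_signal_generate(line) for line in lines)
    return [e for e in events if e and e.get("sig") == SIGSEGV]


# Made-up example line:
# "cc1-4321 [005] d... 100.000001: signal_generate: sig=11 errno=0 code=128 comm=cc1 pid=4321 grp=0 res=0"
```

Process names containing spaces would not be parsed correctly by this simple pattern, which is fine for cc1 but worth knowing.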

2. What kind of GPFs?

- All of them have an unknown cause (the error code is 0).
- They happened not just on one specific CPU, but on many CPUs.
- At "3289.263026" and "3289.455664" there are two successive GPFs on the same CPU, in different processes. They appear as two successive SEGVs in the reproducer's log. It might be that the first GPF, or some other event, destabilizes the CPU for a short period.
- The IPs at which the GPFs occurred fall within several narrow memory regions. Some of them are exactly the same address.
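
The two checks described in this list (successive faults on the same CPU, and IPs that fall into narrow regions) can be sketched as below. This is not the analysis I actually ran, just an illustration: the `Fault` record, both function names, the time window and the gap threshold are all assumptions you would tune to your own log.

```python
from collections import namedtuple

# One record per GPF: timestamp in seconds, CPU number, instruction pointer.
Fault = namedtuple("Fault", "time cpu ip")


def successive_on_same_cpu(faults, window):
    """Return pairs of consecutive faults on one CPU less than `window` seconds apart."""
    last_by_cpu = {}
    pairs = []
    for fault in sorted(faults, key=lambda f: f.time):
        prev = last_by_cpu.get(fault.cpu)
        if prev is not None and fault.time - prev.time < window:
            pairs.append((prev, fault))
        last_by_cpu[fault.cpu] = fault
    return pairs


def cluster_ips(ips, max_gap):
    """Group sorted, de-duplicated IPs; a new group starts when the gap exceeds max_gap."""
    clusters = []
    for ip in sorted(set(ips)):
        if clusters and ip - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ip)
        else:
            clusters.append([ip])
    return clusters
```

A small number of clusters compared with the number of faults is what "several narrow memory regions" looks like in this representation.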

A kernel patch for Linux v4.11.4 that gathers information about GPFs at the kernel level:

ryzen-segv-tracer-to-v4.11.4.patch · GitHub

Reproducer's log:

ryzen-segv-build-log.txt · GitHub

Kernel log:

https://gist.github.com/satoru-takeuchi/f473f7eba08331387032654ad6f3e4dc

3. What kinds of instructions did the IPs point to at the time of the GPFs, and is there a pattern in the instructions near those IPs?

- There are many kinds of instructions. If I'm forced to pick, test and jmp appeared many times.
- It also happened while executing instructions that don't touch any memory, for example jmp.
- There are no clear patterns between those IPs and their neighbours.

The instructions that the IPs pointed to at the time of the GPFs:

ips-on-segv.txt · GitHub

Those instructions (marked with '*' at the beginning of the line) and the neighbouring instructions:

ryzen-segv-tracer-log.txt · GitHub

* What I will do next

Make a reproducer that is simpler and can trigger this problem more quickly.

whiskey-foxtrot
Forerunner

Re: gcc segmentation faults on Ryzen / Linux

Can you post raw output (not using the script) on Git to see POF please?