Days are passing and still no answer. The discovery that disabling ASLR alleviates, and for the majority of users almost entirely removes, the segfaults should be both evidence that the problem is real and a starting point for solving it. It is not a fix, just a temporary workaround. AMD, what is the real problem? Can we hope for a fix in Ryzen (1)?
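For anyone who wants to check whether ASLR is currently active before trying the workaround, here is a minimal sketch. It assumes a Linux system exposing the standard `/proc/sys/kernel/randomize_va_space` sysctl; the helper name `describe_aslr` is made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl file holding the system-wide ASLR mode.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the documented values of kernel.randomize_va_space.
MODES = {
    0: "disabled",
    1: "partial (stack, mmap base and vDSO randomized)",
    2: "full (as mode 1, plus heap/brk randomized)",
}


def describe_aslr(raw: str) -> str:
    """Translate the raw sysctl content into a human-readable description."""
    value = int(raw.strip())
    return MODES.get(value, f"unknown mode {value}")


if __name__ == "__main__":
    if ASLR_SYSCTL.exists():
        print("ASLR:", describe_aslr(ASLR_SYSCTL.read_text()))
    else:
        print("ASLR sysctl not found (not a Linux system?)")
```

To test the workaround on a single build rather than system-wide, `setarch "$(uname -m)" -R <command>` runs one command with address randomization disabled, while `sysctl kernel.randomize_va_space=0` (as root) turns it off for everything until reboot. Either way this only hides the symptom and weakens security.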
@AMD: Please check the Gentoo forums, where a technical discussion about this issue is going on:
Edit: Quote from that post: All this points to a possible bug in Ryzen's micro-op cache perhaps triggered by "CMP/TEST conditional jump" instruction fusion μops.
The message linked below contains very interesting pointers:
It seems DragonFly BSD has a patch to work around what looks like a hardware bug on Ryzen (which, in theory, AMD is already aware of).
I may have isolated my issue to the motherboard rather than the CPU itself. I got my hands on a Ryzen 5 1600X and ASRock Taichi x370 MB (BIOS P1.60 - pre-AGESA-220.127.116.11a) so I was able to do some mixing and matching of components.
I swapped out my R7-1700 for the R5-1600X and was able to reproduce the compiler segfaults at default settings very rapidly.
I then swapped out my motherboard (MSI x370 Gaming Pro Carbon) for the ASRock x370 Taichi and installed the R5-1600X in that board. I loaded the BIOS defaults and ran my tests. The system was significantly more stable while compiling, but it did 'hard' lockup on one overnight run (after 65 loops of compiling mesa-17.0) - i.e. the entire system froze and was unresponsive to the point that I had to power it off.
Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 18.104.22.168a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup.
So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups.
As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 22.214.171.124a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes.
I would suggest waiting for an AGESA 1006-based BIOS before returning your GPC. I have a suspicion that this is at least partly caused by power/voltage issues (controlled by the BIOS, and thus amenable to fixing via BIOS updates), and final AGESA 1006 BIOSes have been released for most motherboards in the past 48 hours.
Microcode updates only last as long as the CPU is powered on.
So, they have to be applied each time you boot the system.
Usually the BIOS, or the OS in its early boot stages, is responsible for doing that.
Check: Microcode - Debian Wiki
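Since the update is reapplied at every boot, you can check which revision is actually loaded on the running system. A minimal sketch, assuming x86 Linux where `/proc/cpuinfo` carries a `microcode` line for each logical CPU; the function name `microcode_revisions` is hypothetical.

```python
from typing import Set


def microcode_revisions(cpuinfo_text: str) -> Set[str]:
    """Collect the distinct microcode revisions reported in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "microcode : 0x..." are of interest.
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            found = microcode_revisions(f.read())
    except OSError:
        found = set()
    # More than one entry would mean the cores are not all on the same revision.
    print(sorted(found) if found else "no microcode line found")
```

Comparing the value before and after a BIOS update (or after installing the distribution's microcode package) shows whether the new revision was really applied. Booting the old BIOS without the OS-side package should bring back the old revision.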
I'm glad you were able to figure it out - or at least come up with a more stable solution. Keep us posted if anything else happens.
"As a side note, if you are wondering why I didn't try the R5 with the new BIOS with the 126.96.36.199a microcode update to see if it would fix the hard lock-up issue I encountered, it's because I don't know enough about microcode updates in processors to know whether I could reverse that update. Since the R5 is not mine, I did not want to make any irreversible changes"
As for the OS-side microcode updates, those are generally stored in /etc/firmware; besides that, support for the various processors in the same line is grouped together in that .dat file.
> a. I'll go back to linux and will gather tracing information on SEGV,
> like accessed addresses (both virtual one and physical one), and
> which instruction was executing and so on.
I did the above mentioned investigation and got some more information from other Ryzen users.
Here is the summary (details are below).
- The prime suspect is still Ryzen.
- The SEGVs that gcc (cc1) gets are caused by General Protection Faults of unknown cause (the error code is 0).
- The GPFs happened not on one specific CPU but on many random CPUs.
- In one case, successive GPFs occurred on the same CPU within a very short period.
- The IPs at the time of the SEGVs point to a variety of instructions such as mov, test, jmp, and so on,
and they fall within several narrow memory regions.
- The following two known explanations can't account for all the SEGV cases.
a. Small regions of dense test/jmp instructions hitting the uop cache
=> Someone reported that the SEGVs still happened after disabling the uop cache
b. iretq under SMT destabilizing the CPU
=> Disabling SMT didn't fix the problem, at least in my case
Again, I ask AMD to tell us the current status of this problem, taking into account
the information that users, including me, have provided here. Is it a Ryzen problem
or not? Can you provide a proper workaround and/or a new AGESA that fixes it?
If you need help, I can provide whatever information you need.
I've investigated this issue in detail because the problem is critical for me and AMD
has shown almost nothing about its internal progress on it. If you said "Yes,
it's a CPU problem and we know the root cause," my work would be unnecessary and I could
stop, which would be the best outcome for me. Unfortunately, that is not the case.
Here is what I've done after the last post.
1. Which event caused the SEGVs?
All of them are General Protection Faults. This can be confirmed with the following
official kernel tracer.
A shell script that takes the kernel's backtrace on SEGV:
2. What kind of GPFs?
- All of them have an unknown cause (the error code is 0).
- They happened not on one specific CPU only, but on many CPUs.
- At "3289.263026" and "3289.455664", there are two successive GPFs on the same CPU but
in different processes. They appeared as two successive SEGVs in the reproducer's log.
It may be that the first GPF, or some other event, destabilizes the CPU for a short period.
- The IPs at which the GPFs occurred fall within several narrow memory regions. Some of them
had exactly the same address.
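The "narrow memory regions" observation can be checked mechanically on any kernel log. The following is only a minimal sketch, not the tracer or patch used for the results above. It assumes the kernel logs unhandled GPFs in the form `traps: comm[pid] general protection ip:... sp:... error:...`, and the sample lines, the function names and the 4 KiB `max_gap` threshold are all made up for illustration.

```python
import re
from typing import Iterable, List

# Assumed log format for unhandled user-space GPFs on x86 Linux.
GPF_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] general protection(?: fault)? "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
)


def parse_gpf_lines(log_text: str) -> List[dict]:
    """Extract process name, pid, instruction pointer and error code per GPF."""
    records = []
    for line in log_text.splitlines():
        match = GPF_RE.search(line)
        if match:
            records.append({
                "comm": match["comm"],
                "pid": int(match["pid"]),
                "ip": int(match["ip"], 16),
                "error": int(match["error"], 16),
            })
    return records


def cluster_ips(ips: Iterable[int], max_gap: int = 0x1000) -> List[List[int]]:
    """Group sorted IPs so that neighbours within max_gap bytes share a cluster."""
    clusters: List[List[int]] = []
    for ip in sorted(ips):
        if clusters and ip - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ip)
        else:
            clusters.append([ip])
    return clusters


# Synthetic sample lines, not taken from any real log.
SAMPLE_LOG = """\
[  100.000001] traps: cc1[1001] general protection ip:401000 sp:7ffc00000000 error:0 in cc1[400000+200000]
[  100.000002] traps: cc1[1002] general protection ip:401010 sp:7ffc00001000 error:0 in cc1[400000+200000]
[  200.000000] traps: cc1[1003] general protection ip:900000 sp:7ffc00002000 error:0 in cc1[400000+800000]
"""

if __name__ == "__main__":
    records = parse_gpf_lines(SAMPLE_LOG)
    for cluster in cluster_ips(r["ip"] for r in records):
        print(len(cluster), [hex(ip) for ip in cluster])
```

Feeding it the output of `dmesg` from several affected machines would show whether the faulting IPs keep landing in the same few clusters. Note that with ASLR enabled, only addresses inside non-relocated mappings are directly comparable between runs.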
A kernel patch for Linux v4.11.4 that gathers information about GPFs at the kernel level:
3. What kinds of instructions did the IPs point to at the time of the GPFs, and what
patterns do the instructions near those IPs show?
- There are many kinds of instructions. If I had to pick, test and jmp appeared most often.
- GPFs also occurred while executing instructions that don't touch memory at all, for example jmp.
- There are no clear patterns between those IPs and their neighbours.
The instructions pointed to by the IPs at the time of the GPFs:
Those instructions (marked with '*' at the beginning of the line) and their neighbouring instructions:
* What I will do next
Make a reproducer that is simpler and triggers this problem more quickly.