This workaround doesn't work for me. I set `/proc/sys/kernel/randomize_va_space` to 0 but still got a SEGV in one of two trials while compiling gcc 6.3 with -j16.
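For anyone repeating this test, it may be worth confirming that the setting actually took effect before starting the build. Here is a minimal Python sketch, assuming a Linux system. The helper name `describe_aslr` is made up for illustration, and the meanings of the values 0, 1 and 2 are the ones given in the kernel's sysctl documentation:

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meanings of the sysctl values, per the kernel sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "mmap base, stack and VDSO randomized",
    2: "as 1, plus heap randomized",
}

def describe_aslr(raw: str) -> str:
    """Turn the raw file contents (a number plus newline) into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown value {value}")

if __name__ == "__main__":
    if ASLR_PATH.exists():
        print("ASLR:", describe_aslr(ASLR_PATH.read_text()))
    else:
        print(f"{ASLR_PATH} not found (not a Linux system?)")
```

Note that a value written to this file does not survive a reboot unless it is also set through sysctl configuration, so checking it right before a long test run is cheap insurance.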
Just wanted to chime in that I'm hitting this too on my 1700. Have never overclocked. Using 2 DDR4-3200 DIMMs running at 2100 using the default profile. Once segfaults start happening, the system destabilizes pretty quickly. Haven't had a chance to run memtest overnight yet, but it definitely fits the profile of what folks are seeing here.
I did some extensive testing over the weekend and, for me, disabling ASLR does not resolve the issue. The only workaround in my case is disabling SMT (contrary to my first post, but I think I was dealing with two issues at the beginning: unstable RAM settings and whatever bug is causing these segfaults). I don't know enough about ASLR to say whether this supports or contradicts the issue that Matt described in the Phoronix forum post you linked.
mcl00, I have a similar experience with this. Two identical installations (Ubuntu 16.04.2), same stock kernel (4.4.0-78-generic), EXCEPT one has a custom water loop and the other uses a Noctua air cooler; CPUs are both 1700X, 3200MHz RAM on the latest beta BIOS from Asus (9945 on Crosshair VI Hero) with 1:1 settings.
Any compile with -j16 on the air-cooled system would kernel panic after a few minutes; no issues on the water-cooled system. Added an AIO on the air-cooled system and ran a parallel kernel-source compile on both systems, et voilà: no issues at all.
It is very possible that the temperature difference made a difference; I'm tempted to switch the AIO back out for the Noctua as a test now that the kernels have been recompiled for these systems instead of using a binary/stock kernel, but I doubt I'll find any issues. If I do, I'll post here.
Speaking of extensive testing, here's an extensive summary. To begin with, I upgraded my system from a Core i5-2500K to the following a few weeks ago:
(new) MSI x370 Gaming Pro Carbon, BIOS 7A32v15 (AGESA 126.96.36.199a)
(new) Ryzen 7 1700
(new) Corsair Vengeance LPX DDR4-3200MHz CL16 2x8GB (CMK16GX4M2B3200C16R) - on the QVL of the MB
Corsair RM650x power supply
(new) Corsair H110i AIO CPU watercooler
Samsung 950pro 250GB and Sandisk Ultra 480GB SSDs
Fractal Design R5 case, 2x140mm case fans
Geforce GTX670 PCIe video card
The old system was stable (i.e. no segfaults when compiling anything).
All operating systems and software have been re-installed from scratch with the new system.
Settings/tests done to troubleshoot the Ryzen 1700 system:
Recent tests were all done with BIOS defaults, Boot mode changed to UEFI only as opposed to UEFI+Legacy, Virtualization enabled (amd-v), memory at XMP profile 1 (2933MHz). This setting passes an overnight run of memtest86 and is stable running prime95 w/16 threads on Windows 10, Ubuntu, and Gentoo for multiple hours without failing. With the maximum power use tests in prime95, CPU temperatures never exceeded 47C, MB temp never exceeded 36C in Windows 10 (I can't currently monitor temps in Linux).
My typical setup that regularly generates the segfaults is using Gentoo, gcc6.3 (-O2 -pipe -march=znver1), make -j16 and emerging (compiling and installing) the mesa-17.0 package in a loop. My cooler is set on quiet mode, and case fans are controlled by the MB. With those options, I will generate a segfault (usually in /bin/sh) approximately once every 1-4 loops - in other words, I will successfully compile and install mesa 0-3 times before it segfaults.
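A loop like the one described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the numbers in this post: the function name `loops_until_failure` and the script name in the usage comment are made up, and the emerge command line is an assumption about how one might rebuild mesa on Gentoo.

```python
import subprocess
import sys

def loops_until_failure(cmd, max_loops=1000):
    """Run cmd repeatedly.

    Returns (successful_loops, returncode_of_failure), or
    (max_loops, None) if the command never failed.
    """
    for done in range(max_loops):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return done, result.returncode
    return max_loops, None

if __name__ == "__main__":
    # Hypothetical usage (assumed command line, adjust to your setup):
    #   python loop_build.py emerge --oneshot media-libs/mesa
    if len(sys.argv) > 1:
        ok, rc = loops_until_failure(sys.argv[1:])
        print(f"{ok} successful loops, then exit status {rc}")
    else:
        print("usage: loop_build.py COMMAND [ARGS...]")
```

Since the segfault usually hits a child process such as /bin/sh, the build command itself normally just exits with a non-zero status; the kernel log is still the place to see which process actually crashed.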
The following lists how many loops through compiling and installing mesa were successful before I got a segfault in each of the different scenarios below:
Case and AIO cooler at maximum fan speed: 2 loops
make -j8: 4 loops
RAM at 2133MHz (JEDEC setting, CL15): 1 loop
RAM at 1866MHz, 1.2V, CL18: 2 loops
RAM at 2133MHz, AMD Cool'n'Quiet disabled: 3 loops
as above, plus LLC set to mode 2 for CPU and NB voltage: 2 loops
as above, plus CPU voltage fixed at 1.25V: 3 loops
as above, plus NB voltage fixed at 1.15V: 1 loop
as above, but with LLC set to mode 4 rather than 2 for CPU and NB voltage: 3 loops
Turbo disabled, C6 state disabled, CPU frequency set to 3.2GHz, LLC mode 4, CPU voltage 1.35V, NB voltage 1.15V: 2 loops
as above, but RAM back to XMP profile 1 (2933MHz, 1.35V): 3 loops
Default settings, RAM 2933MHz, ASLR disabled: 5 loops
SMT disabled (RAM 2933MHz) and make -j9: 172+ loops (ran overnight without segfault, killed it manually in the morning).
With SMT disabled (RAM at 2933MHz, everything else default), I recompiled my entire system with gcc set to -O2 -pipe -mtune=generic to eliminate any optimizations for the Zen architecture. There were no segfaults during that time. I then re-enabled SMT and tried again.
make -j16: 7 loops (I started to get excited...)
make -j16, ASLR disabled: 5 loops
Disable 2 cores (3+3): 4 loops
Disable 4 cores (4+0): 6 loops
Testing in Ubuntu 16.04, there were too many dependencies/libraries to install for me to test compiling mesa from source under Ubuntu. That said, with default settings and compiling gcc 5.4 from source (with whatever gcc is installed with apt-get install gcc... I think it's 4.5.3) I also get segfaults after anywhere from 5-15 minutes of compiling.
I know nothing of compiling software in Windows, so I have not been able to test that.
I'm seeing the same. Gigabyte B350 Gaming 3, Ryzen 1700, Linux kernel 4.10.11. Machine check events like the following occur about once per week.
May 29 15:15:23 beast kernel: [1193216.141676] mce: [Hardware Error]: Machine check events logged
May 29 15:15:23 beast kernel: [1193216.141684] [Hardware Error]: Corrected error, no action required.
May 29 15:15:23 beast kernel: [1193216.141689] [Hardware Error]: CPU:9 (17:1:1) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000b0151
May 29 15:15:23 beast kernel: [1193216.141693] , Syndrome: 0x000000004a000000, IPID: 0x000100b000000000
May 29 15:15:23 beast kernel: [1193216.141695] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 11
May 29 15:15:23 beast kernel: [1193216.141696] [Hardware Error]: Instruction Fetch Unit Error: L2 BTB multi-match error.
May 29 15:15:23 beast kernel: [1193216.141698] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
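The flags the kernel prints can be cross-checked against the raw MC1_STATUS value from the log. The sketch below is a rough decoder, not the kernel's own logic: the bit positions and the 6-bit extended error code field at bits 16-21 are assumptions based on the MCE flag definitions in the Linux kernel headers for this CPU family, and the function name `decode_mc_status` is made up.

```python
# Assumed bit positions of the MCi_STATUS flags (from the Linux MCE headers).
FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
}

def decode_mc_status(status: int) -> dict:
    """Split a raw 64-bit status value into flags and error-code fields."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": set_flags,
        # UC clear means the error was corrected ("CE" in the kernel output).
        "corrected": not ((status >> 61) & 1),
        "extended_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }

# Raw value copied from the MC1_STATUS line in the log above.
info = decode_mc_status(0xD8200000000B0151)
print(info["flags"])                # ['Val', 'Over', 'En', 'MiscV', 'SyndV']
print(info["corrected"])            # True
print(info["extended_error_code"])  # 11
```

Under those assumptions the result lines up with the log: a corrected error with the overflow bit set and extended error code 11, which the kernel reports as the L2 BTB multi-match error.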
I ran my reproducer, building the Linux kernel with make -j16, on WSL
and it failed at random. In addition, I heard from someone that NetBSD hit
SEGVs and kernel panics at random under a heavy compilation workload.
He also said that this problem disappeared after disabling ASLR. This suggests
a hardware problem is likely; my guess is that it's a Ryzen problem.
* Detailed Information
Although it didn't die with SEGV, I think that's because of the
difference in the underlying kernel.
* Additional information:
a. I'll go back to linux and will gather tracing information on SEGV,
like accessed addresses (both virtual one and physical one), and
which instruction was executing and so on.
b. My motherboard is ASUS PRIME X370-pro and my BIOS is the newest,
0612, but unfortunately it seems not to contain
the newest 188.8.131.52. My BIOS settings are the defaults with no OC,
and memtest shows no errors.
c. There was a different error message than "fork: Invalid argument",
as follows, while running the same reproducer with
the old BIOS (AGESA 184.108.40.206a)
I have the same problem. Hardware: R5 1600, Asus B350M-A with the latest UEFI and 2x8 GB Samsung memory, 2400 MHz, 17-17-17-39. All settings in UEFI are at their default values; I've only switched on SVM support. OS: Ubuntu 16.04 with kernel 4.11.3 from the kernel-ppa mainline archive (/~kernel-ppa/mainline).
The system is very stable except for random build failures with gcc. For testing I compiled the Linux kernel in a loop:
1. All stock and default: 2 build fails in 59 compilations.
2. ASLR disabled: 0 build fails in 65 compilations. I'll test this more in the future.
3. SMT disabled: 3 build fails in 13 compilations.
4. ASLR and SMT disabled: 0 build fails in 33 compilations.
So disabling ASLR does seem to help and disabling SMT doesn't.
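To get a feel for how strong these counts are, here is a small Python sketch that computes the failure rate of each configuration and, for the clean runs, the chance of seeing zero failures purely by luck if the real failure rate were still the stock one. It rests on a simplifying assumption that every build fails independently with the same probability, which may not hold on real hardware; the names `runs` and `p_zero_failures` are made up.

```python
# (failed builds, total builds) copied from the list above.
runs = {
    "stock": (2, 59),
    "ASLR disabled": (0, 65),
    "SMT disabled": (3, 13),
    "ASLR and SMT disabled": (0, 33),
}

def p_zero_failures(rate: float, builds: int) -> float:
    """Probability of no failure in `builds` independent builds."""
    return (1.0 - rate) ** builds

baseline = runs["stock"][0] / runs["stock"][1]
for name, (fails, builds) in runs.items():
    print(f"{name}: {fails}/{builds} = {fails / builds:.1%} failure rate")
    if fails == 0:
        chance = p_zero_failures(baseline, builds)
        print(f"  chance of a clean run at the stock rate: {chance:.0%}")
```

Under that assumption, 65 clean builds at the stock rate of 2/59 would still happen by chance roughly one time in ten, so more compilations with ASLR disabled would make the result considerably firmer.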
My reproducer still fails at random on WSL with ASLR disabled.
There are many kinds of error messages. Since WSL is a black box to me,
I can't tell why WSL shows these messages where Linux shows
a SEGV message, or why it still failed with ASLR disabled.
NOTE: By looking at /proc/<pid>/maps, I confirmed that the memory map changes
with every process creation when ASLR is enabled (the default), and that it
does not change between process creations when ASLR is disabled.
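The same check can be scripted. The following Python sketch starts a few fresh processes, has each one print its own /proc/self/maps, and compares the address range of the [stack] mapping between them. It is Linux-only, the helper names `region_range` and `child_maps` are made up, and looking only at the stack mapping is a simplification of comparing the full map.

```python
import os
import subprocess
import sys

def region_range(maps_text: str, label: str = "[stack]"):
    """Return (start, end) of the first mapping whose line ends with label."""
    for line in maps_text.splitlines():
        if line.rstrip().endswith(label):
            start, end = line.split()[0].split("-")
            return int(start, 16), int(end, 16)
    return None

def child_maps() -> str:
    """Start a fresh Python process and capture its /proc/self/maps."""
    code = "print(open('/proc/self/maps').read())"
    done = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, check=True)
    return done.stdout

if __name__ == "__main__":
    if os.path.exists("/proc/self/maps"):
        ranges = {region_range(child_maps()) for _ in range(5)}
        print(f"{len(ranges)} distinct [stack] range(s) across 5 processes")
    else:
        print("/proc/self/maps not available (not a Linux system?)")
```

With ASLR enabled one would expect several distinct ranges, and with it disabled a single one, matching the observation in the note above.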
I'm running a Ryzen R7 1800X on an Asus X370 Pro with 32GB 2400 MHz RAM, BIOS 0612, no overclocking at all.
My system is a "silent" one, completely passively cooled - see "Leise PC - PC Silentium! AMD - Hauptkomponenten" for reference. Under constant compile load, sensors indicate that the CPU temperature never exceeds 75°C.
I can compile kernel sources with gcc-7.1 "make -j 16" for many hours in a loop, without ever encountering core dumps.
However, I do occasionally encounter segfaults (every few days or so) at the very beginning of any highly parallel process that suddenly utilizes all 16 threads to their max. That only seems to happen when the CPU was idle beforehand; it never happens once the CPU has been under load for at least 10 seconds or so. As if the sudden ramp-up in power use or temperature was the culprit.