Having random reboot issues with Ryzen 7 1700 when XFR is enabled.
Disabling Turbo and C6, manually setting the frequency to 3.2GHz, enabling LLC and increasing the core voltage to 1.25V seems to work around the issue.
I have been using the "sensors" command with the https://github.com/groeck/nct6775 driver to watch the CPU core voltage on my MSI B350M Mortar motherboard.
With default BIOS settings the core voltage sometimes goes up to 1.35V, but mostly it runs at 1.09V. With those settings no reboots occurred.
After some random changes to the BIOS settings the core voltage stays below 1.19V and never reaches 1.20V. With those settings the problem occurs.
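For anyone who wants to log the readings instead of watching them by eye, here is a minimal Python sketch that pulls the voltage lines out of `sensors` output. The label names ("Vcore", "in1") are assumptions; what the nct6775 driver actually reports depends on the board and on any sensors configuration in use.

```python
import re
import subprocess

# Matches lines such as "Vcore:  +1.09 V  (min = ...)" or "in1:  +856.00 mV".
# Temperature and fan lines do not match because they lack a V/mV unit.
_VOLT_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<volts>\d+(?:\.\d+)?)\s*(?P<unit>m?V)\b"
)

def parse_voltages(sensors_output):
    """Return {label: volts} for every voltage line in `sensors` output."""
    readings = {}
    for line in sensors_output.splitlines():
        match = _VOLT_LINE.match(line.strip())
        if match is None:
            continue
        volts = float(match.group("volts"))
        if match.group("unit") == "mV":
            volts /= 1000.0  # normalise millivolts to volts
        readings[match.group("label").strip()] = volts
    return readings

def read_voltages():
    """Run `sensors` once and parse its output (requires lm-sensors)."""
    out = subprocess.run(["sensors"], capture_output=True, text=True, check=True)
    return parse_voltages(out.stdout)
```

Calling `read_voltages()` in a loop with a short sleep and printing the core rail would show whether the voltage ever drops below the level at which the reboots start.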
I tried the options you listed (no turbo, no C6, 3.2GHz, LLC and increasing the core voltage) and was still able to generate the segfaults while compiling software, so unfortunately that didn't work for me.
Quick update - I think my earlier error with prime95 in Windows may have been unrelated to the current issue. It could have been heat-related, or possibly I was still trying out 3200MHz with my memory. In any case, I have not been able to reproduce errors with prime95 (v29.1). Windows 10 (9hrs), Gentoo (3.5+hrs) and Ubuntu (2+hrs) can all run prime95 to their heart's content without errors. (The numbers in brackets are how long I left it running; in no case did prime95 fail a test.) These prime95 runs were all with the CPU at stock settings and the RAM at 2933MHz, which is the highest 'stable' speed I can get my memory to run at.
No amount of fiddling with BIOS settings has resulted in a stable system for compiling software for me. I can also confirm that the problem is not heat related though as my AIO water cooler arrived yesterday and after installing it my CPU temps have dropped by almost 20C with no discernible impact on stability while compiling.
There is still the possibility that this is caused by a bug in Linux, as I have not 'crashed' Windows in the same way. I would like to eliminate that possibility, but I don't know how to simulate similar CPU activity under Windows. Windows 10 will run prime95, Passmark's benchmark, the free versions of 3Dmark and others with no problems, but these don't really do the same type of work as compiling large software packages under Linux. That workload tends to be heavily multithreaded for 3-4 minutes as a bunch of source files get compiled into object files, followed by a short period (20s-1m) where it's mostly single-threaded but with lots of memory shuffling as the object files get linked, and then it goes back to multithreaded as it compiles the next batch of source files.
I was getting segfaults with bash and glibc, so I recompiled them using the newest versions from Ubuntu zesty (I'm running Ubuntu xenial). The segfaults went away, but the sudden reboots persist unless I raise the voltage.
That means raising the core voltage as well as the NB/SoC voltage.
Run memtest86+ and see whether your memory kit has problems.
My mobo has an option to set a fixed core voltage instead of a dynamic voltage (offset). You should try that option.
As suggested on the Phoronix forum, I tried setting /proc/sys/kernel/randomize_va_space to 0 and it seems to do the trick for me. Please try it.
(Of course, it should be considered just a starting point for investigating the issue.)
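Before and after changing the setting it is worth confirming what the kernel is actually using. The following is a small Python sketch that reads the current value and describes it; the helper names are made up for illustration, and the meanings of 0, 1 and 2 follow the kernel's sysctl documentation.

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meanings per the kernel sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}

def describe_aslr(raw_value):
    """Translate the raw file content (e.g. "2\n") into a description."""
    value = int(raw_value.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")

def current_aslr(path=ASLR_PATH):
    """Read the live setting; works as an ordinary user on Linux."""
    return describe_aslr(Path(path).read_text())
```

Writing 0 to the file needs root, and the value reverts at the next boot unless it is also set through sysctl configuration (kernel.randomize_va_space = 0).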
> As suggested on phoronix forum, I tried /proc/sys/kernel/randomize_va_space set to 0
> and it seems to do the trick for me, please, try it. (Of course it should be considered just as starting point to investigate the iusse)
In my case, this workaround seems to work. Before trying it, I could reproduce this problem at least once per ten Linux kernel builds (make -j16).
However, after applying it, the build succeeded 100 times without a SEGV.
Just a note about overclocking voltages:
MSI X370 SLI Plus BIOS contains a button that overclocks the CPU (Ryzen 5 1600) from 3.2GHz to 3.6GHz and changes the fan envelope. An unexpected issue is that turning the button off does not lower the core voltage (Vcore) back to normal levels, that is back from 1.464 Volt to 1.2 Volt. Normal voltage is restored back to 1.2 Volt by clearing the CMOS.
3.2GHz @ 1.464V is unstable (CPU hits 95℃ and gets automatically throttled from 3.2GHz to about 2.7GHz during stress testing in AIDA64), and 3.2GHz @ 1.2V is stable (max CPU temperature during AIDA64 stress test is 76℃). I didn't test 3.6GHz @ 1.464V, but I would expect the system to be unstable at this voltage as well.
FYI about this workaround.
a) Its effect
> > As suggested on phoronix forum, I tried /proc/sys/kernel/randomize_va_space set to 0
> > and it seems to do the trick for me, please, try it. (Of course it should be considered just as starting point to investigate the iusse)
> In my case, this workaround seems to work. Before trying this, I could reproduce this problem at least once per ten linux kernel build (make -j16).
> However, after trying this, that build worked fine 100 times without SEGV.
Unfortunately, a person on the Phoronix forum said that echo 0 >/proc/sys/kernel/randomize_va_space, which
disables ASLR, did not bypass this problem for them.
b) Its logic
This workaround, disabling ASLR, is based on the following logic from Matt.
If it's correct, this problem should also disappear after disabling SMT.
However, in my case it didn't. I'm asking him how he arrived at that logic.
This workaround doesn't work for me. I set `/proc/sys/kernel/randomize_va_space` to 0 but still received SEGV once out of 2 trials during compiling gcc 6.3 with -j16.
Just wanted to chime in that I'm hitting this too on my 1700. Have never overclocked. Using 2 DDR4-3200 DIMMs running at 2100 using the default profile. Once segfaults start happening the system destabilizes pretty quickly. Haven't had a chance to run memtest overnight yet, but it definitely fits the profile of what folks are seeing here.
I have been doing some extensive testing over the weekend and, for me, disabling ASLR does not resolve the issue. The only workaround in my case is disabling SMT (contrary to my 1st post, but I think I was dealing with two issues at the beginning: unstable RAM settings and whatever this segfault-causing bug is). I don't know enough about ASLR to know whether this supports or contradicts the explanation that Matt described in the Phoronix forum post you linked.
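When comparing results with SMT on and off, it helps to confirm from inside Linux that the BIOS setting really took effect. This is a rough Python sketch that checks the CPU topology files in sysfs: if any core lists more than one hardware thread as a sibling, SMT is active. The function names are illustrative only.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-1" or "0,8" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def smt_active(cpu_root="/sys/devices/system/cpu"):
    """True if any CPU reports more than one thread sibling."""
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for siblings_file in Path(cpu_root).glob(pattern):
        if len(parse_cpu_list(siblings_file.read_text())) > 1:
            return True
    return False
```

On a 1700 with SMT enabled each siblings file should list two CPU ids; with SMT disabled each should list one.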
mcl00, I have a similar experience with this - Two identical installations (Ubuntu 16.04.2), same stock kernel (4.4.0-78-generic) - EXCEPT - 1 has a custom waterloop, the other uses a Noctua air cooler; CPUs are both 1700X, 3200MHz RAM on latest beta BIOS from Asus (9945 on Crosshair VI Hero) with 1:1 settings.
Any compile with -j16 on the air cooled system would kernel panic after a few minutes; no issues on the water cooled system. Added an AIO on the air cooled system and ran a parallel kernel-source compile on both systems, et voilà - no issues at all.
It is very possible that the temperature difference made a difference; I'm tempted to switch the AIO back out for the Noctua as a test now that the kernels have been recompiled for these systems instead of using a binary/stock kernel, but I doubt I'll find any issues. If I do, I'll post here.
Speaking of extensive testing, here's an extensive summary. To begin with, I upgraded my system from a Core i5-2500K to the following a few weeks ago:
(new) MSI x370 Gaming Pro Carbon, BIOS 7A32v15 (AGESA 126.96.36.199a)
(new) Ryzen 7 1700
(new) Corsair Vengeance LPX DDR4-3200MHz CL16 2x8GB (CMK16GX4M2B3200C16R) - on the QVL of the MB
Corsair RM650x power supply
(new) Corsair H110i AIO CPU watercooler
Samsung 950pro 250GB and Sandisk Ultra 480GB SSDs
Fractal Design R5 case, 2x140mm case fans
Geforce GTX670 PCIe video card
The old system was stable (i.e. no segfaults when compiling anything).
All operating systems and software have been re-installed from scratch with the new system.
Settings/tests done to troubleshoot the Ryzen 1700 system:
Recent tests were all done with BIOS defaults, Boot mode changed to UEFI only as opposed to UEFI+Legacy, Virtualization enabled (amd-v), memory at XMP profile 1 (2933MHz). This setting passes an overnight run of memtest86 and is stable running prime95 w/16 threads on Windows 10, Ubuntu, and Gentoo for multiple hours without failing. With the maximum power use tests in prime95, CPU temperatures never exceeded 47C, MB temp never exceeded 36C in Windows 10 (I can't currently monitor temps in Linux).
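On the point about not being able to monitor temperatures in Linux: if the running kernel has a hwmon driver that exposes the CPU sensor (which may not be the case for Ryzen on older kernels, so treat that as an assumption), the readings can be taken straight from sysfs. A minimal Python sketch, with an illustrative function name:

```python
from pathlib import Path

def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/tempN": degrees_C} for every readable temperature input.

    hwmon reports tempN_input in millidegrees Celsius.
    """
    temps = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for temp_file in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
            sensor = temp_file.name[: -len("_input")]
            temps[f"{chip_name}/{sensor}"] = millidegrees / 1000.0
    return temps
```

An empty result means no driver is exposing temperature inputs, not that the hardware lacks sensors.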
My typical setup that regularly generates the segfaults is using Gentoo, gcc6.3 (-O2 -pipe -march=znver1), make -j16 and emerging (compiling and installing) the mesa-17.0 package in a loop. My cooler is set on quiet mode, and case fans are controlled by the MB. With those options, I will generate a segfault (usually in /bin/sh) approximately once every 1-4 loops - in other words, I will successfully compile and install mesa 0-3 times before it segfaults.
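The loop itself is easy to automate. The following Python sketch runs a build command repeatedly and counts how many runs finish cleanly before the first failure. The exact emerge invocation in the usage note is an assumption about how the package would be rebuilt, not the command actually used for these tests.

```python
import subprocess

def loops_before_failure(command, max_loops=1000):
    """Run `command` repeatedly; return how many runs succeeded
    before the first non-zero exit (or max_loops if none failed)."""
    successes = 0
    while successes < max_loops:
        result = subprocess.run(command)
        if result.returncode != 0:
            break
        successes += 1
    return successes
```

For example, `loops_before_failure(["emerge", "--oneshot", "media-libs/mesa"])` would rebuild mesa until something fails. A non-zero exit only says the build failed, so the kernel log still needs checking to confirm that the cause was a segfault and not an ordinary build error.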
The following lists how many loops through compiling and installing mesa were successful before I got a segfault in each of the different scenarios below:
Case and AIO cooler at maximum fan speed: 2 loops
make -j8: 4 loops
RAM at 2133MHz (JEDEC setting, CL15): 1 loop
RAM at 1866MHz, 1.2V, CL18: 2 loops
RAM at 2133MHz, AMD Cool'n'Quiet disabled: 3 loops
as above, plus LLC set to mode 2 for CPU and NB voltage: 2 loops
as above, plus CPU voltage fixed at 1.25V: 3 loops
as above, plus NB voltage fixed at 1.15V: 1 loop
as above, but with LLC set to mode 4 rather than 2 for CPU and NB voltage: 3 loops
Turbo disabled, C6 state disabled, CPU frequency set to 3.2GHz, LLC mode 4, CPU voltage 1.35V, NB voltage 1.15V: 2 loops
as above, but RAM back to XMP profile 1 (2933MHz, 1.35V): 3 loops
Default settings, RAM 2933MHz, ASLR disabled: 5 loops
SMT disabled (RAM 2933MHz) and make -j9: 172+ loops (ran overnight without segfault, killed it manually in the morning).
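To put the SMT-disabled run in perspective, here is a back-of-the-envelope Python calculation. It pools the twelve SMT-enabled scenarios above into one per-loop failure rate and asks how likely 172 clean loops in a row would be at that rate. Pooling different BIOS configurations and treating every loop as independent with the same failure chance are simplifying assumptions, so the result is only a rough indication.

```python
def failure_rate(successes_before_failure):
    """Per-loop failure rate from a list of 'clean loops before a segfault'.

    Each entry contributes its clean loops plus the one loop that failed.
    """
    failed_loops = len(successes_before_failure)
    total_loops = sum(successes_before_failure) + failed_loops
    return failed_loops / total_loops

def chance_of_clean_streak(p_fail, loops):
    """Probability of `loops` consecutive clean loops, assuming independence."""
    return (1.0 - p_fail) ** loops

# Clean loops before the first segfault in the twelve SMT-enabled
# scenarios listed above, in order.
smt_enabled = [2, 4, 1, 2, 3, 2, 3, 1, 3, 2, 3, 5]

p = failure_rate(smt_enabled)
streak = chance_of_clean_streak(p, 172)
print(f"per-loop failure rate: {p:.3f}")
print(f"chance of 172 clean loops at that rate: {streak:.3g}")
```

Under those assumptions roughly one loop in four fails with SMT enabled, and a 172-loop clean streak at that rate would be vanishingly unlikely, which suggests the SMT-disabled result is not just luck.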
With SMT disabled (RAM at 2933MHz, everything else default), I recompiled my entire system with gcc set to -O2 -pipe -mtune=generic to eliminate any optimizations for the Zen architecture. There were no segfaults during that time. I then re-enabled SMT and tried again.
make -j16: 7 loops (I started to get excited...)
make -j16, ASLR disabled: 5 loops
Disable 2 cores (3+3): 4 loops
Disable 4 cores (4+0): 6 loops
Testing in Ubuntu 16.04, there were too many dependencies/libraries to install for me to test compiling mesa from source under Ubuntu. That said, with default settings and compiling gcc 5.4 from source (with whatever gcc is installed with apt-get install gcc... I think it's 4.5.3) I also get segfaults after anywhere from 5-15 minutes of compiling.
I know nothing of compiling software in Windows, so I have not been able to test that.