Having random reboot issues with Ryzen 7 1700 when XFR is enabled.
Disabling Turbo, disabling C6, manually setting the frequency to 3.2GHz, enabling LLC and increasing the core voltage to 1.25V seems to work around the issue.
I have been using the "sensors" command with the https://github.com/groeck/nct6775 driver to watch the CPU core voltage on my MSI B350M Mortar motherboard.
With default BIOS settings the core voltage sometimes goes up to 1.35V, but mostly it runs at 1.09V. Nothing happened.
With some random changes to the BIOS settings the core voltage stays below 1.19V and never reaches 1.20V. The problem occurs.
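In case it helps others compare readings, here is a rough sketch of how the voltage can be pulled out of the "sensors" output. The function name read_voltage is made up for illustration, and the label "in0" is only an assumption: which input actually carries Vcore depends on the board and on the sensors configuration, so check your own output first. It also assumes the label contains no spaces.

```shell
#!/bin/sh
# Read `sensors` output on stdin and print the value and unit for one label.
# "in0" is an assumed default label, not something specific to any board.
read_voltage() {
    label="${1:-in0}"
    awk -v label="$label:" '
        $1 == label {
            sub(/^[+]/, "", $2)   # drop the leading plus sign
            print $2, $3          # value and unit, e.g. "1.09 V"
            exit
        }'
}
```

To log a reading every second: while sleep 1; do sensors | read_voltage in0; done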
I tried the options you listed (no turbo, no C6, 3.2GHz, LLC and increasing the core voltage) and was still able to generate the segfaults while compiling software, so unfortunately that didn't work for me.
Quick update - I think my earlier error with prime95 in Windows may have been unrelated to the current issue. It could have been heat-related, or possibly I was still trying out 3200MHz with my memory. In any case, I have not been able to reproduce errors with prime95 (v29.1). Windows 10 (9hrs), Gentoo (3.5+hrs) and Ubuntu (2+hrs) can all run prime95 to their heart's content without errors. (The numbers in brackets are how long I left it running - in no case did prime95 fail a test.) These prime95 runs were all with the CPU at stock settings and the RAM at 2933MHz, which is the highest 'stable' speed I can get my memory going.
No amount of fiddling with BIOS settings has resulted in a stable system for compiling software for me. I can also confirm that the problem is not heat-related, though, as my AIO water cooler arrived yesterday, and after installing it my CPU temps have dropped by almost 20C with no discernible impact on stability while compiling.
There is still the possibility that this is caused by a bug in Linux, as I have not 'crashed' Windows in the same way. I would like to eliminate that possibility, but I don't know how to simulate similar CPU activity under Windows. Windows 10 will run prime95, Passmark's benchmark, the free versions of 3Dmark and others with no problems, but these don't really do the same type of activity as compiling large software packages under Linux. That tends to be heavily multithreaded for 3-4 minutes as a bunch of source files get compiled into object files, followed by a short period (20s-1m) where it's mostly single-threaded but with lots of memory shuffling around as the object files get linked, then back to multithreaded as it compiles the next batch of source files.
I was getting segfaults with bash and glibc, so I recompiled them using the newest versions from Ubuntu zesty (I'm running Ubuntu xenial). No segfaults since then, but the sudden reboots still happen unless I raise the voltage.
Raising the core voltage as well as the NB/SoC voltage.
Run memtest86+ and see if your memory kit has problems.
My mobo has an option to set a fixed core voltage instead of a dynamic voltage (offset). You should try that option.
As suggested on the Phoronix forum, I tried setting /proc/sys/kernel/randomize_va_space to 0 and it seems to do the trick for me. Please try it.
(Of course it should be considered just a starting point for investigating the issue.)
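For anyone who wants to try it, a minimal sketch of checking the setting follows. The helper name aslr_state and its optional path argument are my own, added only so it can be pointed at a test file. The values 0, 1 and 2 are the ones the kernel documents for randomize_va_space. Changing the value needs root and does not survive a reboot.

```shell
#!/bin/sh
# Print the current ASLR mode in words.
# The optional argument is a hypothetical override of the file to read.
aslr_state() {
    file="${1:-/proc/sys/kernel/randomize_va_space}"
    value=$(cat "$file") || return 1
    case "$value" in
        0) echo "disabled" ;;
        1) echo "partial" ;;   # stack, mmap base and VDSO randomized
        2) echo "full" ;;      # additionally randomizes the heap (brk)
        *) echo "unknown: $value"; return 1 ;;
    esac
}

# To disable ASLR until the next reboot (as root):
#   echo 0 > /proc/sys/kernel/randomize_va_space
# To go back to full randomization:
#   echo 2 > /proc/sys/kernel/randomize_va_space
```

Running aslr_state before and after the echo confirms that the change actually took effect.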
> As suggested on phoronix forum, I tried /proc/sys/kernel/randomize_va_space set to 0
> and it seems to do the trick for me, please, try it. (Of course it should be considered just as starting point to investigate the issue)
In my case, this workaround seems to work. Before trying this, I could reproduce this problem at least once per ten Linux kernel builds (make -j16).
However, after trying this, that build worked fine 100 times without SEGV.
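A repeat-and-count loop along these lines is one way to get numbers like that. This is only a sketch, not necessarily the script used for the runs above, and the function name count_failures is made up for illustration. It counts any non-zero exit status as a failure, so a failed build still has to be checked by hand (for example in dmesg) to see whether it was really a segfault.

```shell
#!/bin/sh
# Run a command repeatedly and print how many runs failed.
# Usage: count_failures RUNS COMMAND [ARGS...]
count_failures() {
    runs="$1"
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "run $i failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

From the top of a configured kernel tree: count_failures 100 sh -c 'make clean && make -j16'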
Just a note about overclocking voltages:
The MSI X370 SLI Plus BIOS contains a button that overclocks the CPU (Ryzen 5 1600) from 3.2GHz to 3.6GHz and changes the fan envelope. An unexpected issue is that turning the button off does not lower the core voltage (Vcore) back to its normal level, that is, from 1.464 Volt back to 1.2 Volt. The normal 1.2 Volt is only restored by clearing the CMOS.
3.2GHz @ 1.464V is unstable (CPU hits 95℃ and gets automatically throttled from 3.2GHz to about 2.7GHz during stress testing in AIDA64), and 3.2GHz @ 1.2V is stable (max CPU temperature during AIDA64 stress test is 76℃). I didn't test 3.6GHz @ 1.464V, but I would expect the system to be unstable at this voltage as well.
FYI about this workaround.
a) Its effect
> > As suggested on phoronix forum, I tried /proc/sys/kernel/randomize_va_space set to 0
> > and it seems to do the trick for me, please, try it. (Of course it should be considered just as starting point to investigate the issue)
> In my case, this workaround seems to work. Before trying this, I could reproduce this problem at least once per ten Linux kernel builds (make -j16).
> However, after trying this, that build worked fine 100 times without SEGV.
Unfortunately, a person on the Phoronix forum said that echo 0 >/proc/sys/kernel/randomize_va_space, which
disables ASLR, did not bypass this problem.
b) Its logic
This workaround, disabling ASLR, is based on the following logic from Matt.
If that logic is correct, this problem should also disappear after disabling SMT.
However, in my case, it didn't. I'm asking him how he arrived at that logic.
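When testing with SMT off, it is worth confirming from inside Linux that the BIOS setting really took effect. A small sketch, assuming the usual sysfs topology file for cpu0 (the function name smt_active and the optional path argument are hypothetical, for illustration only): if cpu0 lists more than one thread sibling, SMT is still on.

```shell
#!/bin/sh
# Print "yes" if cpu0 has more than one hardware thread, "no" otherwise.
# The optional argument is a hypothetical override of the file to read.
smt_active() {
    file="${1:-/sys/devices/system/cpu/cpu0/topology/thread_siblings_list}"
    siblings=$(cat "$file") || return 2
    case "$siblings" in
        *[,-]*) echo "yes" ;;   # e.g. "0-1" or "0,8": two siblings listed
        *)      echo "no" ;;    # e.g. "0": cpu0 is its own only sibling
    esac
}
```

With SMT disabled, smt_active should print "no" and nproc should report half the usual number of CPUs.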