This is almost word for word what I'm seeing.
Disabling SMT didn't remove the problem, and there's no option to disable OPCache on the MSI Mortar Arctic. Setting the LLC to 5/8 didn't help either.
> In the Asus BIOS there is an option called OPCache Control. Disabling this may resolve this issue.
> Another suggestion is to try disabling SMT. Look for an option in the Bios called 'Disable SMT'.
Thank you for sharing this information. However, as I said on May 23, 2017 at 6:14 PM in this thread,
the workarounds mentioned above didn't work for me. In addition, although my board is an ASUS PRIME X370 Pro,
its BIOS has no OPCache option.
Is AMD working on a new AGESA that fixes this problem, rather than only offering workarounds that may work for some people?
I am having exactly the same issues as the original poster (and the numerous others that have posted on the gentoo forum linked in the original message). I do not have an OP Cache setting in my motherboard (MSI X370 Gaming Pro Carbon) so I am not able to disable it. Turning off SMT does not fix the problem.
I have tried various combinations of the following with little to no effect:
Varying clock speeds and timings of my RAM (Corsair 3200MHz) as low as 1866MHz CL 16.
Using the "performance" CPU governor.
Various LLC settings from auto (off?) through 4.
Setting the NB voltage up to 1.15V
While the problem encountered is 'random' segmentation faults in that they do not occur in any fixed memory address or particular part of a compile, the system will very consistently crash / segfault in any highly multi-threaded process that uses a lot of RAM. To reproduce the issue, I simply loop through compiling mesa 17.0 with -j16 and the build directory mounted to tmpfs (i.e. a ramdisk location for the build files). If I make it past 10 minutes without a segfault it's a lucky run.
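One way to check the "no fixed memory address" observation is to pull the segfault lines out of the kernel log and count how many distinct addresses and processes show up. The following is only a rough sketch: it assumes the usual x86 Linux kernel message layout (`name[pid]: segfault at ADDR ip IP sp SP error N ...`), and the function names `parse_segfaults` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of an x86 Linux kernel segfault message, e.g.
#   name[pid]: segfault at ADDR ip IP sp SP error N in binary[...]
# A leading dmesg timestamp, if present, is simply skipped by the search.
SEGFAULT_RE = re.compile(
    r"(?P<proc>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-fA-F]+) "
    r"ip (?P<ip>[0-9a-fA-F]+) sp (?P<sp>[0-9a-fA-F]+) error (?P<err>\d+)"
)


def parse_segfaults(log_text):
    """Return one dict per segfault message found in the given log text."""
    return [m.groupdict() for m in SEGFAULT_RE.finditer(log_text)]


def summarize(faults):
    """Count faults, distinct fault addresses / instruction pointers, and faults per process."""
    return {
        "count": len(faults),
        "distinct_addresses": len({f["addr"] for f in faults}),
        "distinct_ips": len({f["ip"] for f in faults}),
        "by_process": dict(Counter(f["proc"] for f in faults)),
    }
```

Feeding it the saved output of `dmesg` (for example `summarize(parse_segfaults(open("dmesg.txt").read()))`, where `dmesg.txt` is a hypothetical file name) would show whether the faults cluster at one address or are spread across many addresses and processes.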
I can't monitor CPU temperatures within Linux yet, but this does not appear to be heat related - cool ambient temperatures with the case open and a room fan blowing directly into the case did not increase the stability to any noticeable degree (and the CPU temperatures in Windows running prime95 with 16 workers stay reasonable).
Note that this problem is not limited to compilation tasks in Linux - prime95 will throw errors as well. It's just much less frequent (e.g. where compiling mesa in a ramdisk will segfault in minutes, prime95 can go for a few hours before complaining.)
I would really appreciate a response as "This question is Assumed Answered." is not true. The problem exists, and even if disabling SMT "fixed" it (which I'll repeat - it doesn't) that isn't an answer.
EDIT: I forgot to mention I also tried each of my memory sticks (2x8GB) independently without any improvement. If one stick was bad, you would expect to see segfaults with that stick but not the other. Both sticks together and each independently all display the same behaviour. (Note that I haven't tried every combination of settings with every permutation of memory installed - just the default settings with the single DIMMs.)
Additionally, memtest86 will run through at least two cycles without error even with the RAM set at 3200MHz.
Since Gentoo is a source-based distribution, a significant amount of time setting up and/or updating the system involves compiling software packages, which increases the impact of this bug significantly for that community and is likely why you see the most comments from Gentoo users. I actually use Ubuntu as my primary OS, but I am able to reproduce the problem most consistently under Gentoo so that is what I have used to try out various BIOS tweaks. That said, to rule out OS-specific issues I did a test compile of gcc under Ubuntu 16.04 and was able to reproduce the problem (again, using make -j16). Under Ubuntu I didn't set the build directory up in a tmpfs mounted file system, but even running from my SSD and not from RAM I still can't consistently get through the full compile. I did once get it to compile twice in a row without a segfault, but that's the exception (and still not acceptable...)
To rule out a "Linux-specific" issue, I ran prime95 with 16 threads under Windows 10 to see if that was stable. As noted earlier it was not (although earlier I failed to mention I was running prime95 in Windows). Prime95 does run successfully for significantly longer than a multi-threaded compile, however, so I have not been using that to test BIOS settings.
To be fair to amdmatt's suggestions, disabling SMT is the one thing that makes the biggest difference for my system stability while compiling. My test compile of mesa-17.0 was able to successfully complete nearly 14 times in a row before crashing with SMT disabled and make reduced to -j8. That said, I still don't feel that this is an acceptable solution - I didn't buy a 16-thread processor to run it with half the threads disabled (and even then not be 100% sure that it's not going to crash, or corrupt my data).
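For anyone who wants to compare BIOS settings the same way, the "successful builds in a row" figure can be counted with a small harness instead of by hand. This is a minimal sketch, not the script used for the numbers above: `count_successful_runs` is a hypothetical helper, and the command and build directory passed to it are whatever applies to your own setup.

```python
import subprocess


def count_successful_runs(command, max_runs, cwd=None):
    """Run `command` repeatedly and stop at the first failure.

    Returns (successes, exit_code), where exit_code is the return code of the
    failing run, or 0 if all max_runs runs succeeded. When a compiler segfaults
    under make, make itself usually exits with a non-zero status rather than
    being killed by a signal, so any non-zero code is treated as a failure.
    """
    successes = 0
    for _ in range(max_runs):
        result = subprocess.run(
            command,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return successes, result.returncode
        successes += 1
    return successes, 0


# Hypothetical usage: the path is a placeholder, and a real test would also
# need to clean the build tree between runs so each run recompiles everything.
# runs, code = count_successful_runs(["make", "-j8"], max_runs=20, cwd="/path/to/build")
```

Because the failures are intermittent, a single count says little; repeating the measurement a few times per BIOS setting gives a fairer comparison.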
> How many of you seeing problems are running Gentoo ? So far my impression is "most" at least...
As far as I know, most of them are indeed Gentoo users. I guess that's because heavy compilation workloads,
which trigger this problem, are daily work on Gentoo. I reproduced this problem on Ubuntu, and
maxrussell (2017/06/01 17:33) is a CentOS user.
This problem is not distro-specific.
> I guess It's because heavy compilation workload, which causes this problem, is the daily work of Gentoo.
Hmm, good point. I had been thinking about Gentoo from the point of view of the compiler binaries having possibly been compiled with problematic compiler options or the kernel picking up some specific combination of patches but missed the "you do a lot of compiling" aspect.
> I had been thinking about Gentoo from the point of view of the compiler binaries having possibly been compiled with problematic compiler options or the kernel picking up some specific combination of patches but missed the "you do a lot of compiling" aspect.
I don't think this is an issue specific to compiler options.
I have compiled everything on kernel 4.11.3 with gcc 6.3, using no `-march` option and `-mtune=generic`, i.e. targeting generic amd64 CPUs rather than Ryzen specifically, but I still have this issue.
I also think that a heavy workload, especially compilation, is NOT the cause of the problem; it only makes it show up much sooner.
> How many of you seeing problems are running Gentoo ? So far my impression is "most" at least...
My Ryzen 1600 seems to be fine while having -j12 in MAKEOPTS, but it is just a few days old so it may be too early to tell. The CPU isn't overclocked.
I was experiencing Linux boot issues after I installed Ryzen, but those seem to have been resolved. Windows 10 is booting and working fine.
Also running into this intermittently on fc25 while compiling kernels, even though the system is rock stable otherwise (per prime95 torture, memtest).
Using an ab350pro4 with the most recent BIOS (AGESA 1004 based), so no access to LLC or OPCache settings. Ryzen 1600 at stock voltage+speeds initially, currently at a tested-stable/modest OC of 3.7&3.25v. CMK16GX4M2B3000C15 @ 2933 MHz XMP profile.