If you can disable OpCache, you'll find it fixes the problem.
I've been testing this for 3 days.
After I disabled OpCache, I tried the memory at 2400, then 2666, and it runs stable as it should.
So I just wonder how much performance impact there will be after disabling OpCache.
I'm not sure if this is exactly what you want, but I would try resetting your BIOS to default settings (i.e. no overclock, let the MB handle the voltages etc.) and then try the following script:
It will create a file in the current directory called 'compile-loop.count' and then repeatedly emerge the mesa package (or whatever you tell it to as the first argument) and keep track of how many times it successfully compiles. The script passes the -B (build package only) flag to emerge so that it does not install the software each time.
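Not the original script, but a minimal sketch of the loop described above (the file name of the script and everything beyond the -B flag are assumptions):

```shell
#!/bin/bash
# Rough reconstruction of the compile-loop script described above (not the
# original): repeatedly build a package with emerge -B and count successes.
pkg="${1:-mesa}"                 # package to build; mesa by default
countfile="compile-loop.count"   # running tally of successful builds

echo 0 > "$countfile"
while true; do
    # -B / --buildpkg-only: build the package but do not install it
    emerge -B "$pkg" || break
    count=$(( $(cat "$countfile") + 1 ))
    echo "$count" > "$countfile"
    echo "Successful builds so far: $count"
done
echo "Build failed after $(cat "$countfile") successful run(s)."
```

A flaky toolchain shows up as the loop stopping after some number of iterations; the count file survives the crash, so runs can be compared.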
Make sure /var/tmp/portage is mounted to tmpfs (and if you compile something big like gcc, make sure that it has enough room). I have 16GB of RAM and use the following in my /etc/fstab:
tmpfs /var/tmp/portage tmpfs size=8G,uid=portage,gid=portage,mode=775,noatime 0 0
By keeping the build directory (and package directory) in RAM, it avoids any excessive wear and tear on your drive.
On my MSI x370 gaming pro carbon board, the script would run for anywhere from 10 minutes to a couple of hours without crashing. With the same hardware, but the x370 Taichi board it can go for 24 hours without a segfault.
Thanks. I "loaded optimized defaults" in the BIOS, set up the RAM drive and ran your script. Everything is at stock frequency, voltage and memory.
I'll let you know how long it lasts.
Update: It died already. It smells funny too; I think it overheated. The cooling in my case isn't great. I'll open it up and try again...
Update 2: I rebooted, opened the case and laid it down to get more airflow, and it died almost immediately. I hope it's not permanent damage. That would suck.
Update 3: I raised Vcore SOC (and a bunch of other settings for the hell of it) and it failed again within a few minutes... I think I was just getting lucky before.
I'll start playing with options one at a time later this week.
It's clear that some users are reporting problems that don't relate to this bug. My system is perfectly stable: I don't see any MCE issues in dmesg, no random reboots or black screens. But there is definitely something wrong with gcc compilation, and it's not clear whether it's a Ryzen problem or just a gcc bug.
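For anyone who wants to run the same sanity check, a minimal way to look for machine-check events in the kernel log (mcelog or rasdaemon would give richer detail; this just greps the ring buffer):

```shell
# Look for machine-check exceptions in the kernel ring buffer.
# dmesg may need root on kernels with dmesg_restrict enabled.
dmesg 2>/dev/null | grep -iE 'mce|machine check' || echo "no MCE entries found"
```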
I've made additional tests with gcc 7.1:
1. Default: 7 fails in 309 builds.
2. ASLR disabled: 1 fail in 329 builds.
After that I've tested it back with gcc 5.4:
1. Default: 7 fails in 71 builds.
2. ASLR disabled: 0 fails in 509 builds.
The newer gcc version seems to fail more rarely. Also, disabling ASLR doesn't seem to eliminate the issue, but it greatly reduces the probability of a failure. It's a pity that I don't have an option to disable OpCache in my UEFI settings.
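For reference, ASLR on Linux is controlled via a sysctl; a sketch of how to check and turn it off (whether this is exactly how it was disabled in the tests above is an assumption):

```shell
# Current ASLR setting: 2 = full randomization (the default), 0 = disabled.
cat /proc/sys/kernel/randomize_va_space

# Disable it system-wide (as root; set back to 2 to re-enable):
#   sysctl -w kernel.randomize_va_space=0

# Or disable it for a single command only, no root required:
#   setarch "$(uname -m)" -R gcc -O2 -c test.c
```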
gcc rarely segfaults; it's bash that crashes 99% of the time. So either it's a combination of various bugs in various places, or it's Ryzen's fault. Again, the real question is: why do many users have no problems at all?
My experience is the opposite. In the build log I'm getting messages like this, with a random source file:
./include/linux/slab.h:155:1: internal compiler error: Segmentation fault
void kzfree(const void *);
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
With gcc 7.1 the messages look like this:
drivers/gpu/drm/i915/i915_gpu_error.c: In function 'capture':
drivers/gpu/drm/i915/i915_gpu_error.c:1565:12: internal compiler error: Segmentation fault
static int capture(void *data)
0x75a488 get_inner_reference(tree_node*, long*, long*, tree_node**, machine_mode*, int*, int*, int*)
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
Interestingly, I don't see these segfaults in dmesg, and Apport (Ubuntu's bug report utility) doesn't handle them. A couple of times I've seen make crash, and those do appear in dmesg and Apport creates a crash report for them. But that could just be a problem with make not always properly handling a gcc failure.
It's not easy to isolate, though; there are so many variables (distro, base build system/compilers, environment, etc.). Heck, even within the same distro you can have variance, and that doesn't make it easier for AMD to hunt down. Besides, I've experienced many issues right after a new arch release before, and it takes some time for everyone (distros etc.) to catch up with workarounds and/or fixes.
I wonder if something went wrong with the core Linux scheduling changes for Ryzen. Windows is working as far as I know. It would be interesting to see if gcc on Windows is error-free. OpCache and address space randomization seem to make the biggest difference. I wonder if Microsoft's NUMA scheduler changes are preventing the problem. Seems like a race condition on the chip. Yet Windows does not trigger it?
Can you guys try an experiment for me?
Use mcl00's script from his post above: https://pastebin.com/c3Hk4qFh
Make sure you set up a ramdisk on /var/tmp/portage as he explains above.
Then reboot your machine, enter bios, load defaults, then change exactly two parameters:
Change the Core Vcc to -0.1 (mine reads -0.98 or something like that).
Change the core frequency to 800 MHz (0.8 GHz X8 multiplier).
Then boot your Gentoo, make sure you have the ramdrive mounted, and run mcl00's script under the time command.
Mine's been running for a few minutes now. The mean time between failures before was on the order of 2 minutes, so this looks promising, albeit a bit ridiculous.
Assuming it fails, and I believe it will, please run it a few times, record how long each run lasted, and then get an average time.
Then reboot, reset the BIOS to defaults again, change the core frequency to 800 MHz (but don't change the core voltage this time, so it boots at stock voltage), and run the same series of tests over again.
My hypothesis states that the mean time between failures will be much higher with the core voltage lowered.
If that's the case, it could be a hold time issue within the silicon.
Ah mine just failed after 17 minutes. I'll run the normal vcore next and update my post.
With core voltage set to stock, it ran for 8 minutes...
How about you guys?
> I wonder if something went wrong with the core Linux scheduling changes for Ryzen.
I can reproduce the SEGV with kernels from both before and after that change, so that change is not the root cause.
> I wonder if Microsoft's NUMA scheduler changes are preventing the problem. Seems like a race condition on the chip.
Windows binaries don't use ASLR by default, so I guess that reduces (but doesn't eliminate) the probability of a SEGV.
> Yet windows does not trigger it?
As I've reported several times, the Windows Subsystem for Linux (WSL), so-called "Bash on Ubuntu on Windows", triggered this kind of problem (see my past reports for the details). WSL is the Linux userland on the Windows kernel (more precisely, it consists of a Linux emulation layer and the NT kernel). And NetBSD triggered a very similar problem. Yes, WSL is not Windows proper (which consists of the Windows userland, the Windows subsystem, and the NT kernel), but at least we can say that this kind of problem happens on different OS kernels. That's one of the reasons why I consider it a hardware problem.
As I reported before, his logic was based on the interaction between hyperthreads in the same core. However, many users have already reported that the SEGVs can't be eliminated by disabling hyperthreading (SMT). So his logic can't explain this problem, at least not in all cases.
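As a side note, on kernels 4.19 and newer SMT can be checked (and toggled) from sysfs without a trip into the BIOS; at the time of this thread the options were the BIOS SMT setting or the nosmt boot parameter. A quick status check:

```shell
# Reports "on", "off", "forceoff", or "notsupported" on 4.19+ kernels;
# older kernels don't have this sysfs node at all.
cat /sys/devices/system/cpu/smt/control 2>/dev/null || echo "sysfs node not present"
```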
> Interestingly, I don't see these segfaults in dmesg, and Apport (Ubuntu's bug report utility) doesn't handle them.
> A couple of times I've seen make crash, and those do appear in dmesg and Apport creates a crash report for them.
> But that could just be a problem with make not always properly handling a gcc failure.
It's the same in my case (I'm an Ubuntu user). The Linux kernel suppresses the "segfault at..." message for processes that handle the SEGV themselves. Please see the following kernel source.
I know some users do see that message for gcc; I guess that's because some distros tweak their kernel.
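In short: cc1 catches SIGSEGV itself (that's how it prints the "internal compiler error" banner), so the kernel stays quiet. Whether unhandled faults get logged is gated by a sysctl; a quick way to inspect it (on x86, where debug.exception-trace backs show_unhandled_signals):

```shell
# 1 means the kernel logs "segfault at ..." lines for *unhandled* faults;
# faults caught by the process's own signal handler are never logged.
cat /proc/sys/debug/exception-trace 2>/dev/null || echo "sysctl not available"
```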
Let me summarize which component is at fault (I bet it's Ryzen), taking into account my past analysis and the facts reported here, because the information is getting complicated. Sorry to repeat opinions already described in my previous posts.
First, apparently gcc is not the root cause. If there were bad code in gcc, it would always fail at the same place under the same compilation workload. In addition, since each compilation process (cc1) is a single-threaded, single process, this problem is probably not a timing issue in gcc itself. Furthermore, some people report other processes, like bash, dying under the same load, and that phenomenon can't be explained by a gcc bug.
Second, no user process is the root cause. As I reported, at least in my case the reason for the SEGVs is an unexplained General Protection Fault, which a user process can't cause by itself. So the suspect is the Linux kernel or the hardware.
Third, I don't think it's a kernel problem, since very similar problems have happened on other OS kernels, such as the Windows Subsystem for Linux (so-called WSL or "Bash on Ubuntu on Windows") and NetBSD. If this problem were not caused by a hardware issue, it would mean that all of those OSes have very similar bugs. I don't believe in that kind of coincidence; it would be a really rare case.
Fourth, I don't think it's any hardware other than the Ryzen CPU. There are plenty of hardware combinations that trigger this problem, and the only component common to all of them is Ryzen. The probability of many different pieces of hardware all having the same problem is lower than the probability that one piece of hardware, in this case Ryzen, has a problem. Yes, I admit this logic is weaker than the points above.
Has anyone seen anything like this:
[21273.142643] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[21273.142643] Modules linked in: joydev hid_logitech_hidpp hid_logitech_dj binfmt_misc snd_hda_codec_hdmi nvidia_drm(PO) nvidia_modeset(PO) snd_hda_codec_realtek snd_hda_codec_generic nvidia(PO) uvcvideo snd_hda_intel videobuf2_vmalloc edac_mce_amd videobuf2_memops snd_hda_codec ppdev edac_core videobuf2_v4l2 videobuf2_core snd_hwdep snd_hda_core crct10dif_pclmul crc32_pclmul videodev crc32c_intel snd_pcm ghash_clmulni_intel pcspkr drm_kms_helper media r8169 snd_timer snd drm sp5100_tco mii soundcore i2c_piix4 ccp parport_pc parport i2c_designware_platform tpm_infineon i2c_designware_core shpchp wmi acpi_cpufreq tpm_tis tpm_tis_core tpm
[21273.142650] CPU: 0 PID: 11699 Comm: sh Tainted: P O 4.11.3-gentoo #2
[21273.142651] Hardware name: Gigabyte Technology Co., Ltd. Default string/AB350M-D3H-CF, BIOS F1 02/20/2017
[21273.142651] task: ffff9130ea5b0000 task.stack: ffffb31255e94000
[21273.142651] RIP: 0033:0x45c224
[21273.142651] RSP: 002b:00007ffd16c25398 EFLAGS: 00000206
[21273.142652] RAX: 0000000000000000 RBX: 0000fd854b5c81c0 RCX: 0000000000000000
[21273.142652] RDX: 00000000023bf700 RSI: 0000000000000001 RDI: 00000000023a3c80
[21273.142652] RBP: 0000000000000000 R08: 00000000ffffffff R09: 00000000023bf740
[21273.142652] R10: 0000000000000010 R11: 0000000000000000 R12: 0000000002307000
[21273.142653] R13: 0000000000000000 R14: 00000000023c59b0 R15: 0000000000000000
[21273.142653] FS: 00007f871bcef700(0000) GS:ffff91310e400000(0000) knlGS:0000000000000000
[21273.142653] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21273.142653] CR2: 00007f84672cb008 CR3: 00000003bb27e000 CR4: 00000000003406f0
I agree with you that there is likely something going on with the Ryzen CPU. But I am now at the point where I can't troubleshoot any further, as I can't reproduce the error in any reasonable time frame. From my experience, Ryzen seems to be very sensitive to both RAM speeds and CPU/system voltages (and possibly heat too). Stressing the system with multi-threaded compiler tasks while any of those factors are not at 100% stable settings results in segfaults. But there is clearly something else at play here that seems to be related to OpCache / SMT / address space randomization, since disabling any or all of those three has a significant impact for a subset of the people experiencing crashes. There are a few cases of meaningful debug info, e.g.:
- further clarified on the Phoronix forum here - https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498
- and a couple of posts here - https://forums.gentoo.org/viewtopic-t-1061546-postdays-0-postorder-asc-start-175.html
These all suggest there is a problem that should be reproducible in a fairly consistent manner with the right code. Unfortunately I do not have sufficient technical knowledge to do so myself.
For the record, despite my system being much (much much) more stable with the ASRock Taichi compared to my initial motherboard, I have still experienced some odd behaviour. I was able to compile mesa for 12 hours, followed by compiling gcc6.3 for 12 hours with no segfaults. But then at some point the next night while recompiling my entire system (to use -march=znver1, since I had previously recompiled everything to generic x86_64 code) I did encounter a segfault. Interestingly, after that point, my system was highly unstable and would segfault within minutes of starting any new compiles. Powering down and restarting the system fixed that and it was once again stable to compile for hours. I have also had one instance of the system locking up, though I don't recall what it was doing at the time (if anything).
As an aside, it would be very nice at this point to have some official acknowledgement from AMD, as the last communication was quite some time ago.