I achieved relative stability. I found out that my BIOS settings weren't taking effect: I had disabled SMT but couldn't turn it back on. So I did a CMOS clear, and that fixed SMT.
Then I thought: if the BIOS is unreliable, I need a new baseline benchmark. So I cleared the CMOS again, booted, and ran a new benchmark.
After 10 runs, the average time between failures was 341 seconds.
Then I rebooted, cleared the CMOS again, and the only change I made in the BIOS was to disable SMT.
It has been running for 12 hours without a failure.
I think there really is an SMT problem. But I suspect there are other problems as well.
Next I'm going to raise my memory clock to see whether it can run as fast as it had been, but with stability. If it fails, I'll clear the CMOS, disable SMT, and see if stability returns.
Update: This is horrific. I cleared the CMOS and now I can't get into the BIOS. Apparently, as soon as I press Delete to enter the BIOS, USB crashes and my keyboard and mouse stop working. A plain USB keyboard doesn't work either, in the front or back ports. It will boot, but I can't change any settings...
Man... Here I thought I was onto something.
Update: I think I damaged the mobo trying to clear the CMOS... I'll see if I can find the damage and repair it today. That's what I get for using a giant screwdriver in a tight spot, without enough light, and in a hurry.
Update: The mobo looks fine, but when the system is cold it won't POST. It resets over and over and will not stop until I hold the power button down and force it off. Then, to get it to boot, I can clear the CMOS and it will start up. I still can't get into the BIOS settings, though...
That is exactly what happened to me: it is totally random and surprising; the moment you think you've found some stability is the moment the "bug" reappears in all its glory. I even bought a brand-new HDD because I thought my SSDs and HDDs had failed. I installed a fresh Gentoo stage 3 on it and recompiled it for 3 days without any problem. All of a sudden the bug said "hello" and everything started again. You might think it is a temperature problem; nope: the segfaults started in the morning right after a boot.
So, after 4 motherboards, 1 PSU, 3 RAM kits and 1 HDD, I decided to RMA the CPU, send the last motherboard back, and, as a last hope, I am waiting for brand-new components to arrive. I have been fighting "the bug" since mid-May, and this should have been a production machine (a machine that pays my rent through the work I do on it).
This situation is really strange.
Yes, but that's probably a different problem. (My mobo/CPU is a Taichi X370 / Ryzen 1800X.)
I've been working on debugging this, and it seems to be related to the kernel, not to the CPU.
I haven't yet been able to reproduce it on vanilla kernel 4.12-rc5.
There is also (probably?) a third problem with MCEs (machine check exceptions).
The crashes persist, though, regardless of kernel version or whether SMT is enabled or disabled.
> * What I will do next
> Make a simpler reproducer that can trigger this problem more quickly
Unfortunately, I've not created a better reproducer than the current one (building a heavy program).
However, I have made some progress as a result of a few experiments.
This problem can happen with just one core, so it can be said that it is not caused
purely by the interaction of multiple cores.
I build a Linux kernel (v4.11.5, defconfig) with a build script, limiting the number of
LCPUs from 16 (my Ryzen 1800X's max) down to 8 (one CCX), 2 (one core, two threads),
and 1 (one core, one thread). SEGVs happened in all cases. As the number of LCPUs
becomes smaller, the probability of a SEGV becomes lower too.
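The script itself isn't reproduced above, so here is a minimal sketch of what such a loop could look like. The helper names, paths, job counts, and the sysfs method for offlining CPUs are my own assumptions, not the original script.

```shell
#!/bin/sh
# Sketch: limit the number of online LCPUs, then loop a kernel build
# (v4.11.5, defconfig) until the compiler or a build tool crashes.

# Take every logical CPU numbered >= $1 offline via sysfs (requires root;
# cpu0 cannot be offlined and is skipped by the comparison).
limit_cpus() {
    want=$1
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        n=${cpu##*cpu}
        [ "$n" -ge "$want" ] && echo 0 > "$cpu/online"
    done
}

# Repeat a command until it exits non-zero, reporting each iteration.
run_until_failure() {
    n=0
    while "$@"; do
        n=$((n + 1))
        echo "iteration $n ok"
    done
    echo "failed on iteration $((n + 1))"
}

# Example (not run here): one core, one thread, then loop the build.
# limit_cpus 1
# run_until_failure sh -c 'make -s clean && make -s defconfig && make -s -j8'
```

The build command itself stays inside `run_until_failure`, so the loop stops at the first SEGV and reports which iteration died.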
* What I'll do next
I'll keep trying to make a better reproducer. Since this problem happens with one core,
the chances of creating one are higher. In theory, the workload needed to hit the SEGV
can be extracted from the core dump taken at SEGV time with the current reproducer.
Checking in from AMD. The vast majority of users using Ryzen for Linux code and development have reported very positive results.
A small number of users have reported some isolated issues and conflicting observations. AMD is working with users individually to understand and resolve these issues.
If you are having a specific issue, please raise a service request on: http://support.amd.com/en-us/contact/email-form.
Use “Ryzen Linux Forum Discussion” as the subject and an AMD support person will contact you.
Many of us opened a service request; I did. I got an email where someone asked me to confirm the components of my system, and then nothing more. Anyway... are you saying that the many people having the very same problem (not conflicting observations, but the very same bug: massive parallel compilation fails) are just unlucky, and there is no problem in the CPU? Can you confirm that?
Question: does this problem also occur with AMD graphics cards, or is it restricted to NVidia cards?
This issue has occurred on my system with an AMD RX 460, so I don't believe the issue is GPU-related.
But can you reproduce it at all at AMD? If not, maybe you should get some consumer-grade test setups...
Thank you for your reply. I submitted a new support request.
> A small number of users have reported some isolated issues and conflicting observations. AMD is working with users individually to understand and resolve these issues.
> If you are having a specific issue, please raise a service request on: http://support.amd.com/en-us/contact/email-form.
Please keep in mind that you said the following in the past.
> There is no need to open new tickets on this issue. We are investigating and as soon as there is any updates, i will let you all know in this thread.
I had already opened a service request about this problem. After finding this thread, I moved the discussion here, since you said as much above.
Update on my issue:
I was able to (hopefully?) resolve my issue by setting the SoC voltage to 1.185 V* on my 1800X. I left my computer on overnight recompiling the Linux kernel and it hasn't failed yet. Before this change, it would fail within the first few compilations. I am also overclocked to 3.9 GHz @ 1.385 V vcore, but the issue occurred at stock clocks/voltages too, so I'm not sure how relevant that is.
EDIT: fix voltage
Do you remember what the original value for "SOC voltage" was? And what is its exact name in the BIOS? (There is no such setting in the latest beta BIOS for the ASUS Prime B350-Plus.)
I appreciate the reply. I have opened a support ticket as requested.
I do find this issue truly maddening, since any time I think I've made progress towards resolving it, it pops back up. With my new motherboard, I could no longer trigger the segfaults in a timely fashion with my looping compile script. However, I can do so if I loop through compiling mesa while simultaneously running stress-ng with the --aggressive option. In general, the compiler/libc/sh will segfault within 15 minutes or so in that scenario. But it will occasionally run for hours before segfaulting, which makes it truly frustrating to try to test any changes. You think you've hit on the solution, then bam, it dies again.
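For anyone who wants to retrace this, the reproducer amounts to something like the sketch below. The stress-ng flags match my actual run; the build command, the mesa path, and the variable names are placeholders, not my exact script.

```shell
#!/bin/sh
# Sketch of the reproducer: stress-ng loads all 16 logical CPUs in the
# background while a build loops in the foreground until something crashes.
STRESS_CMD=${STRESS_CMD:-"stress-ng --cpu 16 --aggressive"}
BUILD_CMD=${BUILD_CMD:-"make -C mesa -j16"}

# Start the stressor in the background and remember its PID.
$STRESS_CMD &
stress_pid=$!

# Loop the build; a segfaulting compiler/libc/sh makes it exit non-zero.
runs=0
while sh -c "$BUILD_CMD"; do
    runs=$((runs + 1))
    echo "build $runs ok, looping"
done
echo "build failed after $runs successful runs; check for segfaults"

kill "$stress_pid" 2>/dev/null || true
```

The point of running the stressor and the build together is that neither alone triggered the bug reliably on the new board.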
Recently I have tried:
Upgrading to kernel 4.11.6
Enabling or disabling NUMA in the kernel
Enabling or disabling ASLR
Using the performance cpu scaling governor (normally I use ondemand)
Recompiling the system (gcc 6.3) with -O2 -pipe -march=znver1 -mtune=haswell compiler flags
Recompiling the system (gcc 6.3) with just -O2 -pipe
"Overclocking" the RAM to 2933 MHz
Overclocking the CPU to 3.7 GHz, VCore 1.25V (N.B. can run prime95 overnight without failure and without the CPU ever exceeding 50 degrees C)
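For reference, two of the toggles in the list above are plain procfs/sysfs writes. These are the standard Linux paths, not commands from my notes, and they need root:

```shell
# Disable ASLR system-wide (0 = off; the kernel default is 2, full randomization).
echo 0 > /proc/sys/kernel/randomize_va_space

# Switch every core to the performance cpufreq governor.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
```

Both settings revert on reboot, which is convenient when bisecting a flaky machine.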
Interestingly (though perhaps coincidentally), it has always been the compilation task that segfaults when I run it with stress-ng simultaneously; stress-ng itself always exits with "successful run in XYZ seconds...". For example, here is the most recent output from stress-ng after I had yet again thought I'd found a solution, but then the dreaded 'internal compiler error' happened after nearly 2 hours of compiling mesa in a loop.
vladimir ~ # stress-ng --cpu 16 --aggressive --perf
stress-ng: info: defaulting to a 86400 second run per stressor
stress-ng: info: dispatching hogs: 16 cpu
stress-ng: info: cache allocate: default cache size: 8192K
^Cstress-ng: info: successful run completed in 6413.21s (1 hour, 46 mins, 53.21 secs)
stress-ng: info: cpu:
stress-ng: info: 185,561,835,725,520 CPU Cycles 28.93 B/sec
stress-ng: info: 206,017,728,042,416 Instructions 32.12 B/sec (1.110 instr. per cycle)
stress-ng: info: 0 Cache References 0.00 sec
stress-ng: info: 0 Cache Misses 0.00 sec
stress-ng: info: 14,861,412,201,136 Stalled Cycles Frontend 2.32 B/sec
stress-ng: info: 27,660,103,488,112 Stalled Cycles Backend 4.31 B/sec
stress-ng: info: 40,164,506,861,184 Branch Instructions 6.26 B/sec
stress-ng: info: 284,832,819,648 Branch Misses 44.41 M/sec ( 0.71%)
stress-ng: info: 11,616 Page Faults Minor 1.81 sec
stress-ng: info: 0 Page Faults Major 0.00 sec
stress-ng: info: 26,505,696 Context Switches 4.13 K/sec
stress-ng: info: 19,112,608 CPU Migrations 2.98 K/sec
stress-ng: info: 0 Alignment Faults 0.00 sec
stress-ng: info: 11,616 Page Faults User 1.81 sec
stress-ng: info: 0 Page Faults Kernel 0.00 sec
stress-ng: info: 976 System Call Enter 0.15 sec
stress-ng: info: 960 System Call Exit 0.15 sec
stress-ng: info: 0 TLB Flushes 0.00 sec
stress-ng: info: 387,792 Kmalloc 60.47 sec
stress-ng: info: 14,768 Kmalloc Node 2.30 sec
stress-ng: info: 10,343,472 Kfree 1.61 K/sec
stress-ng: info: 70,048 Kmem Cache Alloc 10.92 sec
stress-ng: info: 14,784 Kmem Cache Alloc Node 2.31 sec
stress-ng: info: 78,082,688 Kmem Cache Free 12.18 K/sec
stress-ng: info: 11,312 MM Page Alloc 1.76 sec
stress-ng: info: 1,319,952 MM Page Free 205.82 sec
stress-ng: info: 248,618,560 RCU Utilization 38.77 K/sec
stress-ng: info: 614,288 Sched Migrate Task 95.78 sec
stress-ng: info: 0 Sched Move NUMA 0.00 sec
stress-ng: info: 18,220,864 Sched Wakeup 2.84 K/sec
stress-ng: info: 19,744 Signal Generate 3.08 sec
stress-ng: info: 16 Signal Deliver 0.00 sec
stress-ng: info: 961,600 IRQ Entry 149.94 sec
stress-ng: info: 961,600 IRQ Exit 149.94 sec
stress-ng: info: 83,290,240 Soft IRQ Entry 12.99 K/sec
stress-ng: info: 83,290,240 Soft IRQ Exit 12.99 K/sec
stress-ng: info: 0 Writeback Dirty Inode 0.00 sec
stress-ng: info: 0 Writeback Dirty Page 0.00 sec