ghueller
Journeyman III

Linux on 3700X: spontaneous reboots caused by MCE

Hi,

I am running Linux (Fedora 31) on my build from last July, consisting of:
- Crucial DDR4 3000 Sticks
- Radeon RX 570 (MSI)
- Asrock Phantom Gaming 4 (latest BIOS)
- Ryzen 3700x

The system is fast and, at least under Windows 10, runs fine.
Temps are OK, the PSU is of high quality, and the memory passes hours of memtest86 without errors.

Yet, when running Linux, I get a short freeze followed by a reboot about once a week.
At the next boot, the following machine check exception is logged:

[    0.707393] mce: [Hardware Error]: Machine check events logged
[    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.707464] mce: [Hardware Error]: TSC 0 ADDR 1ffffbb03343c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.707540] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 5 microcode 8701013
[    0.709397] mce: [Hardware Error]: Machine check events logged
[    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
[    0.709468] mce: [Hardware Error]: TSC 0 ADDR 1ffffbba3a05a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.709543] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 9 microcode 8701013
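
As a rough aid for reading the status word in these lines, here is a minimal Python sketch that splits `bea0000000000108` into its flag bits and error code. The bit positions are an assumption on my part: they follow the architectural MCi_STATUS layout plus the AMD-specific flags as named in the Linux kernel headers, not anything stated in this thread. The helper name `decode_status` is made up for illustration.

```python
# Assumed bit layout of an MCA status register (architectural bits plus the
# AMD-specific flags as named in the Linux kernel's mce.h). Not from the thread.
STATUS_FLAGS = {
    63: "VAL",       # register contains a valid error
    62: "OVER",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting was enabled
    59: "MISCV",     # MISC register is valid
    58: "ADDRV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt (AMD)
    53: "SYNDV",     # SYND register is valid (AMD)
    46: "CECC",
    45: "UECC",
    44: "DEFERRED",
    43: "POISON",
}

def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into set flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16 (assumed AMD layout)
        "error_code": status & 0xFFFF,            # bits 15:0
    }

decoded = decode_status(0xBEA0000000000108)
print(decoded["flags"])
print(hex(decoded["error_code"]), decoded["ext_error_code"])
```

Under those assumptions the value has UC and PCC set, which would be consistent with an uncorrected error that forces a reset rather than one the kernel can recover from. What bank 5 and error code 0x108 mean for this specific CPU would still need the processor documentation.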


AMD support more or less drops the conversation as soon as they see the word "Linux".
Any ideas on how to diagnose this issue further?

Thank you in advance, Gerhard

3 Replies
ghueller
Journeyman III

Could someone from AMD please have a look at this issue?

I just had another one five minutes ago:

Mär 19 08:22:35 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
Mär 19 08:22:35 localhost.localdomain kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
Mär 19 08:22:35 localhost.localdomain kernel: mce: [Hardware Error]: TSC 0 ADDR 7fd8b0e13c9e MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Mär 19 08:22:35 localhost.localdomain kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1584602553 SOCKET 0 APIC 6 microcode 8701013
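
The TIME field in these records appears to be Unix epoch seconds (an assumption, but it lines up with the journal timestamp above if the machine runs on Central European Time). A small sketch to convert it, with `mce_time_utc` being a hypothetical helper name:

```python
from datetime import datetime, timezone

def mce_time_utc(epoch_seconds: int) -> str:
    """Render an MCE TIME value as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

first_report = mce_time_utc(1583508288)   # TIME from the first post
second_report = mce_time_utc(1584602553)  # TIME from the journal excerpt
print(first_report)
print(second_report)
```

This is handy for matching a record against other logs (temperatures, suspend/resume, package updates) from the same moment.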


The bea0000000000108 error with microcode 8701013 may be fixable by booting with the kernel parameter amdgpu.ppfeaturemask=0xffffbffd. See https://bugzilla.kernel.org/show_bug.cgi?id=206903#c135
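
To see what that mask actually changes, here is a short sketch listing which bit positions are zero in `0xffffbffd`, assuming the parameter is a 32-bit feature bitmask. Which power feature each position controls depends on the kernel's amdgpu headers and is not decoded here; `cleared_bits` is a hypothetical helper.

```python
def cleared_bits(mask: int, width: int = 32) -> list:
    """Return the bit positions that are 0 in a bitmask of the given width."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]

disabled = cleared_bits(0xFFFFBFFD)
print(disabled)
```

Compared with an all-ones mask, only two feature bits are switched off, so the workaround is fairly narrow in scope.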
meaningfulusername
Journeyman III

Hi Gerhard,

I see similar behavior with Debian Bullseye:

- HyperX Predator DDR 4 3200
- Radeon R9 270X (Asus)
- Asrock Phantom Gaming 4 (BIOS 4.00)
- Ryzen 9 5800X

The crashes occur several times per week. On the latest crash, I got the following log:

[ 0.183215] mce: [Hardware Error]: Machine check events logged
[ 0.183218] mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108
[ 0.183222] mce: [Hardware Error]: TSC 0 ADDR 7fb5f2fced58 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.183226] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1623936609 SOCKET 0 APIC 1 microcode a201009
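
To check whether the reports in this thread really share one signature, the bank and status can be pulled out of the log lines with a regular expression. This is only a sketch: the pattern is written against the lines quoted in this thread, and the dictionary labels are my own.

```python
import re

# Matches the "CPU n: Machine Check: 0 Bank b: <status>" part of an mce log line.
MCE_BANK = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")

# Lines copied from the posts in this thread; labels are illustrative.
reports = {
    "ghueller boot log": "mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108",
    "ghueller journal": "mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108",
    "meaningfulusername": "mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108",
}

def signature(line: str):
    """Return (bank, status) for an mce line, or None if it does not match."""
    match = MCE_BANK.search(line)
    if match is None:
        return None
    _cpu, bank, status = match.groups()
    return int(bank), status

signatures = {label: signature(line) for label, line in reports.items()}
same_signature = len(set(signatures.values())) == 1
print(signatures)
print(same_signature)
```

The CPU number differs between reports while bank and status stay the same, so the error does not seem tied to one particular core.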

Also, I received the following messages on the terminal a few days ago - though without crash/reboot:
[ 2495.481585] [Hardware Error]: Deferred error, no action required.
[ 2495.481588] [Hardware Error]: CPU:1 (19:21:0) MC27_STATUS[-|-|MiscV|-|-|-|UECC|Deferred|Poison|Scrub]: 0x8948fd894855cc89
[ 2495.481591] [Hardware Error]: IPID: 0x0000000000000000
[ 2495.481592] [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 21
[ 2495.481593] [Hardware Error]: cache level: L1, tx: GEN

While investigating this last batch of errors, I found several reports on the internet about faulty CPUs, where the problem was resolved by replacing the CPU. Unfortunately, I do not have another AM4-compatible CPU that I could try.

Did you have any success with the boot options that were mentioned here?
Also, it might be worth mentioning that we both use the same mainboard. Perhaps this is the origin of the issue...
