EPYC Discussions

h_o_l_g_e_r
Journeyman III

CPU 2 machine check error detected on dual AMD EPYC 7H12

Hello,

I have a server with two EPYC 7H12 CPUs that has been causing me a lot of problems for months, but I have been unable to find out what exactly the cause is. A few days ago, when the system just hung, it logged a machine check error on CPU 2. The server's support told me that this type of error can be caused by any hardware or software. Only this morning I discovered a BERT log entry with the following content:

Oct 2 18:57:07 kernel: BERT: Error records from previous boot:
Oct 2 18:57:07 kernel: [Hardware Error]: event severity: fatal
Oct 2 18:57:07 kernel: [Hardware Error]: Error 0, type: fatal
Oct 2 18:57:07 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Oct 2 18:57:07 kernel: [Hardware Error]: Local APIC_ID: 0xc3
Oct 2 18:57:07 kernel: [Hardware Error]: CPUID Info:
Oct 2 18:57:07 kernel: [Hardware Error]: 00000000: 00830f10 00000000 c3800800 00000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000010: 76f8320b 00000000 178bfbff 00000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Oct 2 18:57:07 kernel: [Hardware Error]: Error Information Structure 0:
Oct 2 18:57:07 kernel: [Hardware Error]: Error Structure Type: cache error
Oct 2 18:57:07 kernel: [Hardware Error]: Check Information: 0x00000000061400ff
Oct 2 18:57:07 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Oct 2 18:57:07 kernel: [Hardware Error]: Operation: 5, instruction fetch
Oct 2 18:57:07 kernel: [Hardware Error]: Level: 0
Oct 2 18:57:07 kernel: [Hardware Error]: Processor Context Corrupt: true
Oct 2 18:57:07 kernel: [Hardware Error]: Uncorrected: true
Oct 2 18:57:07 kernel: [Hardware Error]: Precise IP: false
Oct 2 18:57:07 kernel: [Hardware Error]: Restartable IP: false
Oct 2 18:57:07 kernel: [Hardware Error]: Overflow: false
Oct 2 18:57:07 kernel: [Hardware Error]: Context Information Structure 0:
Oct 2 18:57:07 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Oct 2 18:57:07 kernel: [Hardware Error]: Register Array Size: 0x0080
Oct 2 18:57:07 kernel: [Hardware Error]: MSR Address: 0xc0002050
Oct 2 18:57:07 kernel: [Hardware Error]: Register Array:
Oct 2 18:57:07 kernel: [Hardware Error]: 00000000: 0000000000000000 b2a0000000060150
Oct 2 18:57:07 kernel: [Hardware Error]: 00000010: 0000000000000000 d010000000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000020: 0000000300000079 000500b000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000030: 000000004d000004 0000000000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000050: 0000000000000000 0000000000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000060: 0000000000000000 0000000000000000
Oct 2 18:57:07 kernel: [Hardware Error]: 00000070: 0000000000000000 0000000000000000
Oct 2 18:57:07 kernel: BERT: Total records found: 1
Oct 2 18:57:07 kernel: RAS: Correctable Errors collector initialized.
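
The decoded lines under "Error Structure Type: cache error" appear to be just the bit fields of the Check Information value. A minimal Python sketch, assuming the IA32/X64 cache check layout from the Common Platform Error Record appendix of the UEFI specification (the function name and dictionary keys are made up for illustration):

```python
# Assumed layout (UEFI CPER, IA32/X64 cache check structure):
#   bits 15:0  validity bits, bits 17:16 transaction type,
#   bits 21:18 operation, bits 24:22 cache level,
#   bit 25 processor context corrupt, bit 26 uncorrected,
#   bit 27 precise IP, bit 28 restartable IP, bit 29 overflow.
TRANSACTION_TYPES = {0: "Instruction", 1: "Data Access", 2: "Generic"}
OPERATIONS = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}


def decode_cache_check(check: int) -> dict:
    """Split a 64-bit Check Information value into its named fields."""
    transaction = (check >> 16) & 0x3
    operation = (check >> 18) & 0xF
    return {
        "valid_bits": check & 0xFFFF,
        "transaction_type": transaction,
        "transaction_name": TRANSACTION_TYPES.get(transaction, "unknown"),
        "operation": operation,
        "operation_name": OPERATIONS.get(operation, "unknown"),
        "level": (check >> 22) & 0x7,
        "processor_context_corrupt": bool((check >> 25) & 1),
        "uncorrected": bool((check >> 26) & 1),
        "precise_ip": bool((check >> 27) & 1),
        "restartable_ip": bool((check >> 28) & 1),
        "overflow": bool((check >> 29) & 1),
    }


if __name__ == "__main__":
    # Value copied from the "Check Information" line of the log above.
    for name, value in decode_cache_check(0x00000000061400FF).items():
        print(f"{name}: {value}")
```

Run against the logged value, this gives the same fields the kernel printed (transaction type 0, operation 5, level 0, context corrupt and uncorrected both true), so it adds no new information beyond what is already in the log.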

Is this a clear indication that the CPU is broken? Or is this a software error? I searched the internet for how to decode such BERT output, but could not find anything.
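
The register array from the log can probably be unpacked as an MCA bank dump. The sketch below is only a guess at the structure: it assumes the array holds consecutive 64-bit MSRs starting at the logged MSR address, and that the address follows AMD's scalable MCA layout as used by the Linux kernel (bank registers at 0xc0002000 + 0x10 * bank, in the order CTL, STATUS, ADDR, MISC0, CONFIG, IPID, SYND). The helper name and the selection of status bits are hypothetical:

```python
SMCA_BASE = 0xC0002000   # assumed base of the scalable MCA bank MSRs
SMCA_STRIDE = 0x10       # assumed number of MSR addresses per bank
REG_NAMES = ["CTL", "STATUS", "ADDR", "MISC0", "CONFIG", "IPID", "SYND"]

# Assumed MCA_STATUS flag positions (bit numbers).
STATUS_FLAGS = {
    "Val": 63, "Overflow": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
}


def decode_smca_dump(msr_addr: int, regs: list) -> dict:
    """Name the registers of one MCA bank and split STATUS and IPID."""
    bank, offset = divmod(msr_addr - SMCA_BASE, SMCA_STRIDE)
    if offset != 0:
        raise ValueError("dump does not start at the bank's first register")
    named = dict(zip(REG_NAMES, regs))
    status, ipid = named["STATUS"], named["IPID"]
    return {
        "bank": bank,
        "registers": named,
        "flags": {n: bool((status >> b) & 1) for n, b in STATUS_FLAGS.items()},
        "error_code_ext": (status >> 16) & 0x3F,   # STATUS bits 21:16
        "error_code": status & 0xFFFF,             # STATUS bits 15:0
        "mca_type": (ipid >> 48) & 0xFFFF,         # IPID bits 63:48
        "hardware_id": (ipid >> 32) & 0xFFF,       # IPID bits 43:32
    }


if __name__ == "__main__":
    # First seven 64-bit words of the "Register Array" in the log above.
    dump = [
        0x0000000000000000, 0xB2A0000000060150,
        0x0000000000000000, 0xD010000000000000,
        0x0000000300000079, 0x000500B000000000,
        0x000000004D000004,
    ]
    result = decode_smca_dump(0xC0002050, dump)
    print("bank", result["bank"], result["flags"])
    print("ext error code", hex(result["error_code_ext"]),
          "hwid", hex(result["hardware_id"]),
          "mca type", hex(result["mca_type"]))
```

Under those assumptions the record points at bank 5 with UC and PCC set, extended error code 0x06, hardware ID 0xb0 and MCA type 0x5. Turning those numbers into a named unit and error description would need the AMD Processor Programming Reference for this CPU family.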

The system has the latest BIOS and software and runs with default settings. An identical server with exactly the same hardware and software and the same load runs absolutely stable, without any problems.

Thanks in advance for any help!
