Hello!
I am working with a Tyan TN83-B8251 server with:
2x AMD EPYC 7302,
8x Kingston KSM32RS8/8MEI, ECC enabled (also tried 4x Micron MTA9ADF1G72PZ-3G2E1, same result)
BIOS version: V1.02.B10 (latest)
Debian 10.5, kernel version 4.19.0-10-amd64
I am getting periodic crashes/automatic reboots:
From dmesg:
[ 4.518157] BERT: Error records from previous boot:
[ 4.518158] [Hardware Error]: event severity: fatal
[ 4.518159] [Hardware Error]: Error 0, type: fatal
[ 4.518159] [Hardware Error]: fru_text: ProcessorError
[ 4.518160] [Hardware Error]: section_type: IA32/X64 processor error
[ 4.518161] [Hardware Error]: Local APIC_ID: 0x88
[ 4.518161] [Hardware Error]: CPUID Info:
[ 4.518163] [Hardware Error]: 00000000: 00830f10 00000000 88200800 00000000
[ 4.518164] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[ 4.518165] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 4.518165] [Hardware Error]: Error Information Structure 0:
[ 4.518166] [Hardware Error]: Error Structure Type: cache error
[ 4.518166] [Hardware Error]: Check Information: 0x0000000026c2009f
[ 4.518167] [Hardware Error]: Transaction Type: 2, Generic
[ 4.518167] [Hardware Error]: Operation: 0, generic error
[ 4.518168] [Hardware Error]: Level: 3
[ 4.518168] [Hardware Error]: Processor Context Corrupt: true
[ 4.518169] [Hardware Error]: Uncorrected: true
[ 4.518169] [Hardware Error]: Overflow: true
[ 4.518170] [Hardware Error]: Context Information Structure 0:
[ 4.518170] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
[ 4.518170] [Hardware Error]: Register Array Size: 0x0050
[ 4.518171] [Hardware Error]: MSR Address: 0xc00020e1
[ 4.518171] [Hardware Error]: Register Array:
[ 4.518172] [Hardware Error]: 00000000: fea020000004010b 0400fffdf7049bc0
[ 4.518172] [Hardware Error]: 00000010: d01c0ff500000000 000000050000007f
[ 4.518173] [Hardware Error]: 00000020: 000700b021750300 0001f0a91d470302
[ 4.518173] [Hardware Error]: 00000030: 0000000000000000 0000000000000000
[ 4.518174] [Hardware Error]: 00000040: 0010000000000000 0000000000000000
What could cause these reboots?
Thank you in advance!
Did you find the cause of the problem? I have the same random reboots on 7702s.
No, I did not find it. I checked memory and power, but I have no clue where these crashes are coming from.
Hello bnagy,
After reviewing your dmesg logs, it looks like you have a failing processor. The MSR address 0xc00020e1 points to the MCA_STATUS_L3 register, and decoding its value shows uncorrectable errors in your L3 cache. Please work with your vendor to get an RMA issued.
Hey mbaker_amd!
How did you decode this? Any hints?
MSR Address points to 0xc00002161
Register Array Size 0x0050
Check Information is: 0x0000000006c2001f
I'd be grateful for any hints on how to make sense of these errors.
Hello again bnagy,
You can reference the PPR for the AMD EPYC 7002 processors, located here: https://developer.amd.com/wp-content/resources/55803_B0_PUB_0_91.pdf
Take the following:
[ 4.518171] [Hardware Error]: MSR Address: 0xc00020e1
[ 4.518171] [Hardware Error]: Register Array:
[ 4.518172] [Hardware Error]: 00000000: fea020000004010b 0400fffdf7049bc0
The MSR address reported with the error is 0xc00020e1, which is the MCA_STATUS_L3 register (page 250 of the PPR). You then need to decode the first value in the register array, fea020000004010b, against the bit fields of that register.
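As a rough illustration of that decode, here is a minimal Python sketch. The bit positions are an assumption based on the usual x86 MCA / AMD Scalable MCA status layout (the same one the Linux kernel uses), and the function and table names are made up for this example, so please check every field against the MCA_STATUS_L3 description in the PPR before relying on it.

```python
# Sketch only: decode an MCA_STATUS value into its flag bits.
# ASSUMPTION: bit positions follow the common x86 MCA / AMD Scalable MCA
# status layout; verify them against the MCA_STATUS_L3 table in the PPR.

STATUS_FLAGS = {
    63: "Val",       # register contains a valid error
    62: "Overflow",  # an error was logged while another was still pending
    61: "UC",        # error was not corrected by hardware
    60: "En",        # error reporting was enabled
    59: "MiscV",     # MISC register holds valid data
    58: "AddrV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register holds valid data
    46: "CECC",      # correctable ECC error
    45: "UECC",      # uncorrectable ECC error
    44: "Deferred",
    43: "Poison",
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA_STATUS value into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


# First 64-bit value of the register array from the dmesg output above.
decoded = decode_mca_status(0xFEA020000004010B)
print("flags:         ", ", ".join(decoded["flags"]))
print("error_code_ext:", hex(decoded["error_code_ext"]))
print("error_code:    ", hex(decoded["error_code"]))
```

With the value from this thread, the sketch reports Val, Overflow, UC and PCC among the set flags, which lines up with the "Uncorrected: true", "Overflow: true" and "Processor Context Corrupt: true" lines in the dmesg output. The extended error code still has to be looked up in the L3 error table of the PPR.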
Hope this helps.
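As a cross-check on the log itself, the "Check Information" value in the cache error structure can be unpacked into the same fields that dmesg prints underneath it. The sketch below assumes the IA32/X64 cache check field layout from the UEFI CPER specification; the function name is hypothetical and the bit positions should be confirmed against the spec.

```python
# Sketch only: unpack the "Check Information" value of a cache error structure.
# ASSUMPTION: field positions follow the IA32/X64 cache check layout in the
# UEFI CPER specification; verify them against the spec before relying on this.

def decode_cache_check(check: int) -> dict:
    """Return the cache check fields as a dictionary."""
    return {
        "valid_mask": check & 0xFFFF,              # bits 15:0, which fields are valid
        "transaction_type": (check >> 16) & 0x3,   # bits 17:16
        "operation": (check >> 18) & 0xF,          # bits 21:18
        "level": (check >> 22) & 0x7,              # bits 24:22, cache level
        "processor_context_corrupt": bool((check >> 25) & 1),
        "uncorrected": bool((check >> 26) & 1),
        "precise_ip": bool((check >> 27) & 1),
        "restartable_ip": bool((check >> 28) & 1),
        "overflow": bool((check >> 29) & 1),
    }


# Check Information value from the original dmesg output.
info = decode_cache_check(0x0000000026C2009F)
for field, value in info.items():
    print(f"{field}: {value}")
```

For 0x0000000026c2009f this gives transaction type 2, operation 0, level 3, with processor context corrupt, uncorrected and overflow all set, matching the lines the kernel printed for the original report.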