
Dual AMD EPYC 7302: fatal cache error

Question asked by bnagy on Sep 21, 2020
Latest reply on Oct 6, 2020 by mbaker_amd

Hello!

I am working with a Tyan TN83-B8251 server with the following configuration:
2x AMD EPYC 7302
8x Kingston KSM32RS8/8MEI, ECC enabled (I also tried 4x Micron MTA9ADF1G72PZ-3G2E1, with the same result)
BIOS version: V1.02.B10 (latest)

Debian 10.5, kernel version 4.19.0-10-amd64

I am getting periodic crashes/automatic reboots:

From dmesg:

[    4.518157] BERT: Error records from previous boot:
[    4.518158] [Hardware Error]: event severity: fatal
[    4.518159] [Hardware Error]:  Error 0, type: fatal
[    4.518159] [Hardware Error]:  fru_text: ProcessorError
[    4.518160] [Hardware Error]:   section_type: IA32/X64 processor error
[    4.518161] [Hardware Error]:   Local APIC_ID: 0x88
[    4.518161] [Hardware Error]:   CPUID Info:
[    4.518163] [Hardware Error]:   00000000: 00830f10 00000000 88200800 00000000
[    4.518164] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[    4.518165] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[    4.518165] [Hardware Error]:   Error Information Structure 0:
[    4.518166] [Hardware Error]:    Error Structure Type: cache error
[    4.518166] [Hardware Error]:    Check Information: 0x0000000026c2009f
[    4.518167] [Hardware Error]:     Transaction Type: 2, Generic
[    4.518167] [Hardware Error]:     Operation: 0, generic error
[    4.518168] [Hardware Error]:     Level: 3
[    4.518168] [Hardware Error]:     Processor Context Corrupt: true
[    4.518169] [Hardware Error]:     Uncorrected: true
[    4.518169] [Hardware Error]:     Overflow: true
[    4.518170] [Hardware Error]:   Context Information Structure 0:
[    4.518170] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[    4.518170] [Hardware Error]:    Register Array Size: 0x0050
[    4.518171] [Hardware Error]:    MSR Address: 0xc00020e1
[    4.518171] [Hardware Error]:    Register Array:
[    4.518172] [Hardware Error]:    00000000: fea020000004010b 0400fffdf7049bc0
[    4.518172] [Hardware Error]:    00000010: d01c0ff500000000 000000050000007f
[    4.518173] [Hardware Error]:    00000020: 000700b021750300 0001f0a91d470302
[    4.518173] [Hardware Error]:    00000030: 0000000000000000 0000000000000000
[    4.518174] [Hardware Error]:    00000040: 0010000000000000 0000000000000000
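
For anyone reading the record above, here is a minimal Python sketch of how the Check Information value, the MSR address, and the first Register Array entry can be decoded. The bit positions are assumptions taken from the UEFI CPER IA32/X64 cache check layout and from AMD Scalable MCA register numbering (base 0xC0002000, 0x10 registers per bank); treating the first array entry as the bank's status register is also an assumption. The function names are hypothetical and only for illustration.

```python
# Minimal sketch: decode three values from the BERT record above.
# Assumptions (not stated in the log itself):
#   - "Check Information" follows the UEFI CPER IA32/X64 cache check layout.
#   - The MSR address follows AMD Scalable MCA numbering:
#     0xC0002000 + 0x10 * bank + register offset.
#   - The first Register Array entry is the bank's MCA status register.
# Function names are hypothetical and only for illustration.

CHECK_INFO = 0x0000000026C2009F
MSR_ADDRESS = 0xC00020E1
MCA_STATUS = 0xFEA020000004010B  # first entry of the Register Array

TRANSACTION_TYPES = {0: "Instruction", 1: "Data Access", 2: "Generic"}
SMCA_BASE = 0xC0002000
SMCA_REGS_PER_BANK = 0x10


def bits(value, low, width):
    """Return `width` bits of `value`, starting at bit position `low`."""
    return (value >> low) & ((1 << width) - 1)


def decode_cache_check(check):
    """Return only the fields whose valid bit (bits 0-7) is set."""
    layout = [
        # (valid bit, field name, low bit, width)
        (0, "transaction_type", 16, 2),
        (1, "operation", 18, 4),
        (2, "level", 22, 3),
        (3, "processor_context_corrupt", 25, 1),
        (4, "uncorrected", 26, 1),
        (5, "precise_ip", 27, 1),
        (6, "restartable_ip", 28, 1),
        (7, "overflow", 29, 1),
    ]
    fields = {}
    for valid_bit, name, low, width in layout:
        if bits(check, valid_bit, 1):
            raw = bits(check, low, width)
            fields[name] = bool(raw) if width == 1 else raw
    return fields


def split_smca_msr(address):
    """Split an SMCA MSR address into (bank, register offset within bank)."""
    return divmod(address - SMCA_BASE, SMCA_REGS_PER_BANK)


def decode_mca_status_flags(status):
    """Decode a few of the top flag bits of an MCA status register."""
    names = {63: "valid", 62: "overflow", 61: "uncorrected", 57: "pcc"}
    return {name: bool(bits(status, bit, 1)) for bit, name in names.items()}


if __name__ == "__main__":
    for name, value in decode_cache_check(CHECK_INFO).items():
        print(f"{name}: {value}")
    bank, offset = split_smca_msr(MSR_ADDRESS)
    print(f"MCA bank {bank}, register offset {offset}")
    print(decode_mca_status_flags(MCA_STATUS))
```

Run on the values from the log, this gives the same fields dmesg prints (Transaction Type 2, Operation 0, Level 3, and Processor Context Corrupt, Uncorrected and Overflow all true) and places the MSR in bank 14 at offset 1, which would be the status register under the assumed numbering.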

What could cause these reboots?

Thank you in advance!
