Server Gurus Discussions

bnagy
Journeyman III

Dual AMD EPYC 7302: fatal cache error

Hello!

I am working with a Tyan TN83-B8251 server with:
2x AMD EPYC 7302,
8x Kingston KSM32RS8/8MEI, ECC enabled (also tried 4x Micron MTA9ADF1G72PZ-3G2E1, same result)
BIOS version: V1.02.B10 (latest)

Debian 10.5, kernel version 4.19.0-10-amd64

I am getting periodic crashes/automatic reboots:

pastedImage_1.png

From dmesg:

[    4.518157] BERT: Error records from previous boot:
[    4.518158] [Hardware Error]: event severity: fatal
[    4.518159] [Hardware Error]:  Error 0, type: fatal
[    4.518159] [Hardware Error]:  fru_text: ProcessorError
[    4.518160] [Hardware Error]:   section_type: IA32/X64 processor error
[    4.518161] [Hardware Error]:   Local APIC_ID: 0x88
[    4.518161] [Hardware Error]:   CPUID Info:
[    4.518163] [Hardware Error]:   00000000: 00830f10 00000000 88200800 00000000
[    4.518164] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[    4.518165] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[    4.518165] [Hardware Error]:   Error Information Structure 0:
[    4.518166] [Hardware Error]:    Error Structure Type: cache error
[    4.518166] [Hardware Error]:    Check Information: 0x0000000026c2009f
[    4.518167] [Hardware Error]:     Transaction Type: 2, Generic
[    4.518167] [Hardware Error]:     Operation: 0, generic error
[    4.518168] [Hardware Error]:     Level: 3
[    4.518168] [Hardware Error]:     Processor Context Corrupt: true
[    4.518169] [Hardware Error]:     Uncorrected: true
[    4.518169] [Hardware Error]:     Overflow: true
[    4.518170] [Hardware Error]:   Context Information Structure 0:
[    4.518170] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[    4.518170] [Hardware Error]:    Register Array Size: 0x0050
[    4.518171] [Hardware Error]:    MSR Address: 0xc00020e1
[    4.518171] [Hardware Error]:    Register Array:
[    4.518172] [Hardware Error]:    00000000: fea020000004010b 0400fffdf7049bc0
[    4.518172] [Hardware Error]:    00000010: d01c0ff500000000 000000050000007f
[    4.518173] [Hardware Error]:    00000020: 000700b021750300 0001f0a91d470302
[    4.518173] [Hardware Error]:    00000030: 0000000000000000 0000000000000000
[    4.518174] [Hardware Error]:    00000040: 0010000000000000 0000000000000000

What could cause these reboots?

Thank you in advance!

5 Replies
jjones
Journeyman III

Did you find the cause of the problem? I have the same random reboots on 7702s.

No, I did not find it. I checked memory and power, but I have no clue where this crash is coming from.

Anonymous
Not applicable

Hello bnagy‌,

Your dmesg logs show that you have a failing processor. MSR 0xc00020e1 is the MCA_STATUS_L3 register, and decoding its contents shows uncorrectable errors in your L3 cache. Please work with your vendor to get an RMA issued.


Hey mbaker_amd!

How did you decode this? Any hints?

MSR Address points to 0xc00002161

Register Array Size 0x0050

Check Information is: 0x0000000006c2001f

I'd be grateful for any hints on how to make sense of these errors.

Anonymous
Not applicable

Hello again bnagy‌,

You can reference the PPR for the AMD EPYC 7002 processors, located here:  https://developer.amd.com/wp-content/resources/55803_B0_PUB_0_91.pdf 

Take the following:

[    4.518171] [Hardware Error]:    MSR Address: 0xc00020e1
[    4.518171] [Hardware Error]:    Register Array:
[    4.518172] [Hardware Error]:    00000000: fea020000004010b 0400fffdf7049bc0

The MSR address in the error record is 0xc00020e1, which is the MCA_STATUS_L3 register (page 250 of the PPR). The first 64-bit value in the register array, fea020000004010b, is the content of that register, and that is the value you need to decode.

  • bit 45 = 0x1 --> UECC, or uncorrectable error
  • bits 21:16 = 0x04 --> DataArray
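
If you want to check the bit extraction yourself, here is a minimal Python sketch. Bit 45 (UECC) and bits 21:16 are the fields named above. The other positions (Val, Overflow, UC, PCC) are assumed from the conventional x86 MCA_STATUS layout, and the name ErrorCodeExt is just my label for bits 21:16, so please verify all of them against the PPR before relying on them.

```python
# Sketch only: pull selected fields out of a 64-bit MCA_STATUS value.
# Bit 45 and bits 21:16 are taken from the post; the remaining bit
# positions are ASSUMED from the conventional MCA_STATUS layout.

def bit(value: int, pos: int) -> int:
    """Return the single bit at position pos."""
    return (value >> pos) & 1


def bits(value: int, hi: int, lo: int) -> int:
    """Return the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mca_status(status: int) -> dict:
    return {
        "Val": bit(status, 63),                 # assumed: record is valid
        "Overflow": bit(status, 62),            # assumed: an error was overwritten
        "UC": bit(status, 61),                  # assumed: uncorrected
        "PCC": bit(status, 57),                 # assumed: processor context corrupt
        "UECC": bit(status, 45),                # from the post: uncorrectable ECC
        "ErrorCodeExt": bits(status, 21, 16),   # from the post: 0x04 is DataArray
    }


# First value of the register array in the dmesg output above.
fields = decode_mca_status(0xFEA020000004010B)
for name, value in fields.items():
    print(f"{name:13s} = {value:#x}")
```

For this value, UECC comes out as 1 and bits 21:16 as 0x04, which matches the two bullets above. The assumed Overflow, UC and PCC bits are also set, in line with the "Overflow", "Uncorrected" and "Processor Context Corrupt" lines the kernel printed.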

Hope this helps.
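
The Check Information value asked about earlier can be unpacked the same way. The sketch below assumes the bit layout of the IA32/X64 cache check structure from the UEFI CPER specification (bits 17:16 transaction type, 21:18 operation, 24:22 level, bit 25 context corrupt, bit 26 uncorrected, bit 29 overflow). That layout is my assumption, not something stated in this thread, but it does reproduce the lines the kernel already printed in the first post.

```python
# Sketch only: unpack the "Check Information" word of a cache error record.
# The field positions are ASSUMED from the UEFI CPER IA32/X64 cache check
# structure; cross-check them against the spec before relying on them.

def decode_cache_check(check: int) -> dict:
    return {
        "validation_bits": check & 0xFFFF,          # which fields are valid
        "transaction_type": (check >> 16) & 0x3,    # 2 = generic
        "operation": (check >> 18) & 0xF,           # 0 = generic error
        "level": (check >> 22) & 0x7,               # cache level
        "processor_context_corrupt": bool((check >> 25) & 1),
        "uncorrected": bool((check >> 26) & 1),
        "overflow": bool((check >> 29) & 1),
    }


# Value from the original post, then the value from the follow-up question.
original = decode_cache_check(0x0000000026C2009F)
follow_up = decode_cache_check(0x0000000006C2001F)

for label, decoded in (("original", original), ("follow-up", follow_up)):
    print(label)
    for name, value in decoded.items():
        print(f"  {name}: {value}")
```

Under that assumed layout, both values decode to an uncorrected level 3 cache error with processor context corrupt; the only difference in these fields is that the overflow bit is set in the original post's value and clear in the follow-up one.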
