
Server Gurus Discussions

Jord
Journeyman III

EPYC 7302 crashes, hardware errors, micro-architectural error, MSR registers

I've been struck with several crashes per day on this system, with the following in the kernel log:

[Hardware Error]: event severity: fatal
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x0
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 00200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002011
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: fe800800000c0859 06000000afbae000
[Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
[Hardware Error]:    00000020: 000100b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 1, type: corrected
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x0
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 00200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x0000000000850021
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc00021b1
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: d82000000002080b 0000000000000000
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 0001002e00001e01 000000005a020011
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0010000000000000 0000000000000000
[Hardware Error]:  Error 2, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x1
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 01200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002011
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: fe800800000c0859 06000000afbae440
[Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
[Hardware Error]:    00000020: 000100b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 3, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x2
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 02200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002011
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: fe800800000c0859 06000000afbae840
[Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
[Hardware Error]:    00000020: 000100b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 4, type: fatal
[Hardware Error]:  fru_text: DIMM Locate:  P0D0
[Hardware Error]:   section_type: memory error
[Hardware Error]:   error_status: 0x0000000000040400
[Hardware Error]:   physical_address: 0x0000003223849580
[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 7 row: 36508 column: 592
[Hardware Error]:  Error 5, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x3
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 03200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002011
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: fe800800000c0859 06000000afbaec00
[Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
[Hardware Error]:    00000020: 000100b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 6, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x8
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 08200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc01d00
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 7, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0xa
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 0a200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc05c80
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 8, type: fatal
[Hardware Error]:  fru_text: DIMM# Not Sourced
[Hardware Error]:   section_type: memory error
[Hardware Error]:   error_status: 0x0000000000041000
[Hardware Error]:   node: 0 card: 0
[Hardware Error]:   error_type: 8, parity error
[Hardware Error]:  Error 9, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x10
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 10200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc09d00
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 10, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x12
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 12200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc0dc80
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 11, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x18
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 18200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc11d00
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 12, type: recoverable
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x1a
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 1a200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x000000001c4d0077
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Precise IP: true
[Hardware Error]:     Restartable IP: true
[Hardware Error]:    Instruction Pointer: 0x0000000000000011
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: bc002800000c0135 01000000afc15c80
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 13, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x20
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 20200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002011
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: fe800800000c0859 06000000afbb2140
[Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
[Hardware Error]:    00000020: 000100b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000
[Hardware Error]:  Error 14, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x21
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 21200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: cache error
[Hardware Error]:    Check Information: 0x00000000064d001f
[Hardware Error]:     Transaction Type: 1, Data Access
[Hardware Error]:     Operation: 3, data read
[Hardware Error]:     Level: 1
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc0002001
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: be802800000c0135 01000000afbba300
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 000000b000000000 0000000000000000
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0000000000000000 0000000000000000

The log is clipped because of the 20,000-character post limit. I also regularly get PCIe errors reported on all devices, and even ECC errors, reported on a different DIMM more or less each time. Considering how many of these errors point directly at the CPU, is the CPU the component most likely to be dying?
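For anyone reading the dump above, the "MSR Address" and the first word of each "Register Array" can be decoded by hand. The sketch below is only illustrative and rests on two assumptions that are not stated in the log itself: that the addresses follow the scalable-MCA layout used in the Linux kernel headers (bank N starts at 0xC0002000 + 0x10 * N, with the status register at offset 1), and that the first 64-bit word of each array is that bank's status register. Only the architectural flag bits 63 to 57 are decoded; mapping a bank number to a specific hardware unit is model-specific and is not attempted here.

```python
# Assumption: scalable-MCA MSR layout as defined in the Linux kernel headers.
SMCA_BASE = 0xC0002000   # first MSR of bank 0
SMCA_STRIDE = 0x10       # MSRs reserved per bank

# Architectural MCi_STATUS flag bits (high bits only).
STATUS_FLAGS = {
    63: "VAL",    # record is valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected
    60: "EN",     # reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_msr_address(msr):
    """Return (bank, register offset within the bank) for an MSR address."""
    return divmod(msr - SMCA_BASE, SMCA_STRIDE)


def decode_status_flags(status):
    """Return the names of the flag bits that are set in a status word."""
    return [name for bit, name in STATUS_FLAGS.items() if status >> bit & 1]


# (MSR Address, first Register Array word) pairs copied from the log above.
records = [
    (0xC0002011, 0xFE800800000C0859),  # Error 0
    (0xC00021B1, 0xD82000000002080B),  # Error 1
    (0xC0002001, 0xBC002800000C0135),  # Error 6
    (0xC0002001, 0xBE802800000C0135),  # Error 14
]

for msr, status in records:
    bank, offset = decode_msr_address(msr)
    flags = ", ".join(decode_status_flags(status))
    print(f"MSR {msr:#x}: bank {bank}, offset {offset}, flags: {flags}")
```

Under those assumptions the decoded flags agree with what the kernel already printed for each record (for example, PCC and OVER are set for Error 0 but not for Error 6), which is a useful sanity check before reading anything more into the raw values.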

3 Replies
Anonymous
Not applicable

Hello @Jord ,

Can you run the following and post the output:

  • dmesg | grep -i error

We would like to see more of the errors being reported.
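If the full output is too long to post, a summary may help. The following is a rough sketch, not an official tool: it tallies records by severity and section type, assuming the lines look like the "[Hardware Error]" lines already posted in this thread. The `summarize` function and the `SAMPLE` excerpt are made up for illustration.

```python
import re
from collections import Counter

# Matches record headers such as "Error 0, type: fatal".
# It does not match "Error Type: 5, ..." lines, which lack the record number.
ERROR_RE = re.compile(r"Error (\d+), type: (\w+)")
SECTION_RE = re.compile(r"section_type: (.+)")


def summarize(lines):
    """Count records per (severity, section_type) pair."""
    counts = Counter()
    severity = None
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            severity = match.group(2)
            continue
        match = SECTION_RE.search(line)
        if match and severity is not None:
            counts[(severity, match.group(1).strip())] += 1
            severity = None  # wait for the next record header
    return counts


# Short excerpt in the same format as the log posted above.
SAMPLE = """\
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:  Error 4, type: fatal
[Hardware Error]:   section_type: memory error
[Hardware Error]:  Error 6, type: recoverable
[Hardware Error]:   section_type: IA32/X64 processor error
"""

for (severity, section), count in sorted(summarize(SAMPLE.splitlines()).items()):
    print(f"{count:3d}  {severity:12s} {section}")
```

To run it on a real system, save the `dmesg` output to a file and pass the file's lines to `summarize` in place of `SAMPLE.splitlines()`.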

 

Can you also reseat all of your DIMMs?  And have you reviewed, and ensured you're following, our DIMM population guide?  https://developer.amd.com/wp-content/resources/56502_1.00-PUB.pdf

saywhut
Challenger

Are you running this on an OEM box?  If so, check the out-of-band management controller, such as the Dell iDRAC or HP iLO.  These controllers will give you much more information about hardware failures, and the vendor will typically offer support as well.

Anonymous
Not applicable

Hello @Jord ,

If none of the other suggestions have helped, I suggest reaching out to your platform provider.  From what you've shared so far, this could just as easily be a failing voltage regulator as faulty silicon.  The platform BMC might also provide more details to help with the diagnosis.
