
Finding the source of sporadic crashes (AMD EPYC 7282)

Question asked by davrot on Jul 16, 2020
Latest reply on Jul 30, 2020 by davrot

Hello,

I have a new machine in my cluster that is sporadically crashing every few days and I am not able to determine why. I hope somebody has an idea why this is happening. 

The only hints about what is happening are the log entries I appended at the end. During a crash the screen instantly goes black, then the machine goes to POST and reboots. I hope that somebody has an idea how to find the culprit.

This is the story: We bought 8 machines with one AMD EPYC 7282 CPU each (including 8x Kingston 32GB RAM modules per machine in ASRockRack EPYCD8-2T mainboards) for our machine learning cluster. All machines are identical down to the screws. For 95 days now, seven machines have been running perfectly without any problems. The 8th machine crashes whenever it wants.

All the operating systems (CentOS 8.1) are installed via a script from an internal repository and should be exactly the same. I even reinstalled the sad computer several times.

I tested the RAM (Memtest86+ and in-OS tests) again and again and found no problems with it. I had crashes with all 8 RAM modules installed, with only the first three RAM slots populated, and with only the last four memory slots populated. To keep things consistent and prevent mixing, I always install the RAM modules in their "original" slots, where the assembler put them. Thus, if the RAM is the reason, at least two modules would have to be faulty.

The computers are in a server cold room (19°C) and their tower chassis are stuffed with many large fans (3x 12 cm in the front, 1x 9 cm fan in the back, Noctua NH-U14S TR4-SP3 CPU cooler with two Noctua NF-A15 fans and Thermal Grizzly Kryonaut thermal paste). The CPU temperature (via the sensor tool) reads between 20°C at idle and at most 45°C under 100% load. Thus I rule out overheating.

I ran some torture tests with mprime. No problems for days. Then I just rebooted the system and it crashed directly after booting, while idling.

I have rasdaemon running. After dozens of crashes this is the sad yield:

(base) [davrot@granat5 ~]$ ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'unknown memory' location: 0:0:3:-1 errors: 2

PCIe AER events summary:
    14 Fatal errors: Poisoned TLP

No Extlog errors.
No MCE errors.
(base) [davrot@granat5 ~]$ ras-mc-ctl --errors
Memory controller events:
1 2020-05-06 18:05:22 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -1488575744, grain 1, syndrome 20568
2 2020-05-07 00:27:39 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -847838464, grain 1, syndrome 2939

PCIe AER events:
1 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
2 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
3 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
4 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
5 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
6 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
7 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
8 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
9 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
10 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
11 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
12 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
13 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
14 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP

No Extlog errors.

No MCE errors.

I removed the mentioned memory module but the machine continued to crash anyway. And the Poisoned TLP errors could have been generated by some irregularity with the 10GB network switch we had that day.

The only real hint is the set of log entries I found after the crashes.

Here are the entries from the two latest crashes:

Jul 13 05:05:17 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:  Error 0, type: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:  fru_text: ProcessorError
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Local APIC_ID: 0x4
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   CPUID Info:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000000: 00830f10 00000000 04200800 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Error Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Error Structure Type: cache error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Check Information: 0x000000000614001f
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Transaction Type: 0, Instruction
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Operation: 5, instruction fetch
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Level: 0
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Processor Context Corrupt: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Uncorrected: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Context Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Array Size: 0x0050
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    MSR Address: 0xc0002051
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Array:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000000: baa0000000090150 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000020: 000500b000000000 000000004d000002
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000030: 0000000000000000 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000040: 0000000000000000 0000000000000000
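
As a cross-check of the decoded lines above, here is a minimal Python sketch that splits the "Check Information" value 0x000000000614001f into its bit fields. It assumes the bit layout of the IA32/X64 cache check structure from the UEFI CPER specification (validation bits in bits 0-15, the fields from bit 16 upward); the function and table names are made up for illustration.

```python
# Decode the "Check Information" value from the log into its bit fields.
# Assumption: bit layout of the UEFI CPER IA32/X64 cache check structure.

CHECK_INFO = 0x000000000614001F  # copied from the log entry

TRANSACTION_TYPES = {0: "Instruction", 1: "Data Access", 2: "Generic"}
OPERATIONS = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}

# (validation bit, field name, shift, mask)
LAYOUT = [
    (0, "transaction_type", 16, 0x3),
    (1, "operation", 18, 0xF),
    (2, "level", 22, 0x7),
    (3, "processor_context_corrupt", 25, 0x1),
    (4, "uncorrected", 26, 0x1),
    (5, "precise_ip", 27, 0x1),
    (6, "restartable_ip", 28, 0x1),
    (7, "overflow", 29, 0x1),
]


def decode_cache_check(value):
    """Return only the fields whose validation bit is set."""
    valid = value & 0xFFFF
    fields = {}
    for valid_bit, name, shift, mask in LAYOUT:
        if valid & (1 << valid_bit):
            raw = (value >> shift) & mask
            # Single-bit fields are reported as booleans.
            fields[name] = bool(raw) if mask == 0x1 else raw
    return fields


if __name__ == "__main__":
    decoded = decode_cache_check(CHECK_INFO)
    for name, val in decoded.items():
        print(f"{name}: {val}")
    print("transaction:", TRANSACTION_TYPES[decoded["transaction_type"]])
    print("operation:", OPERATIONS[decoded["operation"]])
```

The result agrees with what the kernel printed: transaction type 0 (Instruction), operation 5 (instruction fetch), level 0, processor context corrupt, uncorrected. The remaining flags (precise IP, restartable IP, overflow) are not marked valid in this value, so the sketch leaves them out.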

Jul 10 23:43:07 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:  Error 0, type: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:  fru_text: ProcessorError
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Local APIC_ID: 0x4
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   CPUID Info:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000000: 00830f10 00000000 04200800 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Error Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Error Structure Type: cache error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Check Information: 0x000000000614001f
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Transaction Type: 0, Instruction
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Operation: 5, instruction fetch
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Level: 0
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Processor Context Corrupt: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Uncorrected: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Context Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Array Size: 0x0050
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    MSR Address: 0xc0002051
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Array:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000000: baa0000000090150 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000020: 000500b000000000 000000004d000002
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000030: 0000000000000000 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000040: 0000000000000000 0000000000000000
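
The register array is identical in both crashes, so it may be worth decoding by hand. The following is only a sketch under assumptions: it presumes the AMD scalable MCA layout (bank registers starting at MSR 0xC0002000, 0x10 MSRs per bank, in the order CTL, STATUS, ADDR, MISC0, CONFIG, IPID, SYND) and the status bit positions used by the Linux kernel for AMD machine checks. The helper names are hypothetical, and the exact meaning of the extended error code would have to be looked up in AMD's documentation for this processor.

```python
# Sketch: name and decode the MSR dump from the log.
# Assumptions: AMD scalable MCA register layout and status bit positions.

MSR_BASE = 0xC0002000
REG_NAMES = ["CTL", "STATUS", "ADDR", "MISC0", "CONFIG", "IPID", "SYND"]

FIRST_MSR = 0xC0002051  # "MSR Address" from the log
REGISTER_ARRAY = [      # first six non-trivial values of "Register Array"
    0xBAA0000000090150, 0x0000000000000000,
    0xD01C0FF500000000, 0x0000000300000079,
    0x000500B000000000, 0x000000004D000002,
]

STATUS_FLAGS = {
    63: "Val", 62: "Overflow", 61: "UC", 60: "En", 59: "MiscV",
    58: "AddrV", 57: "PCC", 55: "TCC", 53: "SyndV",
    46: "CECC", 45: "UECC", 44: "Deferred", 43: "Poison",
}


def name_registers(first_msr, values):
    """Map consecutive MSR values to (bank number, {register name: value})."""
    bank, offset = divmod(first_msr - MSR_BASE, 0x10)
    named = {}
    for i, value in enumerate(values):
        if offset + i < len(REG_NAMES):
            named[REG_NAMES[offset + i]] = value
    return bank, named


def decode_status(status):
    """Split a status value into set flags and the two error code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if status & (1 << bit)]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


def decode_ipid(ipid):
    """Extract the hardware ID and MCA type identifying the reporting block."""
    return {"hwid": (ipid >> 32) & 0xFFF, "mca_type": (ipid >> 48) & 0xFFFF}


if __name__ == "__main__":
    bank, regs = name_registers(FIRST_MSR, REGISTER_ARRAY)
    print("bank:", bank)
    print("status:", decode_status(regs["STATUS"]))
    print("ipid:", {k: hex(v) for k, v in decode_ipid(regs["IPID"]).items()})
```

Under these assumptions the dump starts at the STATUS register of bank 5, the status flags mark the error as valid, uncorrected and context-corrupting with no valid address, and the IPID register reports hardware ID 0xb0 with MCA type 5. Which unit of the core that pair refers to is best confirmed against AMD's processor documentation or the kernel's SMCA tables.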

I am out of my depth and I hope that someone has an idea what is going on. I have been trying to figure out the underlying cause for some weeks now.

If you have experienced something similar, then please accept my heartfelt condolences.

Thanks!
