davrot
Journeyman III

Finding the source of sporadic crashes (AMD EPYC 7282)

Hello,

I have a new machine in my cluster that is sporadically crashing every few days and I am not able to determine why. I hope somebody has an idea why this is happening. 

The only hint about what is happening is the set of log entries appended at the end. During a crash the screen instantly goes black, then the machine goes to POST and reboots. I hope that somebody has an idea how to find the culprit.

This is the story: We bought 8 machines with one AMD EPYC 7282 CPU each (including 8x Kingston 32GB RAM modules per machine in ASRockRack EPYCD8-2T mainboards) for our machine learning cluster. All machines are identical down to the screws. For 95 days now, seven machines have been running perfectly without any problems. The 8th machine crashes whenever it wants.

All the operating systems (CentOS 8.1) are installed via a script from an internal repository and should be exactly the same. I even reinstalled the sad computer several times.

I tested the RAM (Memtest86+ and in-OS tests) again and again and found no problems. I had crashes with all 8 RAM modules installed, with only the first three RAM slots populated, and with only the last four slots populated. To keep things consistent and prevent mixing, I always install the RAM modules in their "original" slots, where the assembler put them. Thus, if the RAM is the cause, at least two modules would have to be faulty.

The computers are in a server cold room (19°C) and their tower chassis are stuffed with many large fans (3x 12 cm in the front, 1x 9cm fan in the back, Noctua NH-U14S TR4-SP3 CPU cooler with two Noctua NF-A15 fans and Thermal Grizzly Kryonaut thermal paste). The CPU temperature (via the sensor tool) reads out between 20°C in idle and max 45°C under 100% load. Thus I rule out overheating. 

I ran torture tests with mprime: no problems for days. Then I simply rebooted the system and it crashed right after booting, while idling.

I have rasdaemon running. After dozens of crashes this is the sad yield:

[->]
(base) [davrot@granat5 ~]$ ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'unknown memory' location: 0:0:3:-1 errors: 2

PCIe AER events summary:
    14 Fatal errors: Poisoned TLP

No Extlog errors.
No MCE errors.
(base) [davrot@granat5 ~]$ ras-mc-ctl --errors
Memory controller events:
1 2020-05-06 18:05:22 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -1488575744, grain 1, syndrome 20568
2 2020-05-07 00:27:39 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -847838464, grain 1, syndrome 2939

PCIe AER events:
1 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
2 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
3 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
4 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
5 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
6 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
7 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
8 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
9 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
10 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
11 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
12 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
13 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
14 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP

No Extlog errors.

No MCE errors.
[<-]

I removed the mentioned memory module but the machine continued to crash anyway. And the Poisoned TLP errors could have been generated by some irregularity with the 10GB network switch we had that day.
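To check whether the logged events are spread over time or form a single burst, the `ras-mc-ctl --errors` output can be grouped by timestamp. The following is a minimal Python sketch, assuming the line layout shown above (index, date, time, UTC offset, message). The helper name `events_per_timestamp` and the shortened `SAMPLE` text are illustrative only.

```python
import re
from collections import Counter

# A few lines copied from the `ras-mc-ctl --errors` output above.
SAMPLE = """\
1 2020-05-06 18:05:22 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -1488575744, grain 1, syndrome 20568
2 2020-05-07 00:27:39 +0200 1 Corrected error(s):  at unknown memory location: 0:0:3:-1, addr -847838464, grain 1, syndrome 2939
1 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
2 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
"""

# index, "YYYY-MM-DD HH:MM:SS", UTC offset, then the message text
LINE = re.compile(
    r"^\d+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+[+-]\d{4}\s+(.*)$"
)

def events_per_timestamp(text):
    """Count how many logged events share each timestamp."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for timestamp, count in sorted(events_per_timestamp(SAMPLE).items()):
        print(timestamp, count)
```

In the full output above, all 14 Poisoned TLP events carry the timestamp 2020-05-08 13:18:12, which fits a single incident rather than a recurring fault.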

The only real hint is these log entries, which I found after the crashes.

Here are the entries from the two latest crashes:

Jul 13 05:05:17 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:  Error 0, type: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:  fru_text: ProcessorError
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Local APIC_ID: 0x4
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   CPUID Info:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000000: 00830f10 00000000 04200800 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Error Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Error Structure Type: cache error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Check Information: 0x000000000614001f
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Transaction Type: 0, Instruction
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Operation: 5, instruction fetch
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Level: 0
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Processor Context Corrupt: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:     Uncorrected: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:   Context Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Array Size: 0x0050
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    MSR Address: 0xc0002051
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    Register Array:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000000: baa0000000090150 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000020: 000500b000000000 000000004d000002
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000030: 0000000000000000 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]:    00000040: 0000000000000000 0000000000000000

Jul 10 23:43:07 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:  Error 0, type: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:  fru_text: ProcessorError
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Local APIC_ID: 0x4
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   CPUID Info:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000000: 00830f10 00000000 04200800 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Error Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Error Structure Type: cache error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Check Information: 0x000000000614001f
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Transaction Type: 0, Instruction
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Operation: 5, instruction fetch
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Level: 0
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Processor Context Corrupt: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:     Uncorrected: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:   Context Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Array Size: 0x0050
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    MSR Address: 0xc0002051
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    Register Array:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000000: baa0000000090150 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000010: d01c0ff500000000 0000000300000079
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000020: 000500b000000000 000000004d000002
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000030: 0000000000000000 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]:    00000040: 0000000000000000 0000000000000000
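The raw register dump in these records can be read a little further. The sketch below is a hedged Python example, not an official decoder. It assumes the Scalable MCA MSR layout used by the Linux kernel (bank n starts at 0xC0002000 + 0x10 * n, and offset 1 is the STATUS register) and that the first quadword of the "Register Array" is the register named by "MSR Address". The function names are made up for illustration.

```python
# Assumption: SMCA MSR layout as used by the Linux kernel
# (bank n starts at 0xC0002000 + 0x10 * n; offset 1 is STATUS).
SMCA_BASE = 0xC0002000
SMCA_REGS = {0: "CTL", 1: "STATUS", 2: "ADDR", 3: "MISC0",
             4: "CONFIG", 5: "IPID", 6: "SYND"}

# Upper MCA_STATUS flag bits common to x86 machine-check banks.
STATUS_FLAGS = {63: "Val", 62: "Overflow", 61: "UC", 60: "En",
                59: "MiscV", 58: "AddrV", 57: "PCC"}

def smca_register(msr):
    """Return (bank number, register name) for an SMCA MSR address."""
    offset = msr - SMCA_BASE
    return offset >> 4, SMCA_REGS.get(offset & 0xF, "unknown")

def status_flags(status):
    """Return the names of the flag bits set in an MCA_STATUS value."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]

if __name__ == "__main__":
    # Values taken from the two crash records above.
    print(smca_register(0xC0002051))         # (5, 'STATUS')
    print(status_flags(0xBAA0000000090150))  # ['Val', 'UC', 'En', 'MiscV', 'PCC']
```

Under these assumptions both crashes were reported by machine-check bank 5, and the set UC and PCC bits agree with the "Uncorrected: true" and "Processor Context Corrupt: true" lines printed by the kernel.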

I am out of my depth and I hope that someone has an idea what is going on. I have been trying to figure out the underlying cause for some weeks now.

If you have experienced something similar, then please accept my heartfelt condolences.

Thanks!


Accepted Solutions
mbaker_amd
Staff

Re: Finding the source of sporadic crashes (AMD EPYC 7282)

Hello davrot,

This appears to be a bad processor.  Please work with your reseller/OEM to process an RMA on the part.


hardcoregames_
Big Boss

Re: Finding the source of sporadic crashes (AMD EPYC 7282)

I would try tweaking the memory timing and see if that fixes the error reports.

Try relaxing the RAM timing incrementally.

Check for a motherboard BIOS update too.

davrot
Journeyman III

Re: Finding the source of sporadic crashes (AMD EPYC 7282)

Thank you very much!

I will investigate the RAM timing / BIOS version first and if this doesn't solve the problem, I will RMA the CPU. 

mgw007
Journeyman III

Re: Finding the source of sporadic crashes (AMD EPYC 7282)

Hi, we are seeing very similar sporadic crashes on two out of five "family related" SuperMicro servers equipped with EPYC 7F72 CPUs. They were all purchased last fall and run the same OS version: Linux host101010 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux.

Three machines are stable; two crash sporadically, as shown below, with patterns very similar to those discussed above. We have ruled out a couple of other theories about what could cause the crashes. Could they also be equipped with "bad processors"?

#10

Mar 23 20:07:52 host101010 kernel: BERT: Error records from previous boot:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: It has been corrected by h/w and requires no further action
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: event severity: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error 0, type: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Local APIC_ID: 0x0
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: CPUID Info:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 00830f10 00000000 00300800 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Check Information: 0x0000000000400005
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Level: 1
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Context Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: MSR Address: 0xc0002181
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 90004000000b0011 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 0000000300000079
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 0001000103b30400 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000040: 0010000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error 1, type: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Local APIC_ID: 0x80
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: CPUID Info:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 00830f10 00000000 80300800 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Check Information: 0x0000000000400005
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Level: 1
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Context Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: MSR Address: 0xc0002181
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 90004000000b0011 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 0000000300000079
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 0001000103b30400 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000040: 0010000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: Freeing unused kernel image (initmem) memory: 2400K
Mar 23 20:07:52 host101010 kernel: Write protecting the kernel read-only data: 22528k
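The "CPUID Info" rows in these records can be decoded to compare the processors across machines. The following Python sketch assumes the rows hold the CPUID leaf 1 output registers in the order EAX, EBX, ECX, EDX, each zero-extended to 64 bits, so that the first 32-bit word is EAX and the third is EBX. The function names are illustrative.

```python
def decode_signature(eax):
    """Split CPUID leaf 1 EAX into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    family, model = base_family, base_model
    if base_family == 0xF:
        # On AMD parts the extended fields apply when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    return family, model, stepping

def initial_apic_id(ebx):
    """Bits 31:24 of CPUID leaf 1 EBX hold the initial APIC ID."""
    return (ebx >> 24) & 0xFF

if __name__ == "__main__":
    # 00830f10 is the first word of every CPUID Info dump in this thread.
    print([hex(v) for v in decode_signature(0x00830F10)])  # ['0x17', '0x31', '0x0']
    # 80300800 is the third word in the second record above.
    print(hex(initial_apic_id(0x80300800)))                # 0x80
```

Under that assumption, the value 00830f10 decodes to family 0x17, model 0x31, stepping 0. The same value appears in the records from the original post, so both sites report the same CPU signature, and the decoded EBX byte matches the "Local APIC_ID" line of each record.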

#11

Mar 30 20:22:43 host101011 kernel: BERT: Error records from previous boot:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: event severity: fatal
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 0, type: corrected
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x50
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: CPUID Info:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: 00830f10 00000000 50300800 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Check Information: 0x0000000020410085
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Transaction Type: 1, Data Access
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Level: 1
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Overflow: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Context Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: MSR Address: 0xc0002001
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: d820000000100015 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 000000070000007d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 000000b000000000 000000003a036d06
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 1, type: fatal
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x51
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: CPUID Info:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: 00830f10 00000000 51300800 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Check Information: 0x000000002641009d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Transaction Type: 1, Data Access
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Level: 1
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Processor Context Corrupt: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Uncorrected: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Overflow: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Context Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: MSR Address: 0xc0002001
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: fea0000000030015 0c005606abb6d000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 000000070000007d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 000000b000000000 000000003d00001c
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: rtc_cmos 00:01: setting system clock to 2021-03-30T18:22:28 UTC (1617128548)
Mar 30 20:22:43 host101011 kernel: Freeing unused kernel image memory: 1664K
Mar 30 20:22:43 host101011 kernel: Write protecting the kernel read-only data: 16384k
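When several such records pile up, it can help to tally which core reported which kind of error. Below is a minimal Python sketch that groups "[Hardware Error]" lines by APIC ID, error structure type, and per-error severity. It assumes the line format shown in this thread. `summarize` is a hypothetical helper and `SAMPLE` is a shortened excerpt of the #11 log.

```python
import re
from collections import Counter

# Shortened excerpt of the #11 record above.
SAMPLE = """\
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 0, type: corrected
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x50
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 1, type: fatal
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x51
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
"""

def summarize(log_text):
    """Count (APIC ID, error structure type, severity) triples."""
    counts = Counter()
    severity = apic = None
    for line in log_text.splitlines():
        if "[Hardware Error]:" not in line:
            continue
        body = line.split("[Hardware Error]:", 1)[1].strip()
        match = re.match(r"Error \d+, type: (\w+)", body)
        if match:
            severity = match.group(1)   # start of a new error entry
        elif body.startswith("Local APIC_ID:"):
            apic = int(body.split(":", 1)[1], 16)
        elif body.startswith("Error Structure Type:"):
            kind = body.split(":", 1)[1].strip()
            counts[(apic, kind, severity)] += 1
    return counts

if __name__ == "__main__":
    for (apic, kind, severity), n in summarize(SAMPLE).items():
        print(f"APIC 0x{apic:x}: {kind} ({severity}) x{n}")
```

Applied to the logs in this thread, both fatal records in the original post name APIC ID 0x4, while the records here name 0x0 and 0x80 on one host and 0x50 and 0x51 on the other.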

 

 
