Hello,
I have a new machine in my cluster that crashes sporadically every few days, and I cannot determine why. I hope somebody has an idea why this is happening.
The only hints about what is happening are the log entries I appended at the end. During a crash the screen instantly goes black, then the machine goes to POST and reboots. I hope that somebody has an idea how to find the culprit.
This is the story: We bought 8 machines with one AMD EPYC 7282 CPU each (including 8x Kingston 32GB RAM modules per machine in ASRockRack EPYCD8-2T mainboards) for our machine learning cluster. All machines are identical down to the screws. For 95 days now, seven machines have been running perfectly without any problems. The 8th machine crashes whenever it wants.
All the operating systems (CentOS 8.1) are installed via a script from an internal repository and should be exactly the same. I even reinstalled the sad computer several times.
I tested the RAM (Memtest86+ and in-OS tests) again and again and found no problems. I had crashes with all 8 RAM modules installed, with only the first three RAM slots populated, and with only the last four memory slots populated. To keep things consistent and prevent mixing, I always install the RAM modules in their "original" slots, where the assembler put them. Since those two subsets share no module, at least two modules would have to be bad if the RAM were the reason.
The computers are in a server cold room (19°C) and their tower chassis are stuffed with many large fans (3x 12 cm in the front, 1x 9cm fan in the back, Noctua NH-U14S TR4-SP3 CPU cooler with two Noctua NF-A15 fans and Thermal Grizzly Kryonaut thermal paste). The CPU temperature (via the sensor tool) reads out between 20°C in idle and max 45°C under 100% load. Thus I rule out overheating.
I did some torture tests with mprime. No problems for days. Then I just reboot the system and it crashes directly after booting, while idling.
I have rasdaemon running. After dozens of crashes this is the sad yield:
[->]
(base) [davrot@granat5 ~]$ ras-mc-ctl --summary
Memory controller events summary:
Corrected on DIMM Label(s): 'unknown memory' location: 0:0:3:-1 errors: 2
PCIe AER events summary:
14 Fatal errors: Poisoned TLP
No Extlog errors.
No MCE errors.
(base) [davrot@granat5 ~]$ ras-mc-ctl --errors
Memory controller events:
1 2020-05-06 18:05:22 +0200 1 Corrected error(s): at unknown memory location: 0:0:3:-1, addr -1488575744, grain 1, syndrome 20568
2 2020-05-07 00:27:39 +0200 1 Corrected error(s): at unknown memory location: 0:0:3:-1, addr -847838464, grain 1, syndrome 2939
PCIe AER events:
1 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
2 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
3 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
4 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
5 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
6 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
7 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
8 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
9 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
10 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
11 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
12 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
13 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
14 2020-05-08 13:18:12 +0200 Fatal error: Poisoned TLP
No Extlog errors.
No MCE errors.
[<-]
I removed the mentioned memory module, but the machine continued to crash anyway. And the Poisoned TLP errors could have been generated by some irregularity with the 10GB network switch we had that day.
The only real hints are these log entries, which I found after the crashes...
Here are the entries from the two latest crashes:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Error 0, type: fatal
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: fru_text: ProcessorError
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Local APIC_ID: 0x4
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: CPUID Info:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000000: 00830f10 00000000 04200800 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Error Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Error Structure Type: cache error
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Check Information: 0x000000000614001f
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Operation: 5, instruction fetch
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Level: 0
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Processor Context Corrupt: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Uncorrected: true
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Context Information Structure 0:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Register Array Size: 0x0050
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: MSR Address: 0xc0002051
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: Register Array:
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000000: baa0000000090150 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000010: d01c0ff500000000 0000000300000079
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000020: 000500b000000000 000000004d000002
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Jul 13 05:05:17 granat5 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: event severity: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Error 0, type: fatal
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: fru_text: ProcessorError
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Local APIC_ID: 0x4
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: CPUID Info:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000000: 00830f10 00000000 04200800 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Error Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Error Structure Type: cache error
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Check Information: 0x000000000614001f
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Operation: 5, instruction fetch
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Level: 0
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Processor Context Corrupt: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Uncorrected: true
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Context Information Structure 0:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Register Array Size: 0x0050
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: MSR Address: 0xc0002051
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: Register Array:
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000000: baa0000000090150 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000010: d01c0ff500000000 0000000300000079
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000020: 000500b000000000 000000004d000002
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Jul 10 23:43:07 granat5 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
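(Editorial aside: the raw register dumps above can be cross-checked against the decoded lines. The following is a minimal sketch, assuming that the first 64-bit word of each "Register Array" is the MCA_STATUS register and that the "MSR Address" follows the AMD scalable-MCA layout, in which bank n's registers start at 0xC0002000 + 0x10*n. The helper names `decode_mca_status` and `smca_bank` are made up for illustration; only the top seven architectural status bits are decoded.)

```python
# Sketch: decode the architectural flag bits of an x86 MCA_STATUS value.
# Assumption: the first qword of the CPER "Register Array" is MCA_STATUS.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}

SMCA_BASE = 0xC0002000  # assumed start of the AMD scalable-MCA MSR space


def decode_mca_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}


def smca_bank(msr_addr):
    """Return (bank number, register offset within the bank)."""
    offset = msr_addr - SMCA_BASE
    return offset // 0x10, offset % 0x10


if __name__ == "__main__":
    # Values copied from the crash log above.
    status = 0xBAA0000000090150
    set_flags = [name for name, on in decode_mca_status(status).items() if on]
    bank, reg = smca_bank(0xC0002051)
    print("flags set:", ", ".join(set_flags))
    print("bank:", bank, "register offset:", reg)
```

Under these assumptions the UC and PCC bits come out set, which agrees with the "Uncorrected: true" and "Processor Context Corrupt: true" lines in the log, and the MSR address points at the status register (offset 1) of bank 5 in both crashes.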
I am out of my depth and I hope that someone has an idea what is going on. I have been trying to figure out the underlying reason for some weeks now.
If you have experienced something similar, then please accept my heartfelt condolences.
Thanks!
Hello davrot,
This appears to be a bad processor. Please work with your reseller/OEM to process an RMA on the part.
I would try tweaking the memory timing and see if that fixes the error reports
Try relaxing the RAM timing incrementally
Check for a motherboard BIOS update too
Hi, we have very similar sporadic crashes on two out of five "family related" EPYC 7F72 equipped SuperMicro servers. They were all purchased last fall and have the same OS version: Linux host101010 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux.
Three machines are stable; two have sporadic crashes like the following, with patterns very similar to what was discussed. We have ruled out a couple of other theories about what could cause the crashes. Could they also be equipped with "bad processors"?
#10
Mar 23 20:07:52 host101010 kernel: BERT: Error records from previous boot:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: It has been corrected by h/w and requires no further action
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: event severity: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error 0, type: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Local APIC_ID: 0x0
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: CPUID Info:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 00830f10 00000000 00300800 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Check Information: 0x0000000000400005
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Level: 1
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Context Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: MSR Address: 0xc0002181
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 90004000000b0011 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 0000000300000079
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 0001000103b30400 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000040: 0010000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error 1, type: corrected
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Local APIC_ID: 0x80
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: CPUID Info:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 00830f10 00000000 80300800 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Check Information: 0x0000000000400005
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Transaction Type: 0, Instruction
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Level: 1
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Context Information Structure 0:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: MSR Address: 0xc0002181
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: Register Array:
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000000: 90004000000b0011 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 0000000300000079
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000020: 0001000103b30400 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: [Hardware Error]: 00000040: 0010000000000000 0000000000000000
Mar 23 20:07:52 host101010 kernel: Freeing unused kernel image (initmem) memory: 2400K
Mar 23 20:07:52 host101010 kernel: Write protecting the kernel read-only data: 22528k
#11
Mar 30 20:22:43 host101011 kernel: BERT: Error records from previous boot:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: event severity: fatal
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 0, type: corrected
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x50
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: CPUID Info:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: 00830f10 00000000 50300800 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Check Information: 0x0000000020410085
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Transaction Type: 1, Data Access
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Level: 1
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Overflow: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Context Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: MSR Address: 0xc0002001
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: d820000000100015 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 000000070000007d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 000000b000000000 000000003a036d06
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error 1, type: fatal
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: fru_text: ProcessorError
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: section_type: IA32/X64 processor error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Local APIC_ID: 0x51
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: CPUID Info:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: 00830f10 00000000 51300800 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Error Structure Type: TLB error
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Check Information: 0x000000002641009d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Transaction Type: 1, Data Access
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Level: 1
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Processor Context Corrupt: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Uncorrected: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Overflow: true
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Context Information Structure 0:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array Size: 0x0050
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: MSR Address: 0xc0002001
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: Register Array:
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000000: fea0000000030015 0c005606abb6d000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000010: d01c0f9b00000000 000000070000007d
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000020: 000000b000000000 000000003d00001c
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000030: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: [Hardware Error]: 00000040: 0000000000000000 0000000000000000
Mar 30 20:22:43 host101011 kernel: rtc_cmos 00:01: setting system clock to 2021-03-30T18:22:28 UTC (1617128548)
Mar 30 20:22:43 host101011 kernel: Freeing unused kernel image memory: 1664K
Mar 30 20:22:43 host101011 kernel: Write protecting the kernel read-only data: 16384k
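(Editorial aside: the "CPUID Info" rows are identical in both reports except for one byte. A minimal sketch of how to read them, assuming the dump holds CPUID leaf 1 with EAX in the first 32-bit word and EBX in the third, and using the AMD convention that the extended family and model fields are added in when the base family is 0xF. The function names are hypothetical.)

```python
# Sketch: decode family/model/stepping from CPUID leaf 1 EAX, and the
# initial APIC ID from EBX. Assumes the log's "CPUID Info" row is laid
# out as EAX, 0, EBX, 0 (each register padded to 64 bits).

def decode_cpuid_eax(eax):
    """Return (family, model, stepping) using the AMD convention."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


def initial_apic_id(ebx):
    """Bits 31:24 of CPUID leaf 1 EBX hold the initial APIC ID."""
    return (ebx >> 24) & 0xFF


if __name__ == "__main__":
    family, model, stepping = decode_cpuid_eax(0x00830F10)
    print(f"family 0x{family:x}, model 0x{model:x}, stepping {stepping}")
    # EBX values taken from the different log records in this thread.
    for ebx in (0x04200800, 0x00300800, 0x80300800, 0x50300800, 0x51300800):
        print(f"EBX 0x{ebx:08x} -> APIC ID 0x{initial_apic_id(ebx):x}")
```

Read this way, every record in the thread reports the same family and model, and the top byte of the third word matches the "Local APIC_ID" line of its record, so the only thing that differs between the dumps is which logical CPU logged the error.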
Thank you very much!
I will investigate the RAM timing / BIOS version first and if this doesn't solve the problem, I will RMA the CPU.
Hello, @davrot would you let me know if replacing CPU solved the problem?
Hi, I didn't replace the CPU. I reduced the RAM clock far below spec (26xx), and then the errors stopped. I had to do that with one other system too. The 6 other systems are running in spec.
Oh, thanks @davrot. That's interesting. It seems you did all kinds of RAM tests but couldn't detect it, right? What was the original speed of the RAM? Do you think it is a RAM quality issue?
Yes, I did RAM tests for days. It showed no problem. Normally I would expect an error. I had a similar case with a 13 Intel system that also couldn't hold the nominal RAM speed, and I got errors immediately during the RAM test.
And I didn't skimp on the RAM, I bought Kingston Server Premier RDIMM DDR4-3200, CL22-22-22, 32GB modules. And everything is the same in the 8 machines, but 2 systems are acting up.
However, it has to be one of these components:
Kingston Server Premier RDIMM DDR4-3200, CL22-22-22, 32 GB
ASRock EPYCD8-2T motherboard
AMD Epyc 7282
And it has to be linked to 'aging', as the second machine started having problems some time later.
I started with 3200 (and 6 machines are still running at that speed). I don't know what the real minimum speed is. I just wanted the system to work. I checked my notes and the two problematic machines are currently running at 2133.
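(Editorial aside: to put the drop from 3200 to 2133 in numbers, here is a rough sketch of the theoretical peak bandwidth of a single DDR4 channel, assuming a standard 64-bit (8-byte) data bus per channel. The function name is made up for illustration, and real throughput on a given CPU will be lower than this upper bound.)

```python
# Sketch: theoretical peak bandwidth of one DDR4 channel.
# Assumption: 64-bit data bus, i.e. 8 bytes moved per transfer.

def peak_channel_bandwidth_gbs(mega_transfers_per_s, bus_bytes=8):
    """Peak bandwidth in GB/s (10^9 bytes/s) for one memory channel."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    for speed in (3200, 2133):
        gbs = peak_channel_bandwidth_gbs(speed)
        print(f"DDR4-{speed}: {gbs:.1f} GB/s per channel")
    loss = 1 - 2133 / 3200
    print(f"relative loss at 2133 vs 3200: {loss:.0%}")
```

Because the bus width is the same in both cases, the relative loss depends only on the transfer rate: running at 2133 gives up roughly a third of the per-channel peak compared with 3200.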
davrot: Any reason you got the 2S-capable processors for single-processor boards? That won't cause any errors, and they're slightly cheaper than the 16-core P model, but...
2 ECC errors in 2 days isn't super alarming for larger RAM chips. I've seen varying estimates on the count per week you should expect from cosmic rays and they're decently high. A failing chip would be throwing them continuously.
Poisoned TLP errors from PCIe devices are a bit odd, but they all seem to have happened in a very short period. Those wouldn't be from the switch. They're usually CRC errors between the processor and the 10Gb networking, and they can only happen on commands that modify data. Since those ports are directly wired to the processor, that's not a good sign either.
Nobody else has mentioned it, but what are your power supplies? What are all the other components? Were the boards / procs used or new?
The main issue is that the actual unrecoverable error (an instruction fetch) should never be happening on a working processor. AFAIK that indicates that the instruction couldn't be pulled from the icache on the processor, which is a problem you'll never work around. 2133 might be bottlenecking the processor enough that it's not running at full speed. Like the AMD employee said, either the processor is bad or the mounting in the socket is bad. The thing that's wrong in the 2/8 machines is faulty hardware.
Zen2 Epyc runs natively at 3200, not 2666 btw.
Where did you buy these from? The setup seems like a really bad idea for an ML cluster (more like something somebody shady might pitch), and even with the overkill HSF (I use the same ones on my Threadripper and Supermicro's BMC barely spins the fans up until I run Prime95). But just the waste heat from the backplate of the 7900XTX running full blast is enough to raise processor temps by 10C if I have the GPU in the nearest slot. Luckily it fits nicely in the bottom slot where it heats up nothing but itself. Anyway I'm betting you don't have a bunch of accelerators because it would probably get the CPU up higher than 45C whether you wanted it to or not because the air passing the heat sink heats up fast.
Going that crazy with the heatsink and even knowing what the thermal compound is somehow strikes me as bizarre... whoever built them shouldn't have been excited about that in a server. Something with the insane surface area of an Epyc 7002 and 130W TDP can be, and usually is, cooled passively in a rackmount setup. The 280W TR Pro is easier to cool than my old 6-core Haswell-E and the ebay'd 10-core Broadwell-E were. The air-cooled heat sink for those (some Cooler Master probably) was around 2x as large as this Noctua and couldn't keep up; they ended up needing liquid to run at their full boost speeds. My fans only crank up if I'm encoding 4 1080p HEVC streams at the same time or there's rendering that can't be pawned off on the GPU happening. Some of the all-integer stuff still won't do it.
Anyway the reason for all that is that I'm kind of wondering if somebody had a bunch of defective low core count processors they pulled from 2s servers that needed tons of memory and memory bandwidth more than CPU power to resell as barebones (they may not have known they were defective or they may have) and a bunch of workstations with the 1S ASRock boards, some fast-for-the-time workstation GPUs and 64/128C processors installed that were being replaced by new models by whatever VFX firm that they needed to pull the processors and video cards from to resell, and decided to try to pawn off the leftovers as machine learning systems. I might be completely off-base there but I don't see much going on that a desktop 16C Ryzen wouldn't have been able to do, probably better unless they're loaded with a bunch of unlisted GPUs.
And it has to be linked to 'aging', as the second machine started having problems some time later.
Processors don't age, not really. Solid state motherboard components generally don't either. We're past the days of capacitors blowing up 5 years in.
The one thing that does is the power supply. If you don't have a server supply with I2C power monitoring to check health you'll need to look for fluctuating voltages and other things, or just try swapping them out. A low wattage PSU might just be having trouble to begin with since the cheap brands consistently over-rate the maximum power. Even though those processors are low power, I've hit issues on similar processors from a PSU that could barely deliver the GPU transients under load combined with terrible input power from the house lines in a newer house. I finally hooked up my Tripp-Lite line conditioner that I don't normally use because it's loud as heck when it switches over from line and it was detecting random spots of <95V coming in on the 120V AC and showing red lights for "won't be able to fix this if it gets worse" according to the manual. Used case + motherboard combos that started off with 7001s in them at the beginning of their lifetime since that board took either and may have had better redundant PSUs in them to begin with (pulled to sell on ebay of course) might have near-dead consumer PSUs in them now.
I have pretty much the same advice to the guy hitting TLB errors, it sounds like possible bad cache somewhere on the CPU and just needs a replacement most likely, but there's the off chance heat or bad power is causing that too.
For hardcoregames: Most server boards don't support adjusting RAM timings aside from clock speed, which isn't what you're talking about. Asus boards might, but they're a completely terrible manufacturer who exposes things like PCIe retimer settings, which just tells me they never bothered setting everything up correctly themselves and are going to try to make end users fix them. Nobody overclocks RAM on servers or adjusts timings. They RMA the bad equipment, because the point of buying server hardware is having stability and not having to deal with BIOSes that consistently overvolt RAM and CPU and can only have 50% of their onboard ports / slots working at any given time without something breaking. On Epyc after gen1 you usually want to be running the RAM at the native clock of the processor, so forcibly downclocking it should only be a stopgap to figure out that you have bad components. It should really always be set to Auto or 3200.