I have a Gigabyte R182-Z92 with two EPYC 7502 processors.
Both Linux and FreeBSD are unstable on it, and the system hangs quite often.
I get the following repeating error messages on Linux:
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:32 (17:31:0) MC27_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00001e01, Syndrome: 0x000000005a000009
[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
The same message on FreeBSD:
MCA: Bank 27, Status 0xd82000000002080b
MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x830f10, APIC ID 64
MCA: CPU 32 COR OVER BUSLG Source ERR I/O
MCA: Misc 0x0
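Both systems report the same status word, 0xd82000000002080b. To cross-check what the kernels print, here is a minimal sketch that decodes it by hand. It assumes the usual x86 MCA status layout plus the AMD-specific SyndV bit (53) and extended error code (bits 21:16); the function and table names are my own, not from any kernel source.

```python
# Sketch: decode an MCA status word by hand.
# Assumption: bit positions follow the common x86 MCi_STATUS layout, plus the
# AMD-specific SyndV bit (53) and the extended error code in bits 21:16.

FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "SyndV": 53,
}

# Field encodings of the bus error code pattern 0000 1PPT RRRR IILL.
LEVEL = ["L0", "L1", "L2", "LG"]              # LL: cache level (LG = generic)
MEM_IO = ["MEM", "RES", "IO", "GEN"]          # II: memory or I/O
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]  # PP: role of this processor
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # RRRR


def decode_mca_status(status):
    """Return the set flags and the error code fields of a status word."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": code,
    }
    # Only decode the sub-fields if the code matches the bus error pattern.
    if (code & 0xF800) == 0x0800:
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "mem_tx": MEM_TX[rrrr] if rrrr < len(MEM_TX) else "reserved",
            "mem_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    decoded = decode_mca_status(0xD82000000002080B)
    print(decoded["flags"])           # UC is absent, i.e. a corrected error
    print(decoded["ext_error_code"])
    print(decoded["bus"])
```

With the value above, the decoded fields agree with both logs: overflow set, UC clear (corrected), extended error code 2, and a generic-level I/O bus error with this core as the source and no timeout.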
The message appears right after boot and then at random intervals. If I run:
echo 0 > /sys/devices/system/cpu/cpu32/online
the messages above appear much less frequently, and the system hangs less often.
When I put the core back online, the error message always appears immediately after this command:
echo 1 > /sys/devices/system/cpu/cpu32/online
The error does not occur when I issue the same commands for any of the other 63 cores.
The system hangs under heavier I/O load (e.g. 10 Gbit/s network traffic). The Mellanox ConnectX-5 card is in NUMA domain 1, which also contains core 32. There are 11 NVMe drives connected.
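For anyone who wants to verify the NUMA placement on a similar machine, here is a rough sketch of how it can be checked from sysfs on Linux. It reads the standard `numa_node` and `cpulist` files; the interface name in the usage note is a hypothetical placeholder, and the helper names are my own.

```python
# Sketch: check whether a CPU shares a NUMA node with a network card (Linux).
from pathlib import Path


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '32-63,96-127' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def device_numa_node(iface):
    """NUMA node of the PCI device behind a network interface (-1 if unknown)."""
    path = Path(f"/sys/class/net/{iface}/device/numa_node")
    return int(path.read_text())


def node_cpus(node):
    """Set of CPU numbers that belong to the given NUMA node."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def shares_node(iface, cpu):
    """True if the given CPU is in the same NUMA node as the interface."""
    node = device_numa_node(iface)
    return node >= 0 and cpu in node_cpus(node)
```

Calling `shares_node("enp65s0f0", 32)` (with the real interface name substituted for the placeholder) should return True if the card and core 32 sit in the same domain.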
Is this a faulty second CPU? I have already contacted my reseller for a replacement.