
2x EPYC 7502 system hang and MCE link errors?

Question asked by mmat on Jun 9, 2020
Latest reply on Jul 4, 2020 by mmat

I have a Gigabyte R182-Z92 with two EPYC 7502 processors.

 

Both Linux and FreeBSD are unstable on this machine and hang quite often.

 

I get the following repeating error messages on Linux:

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:32 (17:31:0) MC27_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00001e01, Syndrome: 0x000000005a000009
[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
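The raw status value can be cross-checked against the decoded text. The following is a minimal Python sketch, assuming the usual x86 MCA status layout (Val 63, Over 62, UC 61, MiscV 59, AddrV 58, PCC 57), the AMD-specific SyndV bit 53 and extended error code in bits 21:16, and the bus-error encoding used by the Linux MCE decoder. The function name and the field names in the returned dictionary are made up for illustration.

```python
# Lookup tables for the bus-error code fields (same order as the kernel strings).
LL = ["RESV", "L1", "L2", "L3/GEN"]          # cache level
II = ["MEM", "RESV", "IO", "GEN"]            # memory or I/O
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # transaction
PP = ["SRC", "RES", "OBS", "GEN"]            # participating processor


def decode_status(status: int) -> dict:
    """Decode a 64-bit MCA status value (assumed AMD layout, see lead-in)."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    ec = status & 0xFFFF                      # low 16 bits: MCA error code
    out = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "miscv": bit(59),
        "addrv": bit(58),
        "pcc": bit(57),
        "syndv": bit(53),
        "xec": (status >> 16) & 0x3F,         # extended error code
        "error_code": ec,
    }
    # Bus errors have the bit pattern 0000 1PPT RRRR IILL.
    if (ec & 0xF800) == 0x0800:
        r4 = (ec >> 4) & 0xF
        out["bus"] = {
            "level": LL[ec & 0x3],
            "mem_io": II[(ec >> 2) & 0x3],
            "mem_tx": RRRR[r4] if r4 < len(RRRR) else "?",
            "part_proc": PP[(ec >> 9) & 0x3],
            "timeout": bool((ec >> 8) & 1),
        }
    return out


if __name__ == "__main__":
    for key, value in decode_status(0xD82000000002080B).items():
        print(key, value)
```

For the value above this gives a corrected (UC clear) error with the overflow flag set, extended error code 2, and a bus error with level L3/GEN, IO, GEN, SRC and no timeout, which agrees with both the Linux and the FreeBSD output.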

The same message on FreeBSD:

MCA: Bank 27, Status 0xd82000000002080b
MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x830f10, APIC ID 64
MCA: CPU 32 COR OVER BUSLG Source ERR I/O
MCA: Misc 0x0

The message appears right after boot and then at random intervals. If I take core 32 offline with:
echo 0 > /sys/devices/system/cpu/cpu32/online

the messages above appear much less frequently and the system hangs less often.

When I bring the core back online, I always get the error message immediately after running:

echo 1 > /sys/devices/system/cpu/cpu32/online


The error does not occur when I do the same with any of the other 63 cores.
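One way to confirm that the reports really are confined to one core is to count them per CPU and bank in a saved kernel log. This is a rough Python sketch that assumes the log lines have the same shape as the Linux output quoted above; the function name `tally_mce` and the regular expression are illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "CPU:32 (17:31:0) MC27_STATUS[...]" and captures CPU and bank.
MCE_LINE = re.compile(r"CPU:(\d+) \([0-9a-fA-F:]+\) MC(\d+)_STATUS")


def tally_mce(log_text: str) -> Counter:
    """Count machine-check reports per (cpu, bank) pair in kernel log text."""
    counts = Counter()
    for match in MCE_LINE.finditer(log_text):
        cpu, bank = int(match.group(1)), int(match.group(2))
        counts[(cpu, bank)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (assumed): dmesg | python3 tally_mce.py
    for (cpu, bank), n in sorted(tally_mce(sys.stdin.read()).items()):
        print(f"CPU {cpu} bank {bank}: {n} report(s)")
```

If every entry comes back as CPU 32, bank 27, that matches what the offline/online test shows.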

 

The system hangs under heavier I/O load (e.g. 10 Gbit/s network traffic). The Mellanox ConnectX-5 card is in NUMA domain 1, which also contains core 32. There are 11 NVMe drives connected.
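The NUMA placement can be read from sysfs on Linux. The Python sketch below assumes the standard `/sys/devices/system/node/nodeN/cpulist` and `/sys/class/net/<iface>/device/numa_node` files; the helper names are hypothetical, and the interface name has to be supplied by the caller because the post does not give it.

```python
from pathlib import Path


def parse_cpulist(text: str) -> set:
    """Expand a sysfs CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_cpus(node: int, root: str = "/sys/devices/system/node") -> set:
    """CPUs that belong to the given NUMA node."""
    return parse_cpulist(Path(root, f"node{node}", "cpulist").read_text())


def nic_node(iface: str, root: str = "/sys/class/net") -> int:
    """NUMA node of a network interface's PCI device (-1 if unknown)."""
    return int(Path(root, iface, "device", "numa_node").read_text())
```

For example, `32 in node_cpus(nic_node(iface))` would be true when the card and core 32 share a NUMA node, where `iface` is whatever name the ConnectX-5 port has on the system.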

 

Is this a faulty second CPU? I have already contacted my reseller for a replacement.
