Server Gurus Discussions

mmat
Journeyman III

2x EPYC 7502 system hang and MCE link errors?

I have a Gigabyte R182-Z92 with two EPYC 7502 processors.

Both Linux and FreeBSD run unstably on it and hang the system quite often.

I get the following repeating error messages on Linux:

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:32 (17:31:0) MC27_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00001e01, Syndrome: 0x000000005a000009
[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
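
For anyone who wants to check the raw values by hand, here is a minimal Python sketch that decodes the status word and the IPID from the log above. The bit positions and the short field names are assumptions taken from how the Linux kernel's AMD MCE decoder treats Scalable MCA (family 17h) parts; the function names are made up for this illustration and are not part of any tool.

```python
# Sketch: decode an AMD MCA status word and IPID by hand.
# Bit positions and names are assumptions based on the Linux kernel's
# AMD MCE decoder for Scalable MCA systems, not an official AMD reference.

STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
}

# Field names for the 16-bit "bus error" code, as printed by Linux.
LL = ["RESV", "L1", "L2", "L3/GEN"]
II = ["MEM", "RESV", "IO", "GEN"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
PP = ["SRC", "RES", "OBS", "GEN"]


def decode_status(status):
    """Split a 64-bit MCA status word into flags and error codes."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": not ((status >> 61) & 1),   # UC bit clear
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


def decode_bus_error(code):
    """Decode a bus-error code of the form 0000 1PPT RRRR IILL."""
    if (code & 0xF800) != 0x0800:
        raise ValueError("not a bus error code")
    rrrr = (code >> 4) & 0xF
    return {
        "cache_level": LL[code & 3],
        "mem_io": II[(code >> 2) & 3],
        "mem_tx": RRRR[rrrr] if rrrr < len(RRRR) else "RESV",
        "timeout": bool((code >> 8) & 1),
        "part_proc": PP[(code >> 9) & 3],
    }


def decode_ipid(ipid):
    """Split the IPID register into MCA type, hardware ID and instance ID."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,
        "hardware_id": (ipid >> 32) & 0xFFF,
        "instance_id": ipid & 0xFFFFFFFF,
    }


if __name__ == "__main__":
    status = decode_status(0xD82000000002080B)
    print(status)
    print(decode_bus_error(status["error_code"]))
    print(decode_ipid(0x0001002E00001E01))
```

For the values in the log, this reports Over, MiscV and SyndV set with UC clear (a corrected error), extended error code 2, and a bus error code of 0x080b that breaks down into L3/GEN, IO, GEN, SRC with no timeout, which agrees with what the kernel printed. The IPID splits into hardware ID 0x2e and MCA type 1.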

The same message on FreeBSD:

MCA: Bank 27, Status 0xd82000000002080b
MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x830f10, APIC ID 64
MCA: CPU 32 COR OVER BUSLG Source ERR I/O
MCA: Misc 0x0

The message appears after boot and then at random intervals. If I run:
echo 0 > /sys/devices/system/cpu/cpu32/online

the messages above appear much less frequently and the system hangs less often.

When I put the core back online, I always get the error message immediately after the command below:

echo 1 > /sys/devices/system/cpu/cpu32/online


The error does not occur when I issue the same commands for any of the remaining 63 cores.
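
The offline/online test can also be scripted so it is easy to repeat for every core. This is only a sketch: the helper name `set_cpu_online` and the `root` parameter are made up for illustration (the parameter exists so the function can be tried against a scratch directory), and writing to the real sysfs files needs root.

```python
# Sketch: toggle a CPU core offline/online through sysfs.
# The sysfs path is the same one used in the echo commands above;
# the function itself is a hypothetical helper, not an existing tool.
from pathlib import Path


def set_cpu_online(cpu, online, root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpu<N>/online and return the path that was written."""
    path = Path(root) / f"cpu{cpu}" / "online"
    path.write_text("1" if online else "0")
    return path


if __name__ == "__main__":
    # Cycle one core; run as root and watch the kernel log for MCE messages.
    set_cpu_online(32, False)
    set_cpu_online(32, True)
```

Looping the same two calls over the other core numbers and checking the kernel log after each one reproduces the comparison described above.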

The system hangs under heavier I/O load (e.g. 10 Gbit/s network traffic). The Mellanox ConnectX-5 card is in NUMA domain 1, which is the domain that contains core 32. There are 11 NVMe drives connected.

Is this a faulty second CPU? I have already contacted my reseller for replacement.

0 Likes
14 Replies
Anonymous
Not applicable

Hello mmat‌,

I am sorry you are experiencing some issues. What version of Linux are you running? The EPYC 7002 series is supported on several versions, but only on fairly recent ones. We do not generally support FreeBSD.

Have you tried swapping CPU 0 and CPU 1 to see if it follows the CPU vs the socket?

0 Likes

Hello mbaker_amd‌,

We are very much into bleeding-edge software, so I am running Ubuntu 20.04 with the 5.6.0-1011-oem kernel (based on Linux 5.6.13), which is very recent. Our vendor's technicians are going to replace the second CPU, and I will redo the tests remotely right afterwards, while the technician is still on site with one of our staff. I understand that the probability of something like this happening after QAT is low, but someone is always going to hit the "jackpot".

This of course has no effect on my opinion of how great the EPYC Rome processors are. AMD did a great job here and EPYC 7xx2 remains our choice for the next servers. We originally wanted a Supermicro AS-1114S-WN10RT, but Supermicro will only start shipping these devices in Q3 2020. The RS500A-E10-RS12U might be an option, too.

0 Likes
Anonymous
Not applicable

Please let us know if the replacement works, and glad you like the processors so much.

0 Likes

Thanks, I am going to get more information on Friday.
Btw, I can imagine that a 7502 with a defective core could easily be turned into an operational 7402 by AMD.

0 Likes

mmat wrote:

Thanks, I am going to get more information on Friday.
Btw, I can imagine that a 7502 with a defective core could easily be turned into an operational 7402 by AMD.

AMD tests its processors before they are even mounted into the package. After packaging they are tested again. The industry generally runs lots of quality assurance tests.

Binning with devices is common. Video cards and processors alike are all over the dial. Look into semiconductor manufacturing if you want to learn more.

0 Likes

Looking over your post, it suggests you are in need of a better distribution. AMD supports Ubuntu and CentOS, which are both major distributions. Azure also offers Ubuntu and CentOS for clients who need Linux virtual machines.

Mellanox LAN cards usually work fine with mainstream operating systems. I use Intel but I have seen lots of Mellanox hardware. Mellanox is often found under Hyper-V with some rack-based hardware. Hardware varies depending on the vendor.

I would also appreciate more info on the workload, as there is some suggestion that the machine is at its thermal limit. There are some steps to take, such as a more robust cooling solution to handle the thermal load.

0 Likes

Hello hardcoregames™‌,

We are doing caching on high-speed NVMe devices. To have all hardware properly supported, we are using Ubuntu 20.04 with the official 5.6.0 OEM kernel series (currently 5.6.0-1011-oem). The server is located in a properly cooled datacenter, and the 1U cooling solution from Gigabyte should not be a problem here.

0 Likes

About the only workloads I can think of that need that much resource are web servers and their databases.

Facebook can't use SSD, too slow. They built storage for their web server using all dynamic RAM. They are rich enough that they could probably afford static RAM chips, which are even faster.

AMD's server processors are quite adequate for most workloads. AMD has lots of cores and PCIe lanes.

So the server image was posted mostly so we are looking at the same hardware.

If your workload is that demanding, stuff the box full of 64GB memory sticks, which can cache the SSDs on the front panel nicely. The less workload on the SSD array, the better the performance.

Micron has 30TB SSD products for machines like this that can hold a lot of database etc.

I guess you are using a PCIe Mellanox 10GBASE-T card.

0 Likes
mmat
Journeyman III

I now have more information. This is what the system reports after a spurious reboot:

BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x0
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 00200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc00021b1
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: faa000000002080b 0000000000000000
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 0001002e00001e01 000000005d000002
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0010000000000000 0000000000000000
[Hardware Error]:  Error 1, type: fatal
[Hardware Error]:  fru_text: ProcessorError
[Hardware Error]:   section_type: IA32/X64 processor error
[Hardware Error]:   Local APIC_ID: 0x40
[Hardware Error]:   CPUID Info:
[Hardware Error]:   00000000: 00830f10 00000000 40200800 00000000
[Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[Hardware Error]:   Error Information Structure 0:
[Hardware Error]:    Error Structure Type: micro-architectural error
[Hardware Error]:    Check Information: 0x00000000009d0027
[Hardware Error]:     Error Type: 5, Internal Unclassified
[Hardware Error]:     Processor Context Corrupt: true
[Hardware Error]:     Uncorrected: true
[Hardware Error]:     Overflow: true
[Hardware Error]:   Context Information Structure 0:
[Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[Hardware Error]:    Register Array Size: 0x0050
[Hardware Error]:    MSR Address: 0xc00021b1
[Hardware Error]:    Register Array:
[Hardware Error]:    00000000: faa000000002080b 0000000000000000
[Hardware Error]:    00000010: d01c0ff500000000 000000070000007d
[Hardware Error]:    00000020: 0001002e00001e01 000000005d00000a
[Hardware Error]:    00000030: 0000000000000000 0000000000000000
[Hardware Error]:    00000040: 0010000000000000 0000000000000000
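
As a cross-check that this record refers to the same bank as the earlier corrected errors, the MSR address can be mapped back to a bank number. The sketch below assumes the Scalable MCA register layout as defined in the Linux kernel headers (banks start at 0xC0002000, 0x10 registers apart); `smca_msr_to_bank` is a made-up helper name.

```python
# Sketch: map a Scalable MCA MSR address to (bank, register name).
# Base address and register offsets are assumptions taken from the
# MSR_AMD64_SMCA_* definitions in the Linux kernel headers.

SMCA_MSR_BASE = 0xC0002000
SMCA_REG_NAMES = {
    0x0: "CTL", 0x1: "STATUS", 0x2: "ADDR", 0x3: "MISC0", 0x4: "CONFIG",
    0x5: "IPID", 0x6: "SYND", 0x8: "DESTAT", 0x9: "DEADDR",
}


def smca_msr_to_bank(addr):
    """Return (bank number, register name) for an SMCA MSR address."""
    offset = addr - SMCA_MSR_BASE
    if offset < 0:
        raise ValueError("address is below the SMCA MSR range")
    return offset >> 4, SMCA_REG_NAMES.get(offset & 0xF, "unknown")


if __name__ == "__main__":
    print(smca_msr_to_bank(0xC00021B1))
```

Under that layout, 0xc00021b1 is the status register of bank 27, the same bank that reported the Link Errors. The first register value, 0xfaa000000002080b, carries the same low error code (0x080b) and extended error code (2) as the corrected errors, but this time with the uncorrected and processor-context-corrupt bits set, and the IPID value 0x0001002e00001e01 appears again in the array.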
0 Likes

These errors suggest RAM issues

0 Likes
Anonymous
Not applicable

Hello mmat‌,

Have you swapped the processors yet? If so, did the other issues go away?

  1. Make sure the processors are properly torqued, and that the heat sink is torqued as well.
  2. As hardcoregames™ alluded to, perhaps the DIMMs could be an issue here.
0 Likes

I have found a second case with the same issue as mine. They also have a Gigabyte system (H262-Z61-00) with two EPYC 7502 processors. Their RAM is different from mine and clocked differently, but their BIOS has a very similar ChangeLog. Could this be a server vendor BIOS issue, e.g. wrong timings?


https://lkml.org/lkml/2020/5/29/401

0 Likes
Anonymous
Not applicable

Yes, it could be time for you to open a support ticket with Gigabyte.

0 Likes

Gigabyte has provided us, via our vendor, with a non-official BIOS (version 16, AGESA 1.0.0.7). The BIOS did not completely resolve the issue. The system still reported Link Errors and was still unstable, although it survived more load than before. We are now going to replace the system with a single-CPU ASUS RS500A-E10-RS12U instead (we will reuse the CPU, RAM and SSD drives). There are not many systems with 10 NVMe drives around, and vendors like thomas-krenn.com sell such systems, so it should be a safer bet. Should the performance be insufficient (multi-process/multi-thread performance matters to us), we will upgrade the CPU from a 7502 to a 7702, or maybe wait for the Rome price drops that will accompany the release of Milan.

0 Likes