Server Gurus Discussions

sho1sho1sho1
Adept I

AMD Genoa 9534 cluster - L1 and L3 uncorrectable ECC errors running amd-zen-hpl-avx2-2023_01.tar.gz

I have a cluster of 18 nodes with AMD Genoa 9534 processors, running Rocky Linux 9.1 updated to the latest kernel.
I started running HPL (Linpack) on each individual node using amd-zen-hpl-avx2-2023_01.tar.gz. The longer HPL runs, the more uncorrectable ECC errors appear.
After about 2 hours of running HPL, about 5 nodes have logged one or more uncorrectable ECC errors.
Are these actual DDR5 DIMM errors that require DIMM replacement, or are they processor errors?

May 3 12:29:57 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 12:29:57 localhost kernel: [Hardware Error]: Deferred error, no action required.
May 3 12:29:57 localhost kernel: [Hardware Error]: CPU:2 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x942030000400011b
May 3 12:29:57 localhost kernel: [Hardware Error]: Error Addr: 0x00000000f5ca39c0
May 3 12:29:57 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c0019
May 3 12:29:57 localhost kernel: [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x90d399430b800000
May 3 12:29:57 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
May 3 12:29:57 localhost kernel: mce: Uncorrected hardware memory error in user-access at bd57a63c0
May 3 12:29:57 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 3 12:29:57 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 12:29:57 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 3 12:29:57 localhost kernel: [Hardware Error]: CPU:35 (19:11:1) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
May 3 12:29:57 localhost kernel: [Hardware Error]: Error Addr: 0x0000000bd57a63c0
May 3 12:29:57 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c0019
May 3 12:29:57 localhost kernel: [Hardware Error]: IPID: 0x001000b02186ae00
May 3 12:29:57 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
May 3 12:29:57 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 3 12:29:57 localhost kernel: Memory failure: 0xbd57a6: Sending SIGBUS to amd-zen-hpl-avx:108649 due to hardware memory corruption
May 3 13:01:36 localhost systemd[1]: Rebuild Hardware Database was skipped because of a failed condition check (ConditionNeedsUpdate=/etc).
May 3 13:01:38 localhost kernel: Hardware name: Giga Computing R283-Z90-AAD1-000/MZ93-FS0-000, BIOS F08a 03/30/2023
May 3 13:01:38 localhost kernel: Hardware name: Giga Computing R283-Z90-AAD1-000/MZ93-FS0-000, BIOS F08a 03/30/2023
May 3 13:01:46 localhost systemd[1]: Starting Hardware Monitoring Sensors...
May 3 13:01:46 localhost systemd[1]: Finished Hardware Monitoring Sensors.
May 3 13:01:46 localhost NetworkManager[17912]: <info> [1683136906.8920] manager[0x55be79b64030]: rfkill: Wi-Fi hardware radio set enabled
May 3 13:01:46 localhost NetworkManager[17912]: <info> [1683136906.8920] manager[0x55be79b64030]: rfkill: WWAN hardware radio set enabled
May 3 14:14:08 localhost kernel: Hardware name: Giga Computing R283-Z90-AAD1-000/MZ93-FS0-000, BIOS F08a 03/30/2023
May 3 21:09:32 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 21:09:32 localhost kernel: [Hardware Error]: Deferred error, no action required.
May 3 21:09:32 localhost kernel: mce: Uncorrected hardware memory error in user-access at 4b449ea500
May 3 21:09:32 localhost kernel: [Hardware Error]: CPU:67 (19:11:1) MC22_STATUS[Over|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xd42030000400011b
May 3 21:09:32 localhost kernel: [Hardware Error]: Error Addr: 0x000000023f0d3200
May 3 21:09:32 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c005e
May 3 21:09:32 localhost kernel: [Hardware Error]: IPID: 0x0000109600750f00, Syndrome: 0x1fbdb6940b800008
May 3 21:09:32 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 3 21:09:32 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 3 21:09:32 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 21:09:32 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 3 21:09:32 localhost kernel: [Hardware Error]: CPU:126 (19:11:1) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
May 3 21:09:32 localhost kernel: [Hardware Error]: Error Addr: 0x0000004b449ea500
May 3 21:09:32 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c005e
May 3 21:09:32 localhost kernel: [Hardware Error]: IPID: 0x001010b0228cae00
May 3 21:09:32 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
May 3 21:09:32 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 3 21:09:32 localhost kernel: Memory failure: 0x4b449ea: Sending SIGBUS to amd-zen-hpl-avx:3434250 due to hardware memory corruption
May 3 22:19:23 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 22:19:23 localhost kernel: [Hardware Error]: Deferred error, no action required.
May 3 22:19:23 localhost kernel: [Hardware Error]: CPU:2 (19:11:1) MC21_STATUS[Over|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xd42030000400011b
May 3 22:19:23 localhost kernel: [Hardware Error]: Error Addr: 0x0000000247fcf700
May 3 22:19:23 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c0019
May 3 22:19:23 localhost kernel: [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0xf3e2a8df0b800008
May 3 22:19:23 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 3 22:19:23 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 3 22:19:24 localhost kernel: mce: Uncorrected hardware memory error in user-access at 1bafdb9e00
May 3 22:19:24 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 3 22:19:24 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 3 22:19:24 localhost kernel: [Hardware Error]: CPU:12 (19:11:1) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
May 3 22:19:24 localhost kernel: [Hardware Error]: Error Addr: 0x0000001bafdb9e00
May 3 22:19:24 localhost kernel: [Hardware Error]: PPIN: 0x02b61d4cd51c0019
May 3 22:19:24 localhost kernel: [Hardware Error]: IPID: 0x001000b02208ae00
May 3 22:19:24 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
May 3 22:19:24 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 3 22:19:24 localhost kernel: Memory failure: 0x1bafdb9: Sending SIGBUS to amd-zen-hpl-avx:3483880 due to hardware memory corruption
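
For readers trying to answer the DIMM-versus-processor question from a log like the one above, the following is a minimal sketch that decodes the MCx_STATUS values into the same flag names the kernel prints, and shows how the "user-access at" address maps to the page frame in the "Memory failure" line. The bit positions are an assumption taken from the flag layout used by the Linux AMD MCE decoder and should be checked against the AMD PPR for the processor; the helper names `decode_status` and `page_frame` are hypothetical.

```python
# Assumed bit positions for AMD MCA status flags (verify against the AMD PPR).
# Names match the labels the kernel prints inside MCx_STATUS[...].
FLAGS = {
    62: "Over",      # an earlier error was overwritten
    61: "UE",        # uncorrected error (UC bit)
    59: "MiscV",
    58: "AddrV",
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",
    46: "CECC",      # corrected ECC error
    45: "UECC",      # uncorrected ECC error
    44: "Deferred",
    43: "Poison",    # error came from consuming poisoned data
    40: "Scrub",
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flags set in an MCx_STATUS value, high bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def page_frame(addr: int, page_shift: int = 12) -> int:
    """Page frame number for a physical address, assuming 4 KiB pages."""
    return addr >> page_shift


if __name__ == "__main__":
    # Values copied from the log in this post.
    for label, value in [
        ("MC21_STATUS", 0x942030000400011B),
        ("MC22_STATUS", 0xD42030000400011B),
        ("MC0_STATUS", 0xBC00080001010135),
    ]:
        print(label, decode_status(value))
    print(hex(page_frame(0xBD57A63C0)))
```

Read this way, the MC21/MC22 entries are UECC and Deferred errors reported by the Unified Memory Controller bank, while the MC0 (Load Store Unit) entries carry the Poison flag, which suggests the core was consuming data already marked bad rather than reporting a separate L1 cache fault. That is an interpretation of the flags, not a vendor diagnosis.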

3 Replies
shrjoshi
Staff

Hello @sho1sho1sho1 

Thank you for contacting AMD Server Gurus.

The binary you are using is compiled for a specific operating system. We have tested the executable on the OS mentioned in the document and it works fine for us. Can you please check the dependencies listed at the link below on your end?

Link : https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications/zen-hpl.htm...

For any hardware-related errors, please check with your admin.

shrjoshi
Staff

Hello @sho1sho1sho1 

Can you please clarify if you are observing this issue with this particular tar file or for other applications as well?


Thank you for your help!

I have narrowed the cause down to the DIMMs.

We have Micron DIMMs with Rambus interface chips, and these get many uncorrectable ECC errors when running at room temperatures above 85F (DIMM temperature around 60C).

Our vendors are testing Micron DIMMs with IDT interface chips, and they are not getting any uncorrectable ECC errors.

We swapped all the 16GB Micron DIMMs (Rambus) for 32GB Micron DIMMs (IDT), and we no longer get any uncorrectable ECC errors, even at a room temperature of 95F.

I hope this helps someone else who encounters these uncorrectable ECC errors.
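
For anyone triaging the same symptom across several nodes, here is a rough sketch that counts machine-check reports per MCA bank from saved kernel log lines, so memory-controller banks (MC21 and MC22 in the log above) can be separated from core banks such as MC0. The regular expression is an assumption based only on the line format shown in this thread, and `tally_banks` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... CPU:2 (19:11:1) MC21_STATUS[...]: 0x942030000400011b
# The pattern is inferred from the log format in this thread only.
STATUS_RE = re.compile(
    r"CPU:(\d+) \(\S+\) MC(\d+)_STATUS\[([^\]]*)\]: (0x[0-9a-fA-F]+)"
)


def tally_banks(lines):
    """Count MCx_STATUS reports per MCA bank number."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Sample lines copied from the log posted in this thread.
SAMPLE = [
    "May 3 12:29:57 localhost kernel: [Hardware Error]: CPU:2 (19:11:1) "
    "MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x942030000400011b",
    "May 3 12:29:57 localhost kernel: [Hardware Error]: CPU:35 (19:11:1) "
    "MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135",
    "May 3 21:09:32 localhost kernel: [Hardware Error]: CPU:67 (19:11:1) "
    "MC22_STATUS[Over|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xd42030000400011b",
    "May 3 22:19:23 localhost kernel: [Hardware Error]: CPU:2 (19:11:1) "
    "MC21_STATUS[Over|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xd42030000400011b",
]

if __name__ == "__main__":
    for bank, count in sorted(tally_banks(SAMPLE).items()):
        print(f"MC{bank}: {count}")
```

Running the same tally on each node's saved log gives a quick per-node comparison before and after a DIMM swap.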
