mak57
Journeyman III

EPYC Family 17h MCE Bank Description

I've been bounced around AMD support and the forums trying to find the decoder ring for the failure below. What is bank 17? The information I found lists bank 17 as reserved and bank 0 as LS. What is LS? Is there a description somewhere of what the various banks are? Do we just replace the processor for any failure that reports bank errors? Thank you.

2020-12-27T02:49:10.14663 Socket# 0, Ccd# 0, Ccx# 0, Core# 1, Thread# 0
2020-12-27T02:49:10.14664 MCA Bank Number : 17
2020-12-27T02:49:10.14664 MCA_STATUS : 0xDC2030000000011B
2020-12-27T02:49:10.14665 MCA_ADDR : 0x04000001A3239E40
2020-12-27T02:49:10.14666 MCA_SYND : 0x515988080B800001
2020-12-27T02:49:10.14667 MCA_MISC0 : 0xD01C0FFD01000000
2020-12-27T02:49:10.14667 MCA_MISC1 : 0xD01C0FF501000000
2020-12-27T02:49:10.14668 MCA_IPID : 0x0000009600250F00

2020-12-27T02:49:10.14704 Socket# 0, Ccd# 0, Ccx# 0, Core# 3, Thread# 0
2020-12-27T02:49:10.14705 MCA Bank Number : 0
2020-12-27T02:49:10.14706 MCA_STATUS : 0xBC002800000C0135
2020-12-27T02:49:10.14706 MCA_ADDR : 0x0100000B96473D40
2020-12-27T02:49:10.14707 MCA_SYND : 0x0000000000000000
2020-12-27T02:49:10.14708 MCA_MISC0 : 0xD01C0FF500000000
2020-12-27T02:49:10.14709 MCA_MISC1 : 0x0000000000000000
2020-12-27T02:49:10.14710 MCA_IPID : 0x000000B000000000
2020-12-27T02:49:10.14711 MCA_ADDR : 0x0100000B96473D40
2020-12-27T02:49:10.14712 MCA_SYND : 0x0000000000000000
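
For anyone trying to read a dump like the one above, the MCA_STATUS and MCA_IPID values can be unpacked field by field. Below is a minimal Python sketch, not an AMD tool. The bit positions are the ones the Linux kernel uses for AMD's Scalable MCA registers as far as I know, and the small `ASSUMED_BLOCKS` table is an assumption based on my reading of the kernel's SMCA tables; both should be checked against the Processor Programming Reference for the exact part. The function names are made up for illustration.

```python
# Sketch: unpack MCA_STATUS and MCA_IPID values from the dump above.
# Bit positions are assumed to follow the AMD Scalable MCA layout.
STATUS_FLAGS = {
    63: "Val", 62: "Overflow", 61: "UC", 60: "En", 59: "MiscV",
    58: "AddrV", 57: "PCC", 55: "TCC", 53: "SyndV",
    46: "CECC", 45: "UECC", 44: "Deferred", 43: "Poison",
}

# ASSUMPTION: (HardwareID, McaType) -> block name. Verify against the PPR.
ASSUMED_BLOCKS = {
    (0x96, 0x0): "UMC (Unified Memory Controller)",
    (0xB0, 0x0): "LS (Load-Store unit)",
}


def decode_status(status):
    """Return the set flag names plus the two error-code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def decode_ipid(ipid):
    """Return (HardwareID, McaType): bits 43:32 and bits 63:48."""
    return (ipid >> 32) & 0xFFF, (ipid >> 48) & 0xFFFF


if __name__ == "__main__":
    # (bank number, MCA_STATUS, MCA_IPID) copied from the log above
    records = [
        (17, 0xDC2030000000011B, 0x0000009600250F00),
        (0, 0xBC002800000C0135, 0x000000B000000000),
    ]
    for bank, status, ipid in records:
        hwid, mca_type = decode_ipid(ipid)
        block = ASSUMED_BLOCKS.get((hwid, mca_type), "unknown")
        info = decode_status(status)
        print(f"bank {bank}: HWID={hwid:#x} McaType={mca_type:#x} -> {block}")
        print(f"  flags={info['flags']} ext={info['error_code_ext']:#x} "
              f"code={info['error_code']:#06x}")
```

If the assumed table is right, the point is that the block type comes from MCA_IPID rather than from the bank number, so a bank that a per-family table lists as reserved can still identify itself.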

1 Reply
Soul_keeper
Journeyman III

I've been wondering the same thing since I bought this CPU over a year ago.

I really wish someone would answer this.

Certain programs appear to report bank 17 incorrectly. For example, ras-mc-ctl says "bank Unified Memory Controller (bank=17)".

From the 17h Open source register reference pdf provided by AMD:

3.17.2 Mapping of Banks to Blocks
Table 23: MCA Bank to Block Mapping

Banks 17 and 18 are listed as reserved.

AMD has not released a 19h version of this PDF.

dmesg:

[Tue Jan 3 10:33:28 2023] mce: [Hardware Error]: Machine check events logged

ras-mc-ctl --errors | tail

...

529 2023-01-03 10:33:28 -0700 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x3f1b62440, misc=0xd01a000101000000, walltime=0x63b466e9, cpuid=0x00a20f12, bank=0x00000011
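
To cross-check what ras-mc-ctl prints, the `key=value` fields in that line can be pulled out and the status word tested directly. This is only a rough sketch: the regular expression and the `parse_fields` helper are my own, not part of rasdaemon, and the CECC and UC bit positions (46 and 61) are assumed to follow the AMD MCA_STATUS layout used by the Linux kernel.

```python
import re

# The error line from ras-mc-ctl, copied from the output above.
LINE = (
    "529 2023-01-03 10:33:28 -0700 error: Corrected error, no action required., "
    "CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, "
    "memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, "
    "addr=0x3f1b62440, misc=0xd01a000101000000, walltime=0x63b466e9, "
    "cpuid=0x00a20f12, bank=0x00000011"
)

CECC_BIT = 46  # assumed: corrected ECC error
UC_BIT = 61    # assumed: uncorrected error


def parse_fields(line):
    """Collect key=value pairs as integers (hex or decimal).

    A key that appears twice (here: bank) keeps its last value.
    """
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)", line)
    return {key: int(value, 0) for key, value in pairs}


if __name__ == "__main__":
    fields = parse_fields(LINE)
    status = fields["status"]
    print("bank:", fields["bank"])
    print("corrected ECC:", bool((status >> CECC_BIT) & 1))
    print("uncorrected:", bool((status >> UC_BIT) & 1))
```

For this line the two bank fields agree (0x11 is 17), and the status word has the corrected-ECC bit set and the uncorrected bit clear, which matches the "Corrected error" text.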

These errors occur about once a day for me, and it is still a mystery what bank 17 is.

Can someone please clarify?

Thanks.
