linusnilsson
Adept I

Random crashes PIB Threadripper PRO 3995WX (MCE)

Hi,

I've recently purchased a PIB Threadripper PRO 3995WX. Initial stress testing revealed no issues; however, I've since experienced two separate fatal crashes (both resulting in an immediate reboot) during my work. The two crashes appear to have similar origins.

Aug 16 09:23:10 fedora kernel: BERT: Error records from previous boot:
Aug 16 09:23:10 fedora kernel: [Hardware Error]: event severity: fatal
Aug 16 09:23:10 fedora kernel: [Hardware Error]:  Error 0, type: fatal
Aug 16 09:23:10 fedora kernel: [Hardware Error]:  fru_text: ProcessorError
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Local APIC_ID: 0x77
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   CPUID Info:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000000: 00830f10 00000000 77800800 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Error Information Structure 0:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Error Structure Type: cache error
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Check Information: 0x000000000606001f
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Transaction Type: 2, Generic
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Operation: 1, generic read
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Level: 0
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Processor Context Corrupt: true
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Uncorrected: true
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Context Information Structure 0:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Register Array Size: 0x0050
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    MSR Address: 0xc0002061
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: Machine check events logged
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: TSC 0 MISC d01c0ff500000000 SYND 4d000000 IPID 600b000000000
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1629098589 SOCKET 0 APIC 77 microcode 830104d
Aug 16 09:23:10 fedora kernel: PM:   Magic number: 1:777:369


Aug 16 14:04:51 fedora kernel: BERT: Error records from previous boot:
Aug 16 14:04:51 fedora kernel: [Hardware Error]: event severity: fatal
Aug 16 14:04:51 fedora kernel: [Hardware Error]:  Error 0, type: fatal
Aug 16 14:04:51 fedora kernel: [Hardware Error]:  fru_text: ProcessorError
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Local APIC_ID: 0x76
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   CPUID Info:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000000: 00830f10 00000000 76800800 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Error Information Structure 0:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Error Structure Type: cache error
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Check Information: 0x000000000606001f
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Transaction Type: 2, Generic
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Operation: 1, generic read
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Level: 0
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Processor Context Corrupt: true
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Uncorrected: true
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Context Information Structure 0:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Register Array Size: 0x0050
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    MSR Address: 0xc0002061
Aug 16 14:04:51 fedora kernel: PM:   Magic number: 1:1:77
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: Machine check events logged
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: CPU 59: Machine Check: 0 Bank 6: baa0000000050118
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: TSC 0 MISC d01c0ff500000000 SYND 4d000000 IPID 600b000000000
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1629115490 SOCKET 0 APIC 76 microcode 830104d

Running dmesg -T | grep -i error reveals no additional errors.
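
As a rough cross-check of the two records above, the status value from the "Bank 6" lines can be unpacked by hand. This is a minimal sketch, not an authoritative decoder: the bit positions are an assumption based on the MCI_STATUS_* definitions in the Linux kernel sources and should be verified against the PPR, and the `same_core` helper is hypothetical, assuming SMT is enabled and the lowest APIC ID bit selects the thread within a core.

```python
# Flag bit positions (assumed, after the Linux kernel's MCI_STATUS_* macros).
FLAGS = {
    63: "Val",       # register contains a valid error
    62: "Overflow",  # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register valid
    58: "AddrV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register valid
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flags and error codes."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # assumed bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def same_core(apic_a: int, apic_b: int) -> bool:
    """Hypothetical helper: True if two APIC IDs differ only in bit 0."""
    return (apic_a >> 1) == (apic_b >> 1)


info = decode_status(0xBAA0000000050118)  # value logged in both crashes
print(info["flags"])
print(hex(info["ext_error_code"]), hex(info["error_code"]))
print(same_core(0x77, 0x76))  # APIC IDs from the two records
```

Under those assumptions, both crashes report the same uncorrected, context-corrupting error in the same bank, and the two APIC IDs (0x77 and 0x76) differ only in the lowest bit, which would point at the two threads of a single physical core.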

The motherboard is an ASUS Pro WS WRX80E-SAGE SE WIFI with the latest BIOS, and all 8 DIMM slots are populated with 64GB ECC 3200 RDIMMs from the QVL. I have made no changes to BIOS settings related to the CPU or memory. I'm running Fedora 34 with kernel 5.13.6-200.fc34.x86_64, so everything is very vanilla.

Looking at the PPR reference [1], I believe the MSR I see (MSR Address: 0xc0002061) is the one described on page 245, but I'm not sure.
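
One way to sanity-check which register that address refers to is to split it into a bank number and a register offset. The sketch below assumes the per-bank MCA MSR layout used by the Linux kernel's MSR definitions (base 0xC0002000, 0x10 registers per bank, STATUS at offset 1); the names `SMCA_BASE`, `REG_NAMES` and `decode_smca_msr` are made up for illustration, and the layout should be confirmed against the PPR.

```python
SMCA_BASE = 0xC0002000   # assumed address of bank 0's CTL register
REGS_PER_BANK = 0x10     # assumed stride between consecutive banks

# Assumed register offsets within a bank.
REG_NAMES = {
    0: "CTL",
    1: "STATUS",
    2: "ADDR",
    3: "MISC0",
    4: "CONFIG",
    5: "IPID",
    6: "SYND",
}


def decode_smca_msr(addr: int) -> tuple:
    """Return (bank, register name) for an address in the assumed range."""
    offset = addr - SMCA_BASE
    if offset < 0:
        raise ValueError(f"0x{addr:x} is below the assumed MCA MSR base")
    bank, reg = divmod(offset, REGS_PER_BANK)
    return bank, REG_NAMES.get(reg, f"unknown(0x{reg:x})")


print(decode_smca_msr(0xC0002061))  # -> (6, 'STATUS')
```

If the layout assumption holds, 0xc0002061 is the STATUS register of bank 6, which is consistent with the "Bank 6" reported in the mce lines of the log.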

Question: Has anyone experienced something similar, and do you have any advice on how to proceed? I do compilation and extensive laboratory work, and stability is of the highest importance.

Thank you.

[1] https://developer.amd.com/wp-content/resources/55803_B0_PUB_0_91.pd
