
linusnilsson
Adept I

Random crashes PIB Threadripper PRO 3995WX (MCE)

Hi,

I've recently purchased a PIB Threadripper PRO 3995WX. An initial stress test revealed no issues; however, I've since experienced two separate fatal crashes (both resulting in an immediate reboot) during my work. The crashes appear to have similar origins.

Aug 16 09:23:10 fedora kernel: BERT: Error records from previous boot:
Aug 16 09:23:10 fedora kernel: [Hardware Error]: event severity: fatal
Aug 16 09:23:10 fedora kernel: [Hardware Error]:  Error 0, type: fatal
Aug 16 09:23:10 fedora kernel: [Hardware Error]:  fru_text: ProcessorError
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Local APIC_ID: 0x77
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   CPUID Info:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000000: 00830f10 00000000 77800800 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Error Information Structure 0:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Error Structure Type: cache error
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Check Information: 0x000000000606001f
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Transaction Type: 2, Generic
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Operation: 1, generic read
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Level: 0
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Processor Context Corrupt: true
Aug 16 09:23:10 fedora kernel: [Hardware Error]:     Uncorrected: true
Aug 16 09:23:10 fedora kernel: [Hardware Error]:   Context Information Structure 0:
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    Register Array Size: 0x0050
Aug 16 09:23:10 fedora kernel: [Hardware Error]:    MSR Address: 0xc0002061
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: Machine check events logged
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: TSC 0 MISC d01c0ff500000000 SYND 4d000000 IPID 600b000000000
Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1629098589 SOCKET 0 APIC 77 microcode 830104d
Aug 16 09:23:10 fedora kernel: PM:   Magic number: 1:777:369


Aug 16 14:04:51 fedora kernel: BERT: Error records from previous boot:
Aug 16 14:04:51 fedora kernel: [Hardware Error]: event severity: fatal
Aug 16 14:04:51 fedora kernel: [Hardware Error]:  Error 0, type: fatal
Aug 16 14:04:51 fedora kernel: [Hardware Error]:  fru_text: ProcessorError
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Local APIC_ID: 0x76
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   CPUID Info:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000000: 00830f10 00000000 76800800 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Error Information Structure 0:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Error Structure Type: cache error
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Check Information: 0x000000000606001f
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Transaction Type: 2, Generic
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Operation: 1, generic read
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Level: 0
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Processor Context Corrupt: true
Aug 16 14:04:51 fedora kernel: [Hardware Error]:     Uncorrected: true
Aug 16 14:04:51 fedora kernel: [Hardware Error]:   Context Information Structure 0:
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    Register Array Size: 0x0050
Aug 16 14:04:51 fedora kernel: [Hardware Error]:    MSR Address: 0xc0002061
Aug 16 14:04:51 fedora kernel: PM:   Magic number: 1:1:77
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: Machine check events logged
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: CPU 59: Machine Check: 0 Bank 6: baa0000000050118
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: TSC 0 MISC d01c0ff500000000 SYND 4d000000 IPID 600b000000000
Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1629115490 SOCKET 0 APIC 76 microcode 830104d

dmesg -T | grep -i error reveals no additional errors

The motherboard is an ASUS Pro WS WRX80E-SAGE SE WIFI with the latest BIOS, and all 8 DIMM slots are populated with 64GB ECC 3200 RDIMMs from the QVL. No BIOS changes related to CPU or memory. I'm running Fedora 34 with kernel 5.13.6-200.fc34.x86_64. So everything is very vanilla.

Looking at the PPR reference [1] I believe the MSR I see (MSR Address: 0xc0002061) is described on page 245, but I'm not sure.

Question: Has anyone experienced something similar, and do you have any advice on how to proceed? I do compilation and extensive laboratory work, and stability is of the highest importance.

Thank you.

[1] https://developer.amd.com/wp-content/resources/55803_B0_PUB_0_91.pd

linusnilsson
Adept I

Another one today:

 

Aug 17 17:04:23 fedora kernel: BERT: Error records from previous boot:
Aug 17 17:04:23 fedora kernel: [Hardware Error]: event severity: fatal
Aug 17 17:04:23 fedora kernel: [Hardware Error]:  Error 0, type: fatal
Aug 17 17:04:23 fedora kernel: [Hardware Error]:  fru_text: ProcessorError
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   Local APIC_ID: 0x77
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   CPUID Info:
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   00000000: 00830f10 00000000 77800800 00000000
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   Error Information Structure 0:
Aug 17 17:04:23 fedora kernel: [Hardware Error]:    Error Structure Type: cache error
Aug 17 17:04:23 fedora kernel: [Hardware Error]:    Check Information: 0x000000000606001f
Aug 17 17:04:23 fedora kernel: [Hardware Error]:     Transaction Type: 2, Generic
Aug 17 17:04:23 fedora kernel: [Hardware Error]:     Operation: 1, generic read
Aug 17 17:04:23 fedora kernel: [Hardware Error]:     Level: 0
Aug 17 17:04:23 fedora kernel: [Hardware Error]:     Processor Context Corrupt: true
Aug 17 17:04:23 fedora kernel: [Hardware Error]:     Uncorrected: true
Aug 17 17:04:23 fedora kernel: [Hardware Error]:   Context Information Structure 0:
Aug 17 17:04:23 fedora kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Aug 17 17:04:23 fedora kernel: [Hardware Error]:    Register Array Size: 0x0050
Aug 17 17:04:23 fedora kernel: [Hardware Error]:    MSR Address: 0xc0002061
Aug 17 17:04:23 fedora kernel: mce: [Hardware Error]: Machine check events logged
Aug 17 17:04:23 fedora kernel: mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118
Aug 17 17:04:23 fedora kernel: mce: [Hardware Error]: TSC 0 MISC d01c0ff500000000 SYND 4d000000 IPID 600b000000000
Aug 17 17:04:23 fedora kernel: mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1629212662 SOCKET 0 APIC 77 microcode 830104d

 


You posted that you stress tested the processor. What test did you use?

Was the Processor's Temperature within its Maximum Operating Range of 90C?

Try stress testing the processor and PSU by using OCCT Stress test software.

Check for unusual CPU Temperatures or PSU outputs outside of its 5% tolerances or any other unusual data from testing the CPU and PSU.

It could also be incompatible RAM or a RAM settings issue such as timings or voltages.

Is anything OCed like your CPU or GPU?

Not familiar with Fedora Linux OS so someone else would need to help you for that. 

Generally, a computer shutting down by itself, as though you pressed the Off switch on the computer case, indicates one of the following:

1- Overheating hardware

2- Power issues

3- Improper overclocking

4- Failing or defective hardware

5- Incompatible hardware issues

6- Possibly driver conflicts or issues

Try doing a CMOS CLEAR to reset the BIOS to factory standards in case it is a BIOS setting causing the issue.

To eliminate RAM issues try using one stick of RAM and see if the crashing continues or not. If one stick is not good enough to run the computer then install 2 sticks in the proper DIMM slots to check if the computer crashes or not.

It could also be an HDD/SSD issue, so you might want to test the drive for errors or signs of failure.

Anyways, if it isn't a hardware issue and rather either a driver issue or Settings issue someone else will need to assist you.

You can also post your question at AMD Developer's Forum and see if they believe it applies to one of the developer's forums or not from here since you seemed to be a developer yourself: https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forum

Also maybe at AMD Server Gurus, since the Threadripper Pro is similar to an EPYC processor except that it is aimed at consumers: https://community.amd.com/t5/server-gurus/ct-p/amd-server-gurus

 

 


Basically just sysbench, dd, fio to catch obvious issues such as incorrect installation, abnormal temperatures etc.

All crashes so far have happened during basically idle conditions. The mainboard includes OOB management, and the logged CPU temperatures have been below +55C before crashing. Power load is far from max.

No OC, all stock settings in BIOS.

[1-6] These certainly may cause crashes, but in my experience they leave more traces in the logs. The MCA error that does get logged suggests a problem with the CPU (FP Machine Check Control). The crashes so far also suggest a repeatable issue with CPU 123 and bank 6.
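
To check how repeatable the pattern is, the mce lines can be tallied per CPU, bank and status value. A minimal sketch in Python, assuming the log lines have the same shape as the ones posted above; `tally_mce` and `SAMPLE` are names made up for this illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def tally_mce(lines):
    """Count machine check records per (cpu, bank, status) tuple."""
    counts = Counter()
    for line in lines:
        m = MCE_LINE.search(line)
        if m:
            counts[(int(m["cpu"]), int(m["bank"]), m["status"].lower())] += 1
    return counts


# Sample lines copied from the logs posted in this thread.
SAMPLE = [
    "Aug 16 09:23:10 fedora kernel: mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118",
    "Aug 16 14:04:51 fedora kernel: mce: [Hardware Error]: CPU 59: Machine Check: 0 Bank 6: baa0000000050118",
    "Aug 17 17:04:23 fedora kernel: mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118",
]

if __name__ == "__main__":
    for (cpu, bank, status), n in sorted(tally_mce(SAMPLE).items()):
        print(f"CPU {cpu:3d}  bank {bank}  status 0x{status}  count {n}")
```

The same function should work on any iterable of log lines, for example the lines of a saved kernel log file, as long as the mce line format matches.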

Thank you for the links to other forums. I will check with the server gurus.

Check Event Viewer ==> Custom views ==> administrative events 

Look for WHEA errors....   Even the ones that say they were correctable.

The only hardware error that I know that corrects itself is ECC memory, which you have thank goodness.

Your dump mentions Cache errors,  Banks of Memory, and Corrupt Processor Context (a memory whack that involves some of the most important memory used by the operating system)

So the question is, did hardware cause this (running too fast) or did software?

Your two dumps are very similar.  They both involve Apic x'77'.  It has been my experience shooting machine code over the years that Similar dumps tend to indicate something wrong with the software. 

Is any of your software bound to Apic x'77', or do they have an affinity towards it? 

Hardware errors tend to be much more random, in time and Place of corruption.

Continue to monitor Event viewer, even if you don't crash.

But I would think this is a problem with Operating system, driver or Application program.

 


I'm not running Windows, but the equivalent errors in Linux are usually logged to syslog and/or dmesg.

rasdaemon has so far not reported any memory errors. The errors appear to come from the CPU.

I believe the banks as referred to in the output is banks of MSRs and not banks of DRAM memory (but I could be wrong).

Yes indeed the crash logs are similar. I've got another crash today...

[ 2.758272] BERT: Error records from previous boot:
[ 2.758273] [Hardware Error]: event severity: fatal
[ 2.758274] [Hardware Error]: Error 0, type: fatal
[ 2.758274] [Hardware Error]: fru_text: ProcessorError
[ 2.758275] [Hardware Error]: section_type: IA32/X64 processor error
[ 2.758276] [Hardware Error]: Local APIC_ID: 0x77
[ 2.758276] [Hardware Error]: CPUID Info:
[ 2.758278] [Hardware Error]: 00000000: 00830f10 00000000 77800800 00000000
[ 2.758279] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[ 2.758280] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 2.758281] [Hardware Error]: Error Information Structure 0:
[ 2.758281] [Hardware Error]: Error Structure Type: cache error
[ 2.758282] [Hardware Error]: Check Information: 0x000000000606001f
[ 2.758282] [Hardware Error]: Transaction Type: 2, Generic
[ 2.758283] [Hardware Error]: Operation: 1, generic read
[ 2.758284] [Hardware Error]: Level: 0
[ 2.758284] [Hardware Error]: Processor Context Corrupt: true
[ 2.758285] [Hardware Error]: Uncorrected: true
[ 2.758285] [Hardware Error]: Context Information Structure 0:
[ 2.758286] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
[ 2.758286] [Hardware Error]: Register Array Size: 0x0050
[ 2.758287] [Hardware Error]: MSR Address: 0xc0002061
[ 2.758322] mce: [Hardware Error]: Machine check events logged
[ 2.758323] mce: [Hardware Error]: CPU 123: Machine Check: 0 Bank 6: baa0000000050118
[ 2.758402] mce: [Hardware Error]: TSC 0

 


What I draw from this is the predictability.

The machine check seems to emanate from the processor context being overwritten.

So the machine check itself seems to stem from some memory having been clobbered.

If it were hardware failing, I doubt that it would fail in the exact same place every time.

Example: An instruction says to divide the contents of Register 10 by the value at address x'12345678', but oops, that was stepped on and is now zero, and we all know we can't divide by zero. Raise a machine check. Thus there is nothing wrong with the hardware. It is being told to do something it can't carry out. Thus I lean towards software bugs, which are much more reproducible.

I am surprised, though, that the code seems to be dispatched to the same core each time it fails.

Is there any way to turn that core off?
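
On Linux, a logical CPU can usually be taken offline at runtime by writing 0 to /sys/devices/system/cpu/cpuN/online as root. A minimal sketch in Python; the helper name `set_cpu_online` and its `root` parameter are made up for this illustration, and whether offlining the core would actually avoid these crashes has not been tested here.

```python
from pathlib import Path

# Standard sysfs location for per-CPU controls on Linux.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, root=SYSFS_CPU):
    """Write 1 or 0 to cpuN/online under root; needs root privileges on a real system."""
    path = Path(root) / f"cpu{cpu}" / "online"
    path.write_text("1" if online else "0")
    return path


# Example (run as root): take both logical CPUs named in the crash logs offline.
#   for cpu in (59, 123):
#       set_cpu_online(cpu, online=False)
```

The setting does not persist across reboots, so it would have to be reapplied after each boot.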


I've looked at the MSR bank 6 contents and decoded them with the help of the PPR reference linked to in earlier posts. As mentioned, I believe the MSR (MSR Address: 0xc0002061) is described on page 245 as MCA::FP::MCA_STATUS_FP.
 
The lowest 16 bits of 0xbaa0000000050118 are 0x0118 = 0b0000 0001 0001 1000, which according to table 3.1.33 on page 202 is the memory error code type (0000 0001), with memory transaction type Generic Read (0001), transaction type Generic (10) and cache level L0: Core (00). This "manual decoding" luckily aligns well with the textual part of the dmesg output.
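
The manual decoding can be reproduced with a few lines of Python. This is a sketch, not a full decoder: it assumes the architectural x86 layout for the status flag bits 63 to 57 and the memory error code format 0000 0001 RRRR TTLL, and it ignores the vendor-specific fields in between. The names `decode_status`, `FLAGS`, `REQUEST`, `TRANSACTION` and `LEVEL` are made up for this illustration.

```python
STATUS = 0xBAA0000000050118  # bank 6 status value from the mce log lines

# Architectural x86 MCi_STATUS flag bits (bit number -> short name).
FLAGS = {63: "Val", 62: "Overflow", 61: "UC", 60: "En",
         59: "MiscV", 58: "AddrV", 57: "PCC"}

# Field encodings for the memory error code format 0000 0001 RRRR TTLL.
REQUEST = {0b0000: "generic", 0b0001: "generic read", 0b0010: "generic write",
           0b0011: "data read", 0b0100: "data write",
           0b0101: "instruction fetch", 0b0110: "prefetch",
           0b0111: "evict", 0b1000: "snoop"}
TRANSACTION = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "generic"}


def decode_status(status):
    """Decode the flag bits and, for memory errors, the low 16-bit error code."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "error_code": code,
    }
    if (code & 0xFF00) == 0x0100:  # memory error: 0000 0001 RRRR TTLL
        result["request"] = REQUEST.get((code >> 4) & 0xF, "reserved")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(f"{key}: {value}")
```

For this status value the set flags include UC and PCC, which matches the "Uncorrected: true" and "Processor Context Corrupt: true" lines in the BERT record.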
 
My conclusion is that the issue is with APIC 0x76-0x77 and with reading an FP L0 cache / register. Note that the MCE is thrown because the CPU is unable to read from, I assume, a core register. It does not appear to have been thrown due to some illegal operation. Bad code may crash a kernel, but the CPU should still be able to read from itself.
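
A quick arithmetic check is consistent with APIC 0x76 and 0x77 being the two SMT threads of one physical core. The sketch below assumes SMT is enabled with two threads per core and that the thread ID sits in the lowest APIC ID bit; that layout is an assumption here, not something taken from the PPR. `split_apic_id` and `REPORTED` are names made up for this illustration.

```python
THREADS_PER_CORE = 2  # assumption: SMT enabled, two threads per core


def split_apic_id(apic_id, threads_per_core=THREADS_PER_CORE):
    """Split an APIC ID into (core_index, thread_index).

    Assumes the SMT thread ID occupies the lowest part of the APIC ID.
    """
    return divmod(apic_id, threads_per_core)


# (logical CPU, APIC ID) pairs as reported in the mce log lines.
REPORTED = [(123, 0x77), (59, 0x76)]

if __name__ == "__main__":
    for cpu, apic in REPORTED:
        core, thread = split_apic_id(apic)
        print(f"CPU {cpu:3d}  APIC 0x{apic:02x}  core {core}  thread {thread}")
```

Under these assumptions both APIC IDs map to core index 59, and logical CPUs 59 and 123 differ by exactly 64, the core count of the 3995WX. On the running system, /sys/devices/system/cpu/cpu59/topology/thread_siblings_list should show whether the kernel also treats CPU 59 and CPU 123 as siblings.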

Crashes are still occurring daily and I've asked AMD for an RMA.

Yes, a CPU should be able to read a register.

However, it seems the system is clearly identifying that the processor context is corrupt, and what's more, it cannot correct it.

As the system receives interrupts from outside devices, such as HDDs, external clock generators etc., its hardware flips in a new instruction pointer to handle the interrupt. One of the first things that interrupt handler has to do is save the state of the system so that, when it is time, the last process can pick up where it left off. When the interrupt has been processed, the dispatcher goes down a list of tasks that have been marked ready to run. The dispatcher then programmatically (i.e. using instructions, versus the electrical switch during the interrupt) restores the context of a task to be dispatched. The context must be in a proper format; however, the dispatcher saw it as being corrupt, and reported it so.

It is your Processor, an expensive one at that.   So you do what you feel is right.  However I still believe that this is software.

It could be your application code, or even the Linux kernel.   I suggest just temporarily loading up a disk with windows and seeing if you get the same error.  I doubt it.   I'm just trying to save you from barking up the wrong tree.

I don't know the specifics of your code, but could you port it to a different distribution of Linux?

Good Luck.


A follow-up a couple of months later. I got my replacement 3995WX at the beginning of September. At the time of writing, almost two months later, I've had 0 crashes, which is quite an improvement from almost daily crashes (...). Hardware/OS is identical, software identical (besides package updates). The error message, the troubleshooting and the fact that the replacement made such an apparent difference lead me to conclude the CPU was indeed malfunctioning. Hoping everything will work smoother from here on.

Hello,

I've read everything that happened to you with your AMD Threadripper PRO and I have some questions for you, please.

I've bought 11 Threadripper Pro 3955WX and they are currently running full time.

The problem is that on 10 of the 11 computers I have random crashes (sudden black screen and reboot).

Some facts :

- The computers have the same Asus motherboard as yours

- Computers are running each day and off each night

- A computer is able to run several days without crashing

- I've never seen a computer crashing more than one time in a day

- I have an error in the Windows system event logs pointing to the processor as the problematic device

- I've done a lot of tests, updates and tweaks to try to stop the crashes and reboots; the frequency may have improved somewhat, but nothing is really fixed

My main questions are :

- Do you think my problem is the same as the one you had?

- How did you succeed in getting an AMD RMA? I contacted them and they put me through an email exchange with a lot of generic advice, requests for information, requests for screen captures, and then more generic advice and more requests for information ...

Thanks in advance for your help !

 


Hi,

It is hard to say without more information. Broken CPUs in my experience are not that rare, but they are not common either. If you are having 10 out of 11 CPUs malfunctioning regularly, I would be surprised if there was actually an issue with the CPUs, assuming the CPUs have run within the manufacturer's warranty specification (thermal limits etc). Zen 2 is pretty "old" now and most kernels should have good support, I reckon.

If possible, perhaps pick a couple of CPUs and update the motherboard to the latest BIOS (note that BMC needs to be updated too), reset all BIOS settings, and make sure you are running with QVL components [1]. If the problem goes away it would probably mean that something else is the issue.

RMA process was OK for me. I had to send in pictures of CPU in motherboard and my own analysis/troubleshooting. I only had to RMA a single CPU though.

Good luck!

[1] https://www.asus.com/se/Motherboards-Components/Motherboards/Workstation/Pro-WS-WRX80E-SAGE-SE-WIFI/...
