
Processors

Teary
Journeyman III

Absolute despair. Constant Processor Core fatal hardware errors causing BSOD. Please help!

Hi all, 

I have a brand new PC that I built myself (I'm an experienced builder), and I am experiencing seemingly random BSODs when the PC is idle, after high load, and at any other time it decides.

Windows event logger is consistently showing the following before/after each BSOD: 

EVENT ID: 18 - A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

OR

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
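
When comparing several crashes, it can help to tally which error type and APIC ID show up most often. Below is a minimal Python sketch, assuming the event details are pasted as plain text in the same "Key: value" layout as above. The function name `tally_whea_events` and the `sample` text are illustrative, not part of any Windows tool.

```python
from collections import Counter


def tally_whea_events(log_text):
    """Count (Error Type, APIC ID) pairs in pasted Event ID 18 details.

    Assumes each event lists an "Error Type" line followed later by a
    "Processor APIC ID" line, as in the entries quoted in this post.
    """
    counts = Counter()
    error_type = None
    for raw_line in log_text.splitlines():
        key, sep, value = raw_line.partition(":")
        if not sep:
            continue  # not a "Key: value" line (blank line, "OR", etc.)
        key, value = key.strip(), value.strip()
        if key == "Error Type":
            error_type = value
        elif key == "Processor APIC ID" and error_type is not None:
            counts[(error_type, value)] += 1
            error_type = None  # wait for the next event's Error Type
    return counts


# Sample text copied from the two entries quoted in this post.
sample = """\
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
"""

for (etype, apic_id), n in tally_whea_events(sample).items():
    print(f"{etype} on APIC ID {apic_id}: {n}")
```

If every entry points at the same APIC ID, that is worth mentioning when asking for help, since it narrows the reports down to one logical processor.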

I have seen others with similar issues and tried similar resolutions but to no avail. 

I have tried the following with crashes still happening after each: 

  • Fresh install of Windows (and installed all latest updates)
  • updated chipset drivers
  • reset BIOS to optimized defaults and ran as normal
  • manually set DRAM speed and FCLK to XMP speed
  • enabled/disabled PBO in BIOS
  • updated other system drivers using DriverBooster 8.1
  • ran sfc /scannow (which found errors and repaired them)
  • ran DISM /Online /Cleanup-Image /RestoreHealth
  • ran CHKDSK /R /F
  • completed MemTest86 4 hour run with no errors
  • completed OCCT full system load test with no errors

Running HWInfo64 whilst idle and during load shows no overheating, no strange voltages (not an expert), and no throttling or weird behaviour that I can see any different to my other similar PCs that work fine.

As I'm writing this, the PC actually just BSOD'd again... same errors as above... thank god for draft saving! Anyway...

I am absolutely stuck, and I cannot have this PC running into errors every 30 minutes out of nowhere; it's almost unusable. Do I need to take the whole system apart and rebuild it? Could it be that the CPU cooler is too tight? Could it be a hardware fault with the MOBO or CPU? Could it be the proprietary included ASUS NVME add-in card?

Full system specs:

AMD Ryzen 9 5950X

ASUS CROSSHAIR VIII IMPACT (X570, Mini DTX)

2x 16GB Corsair Dominator Platinum 3600MHz CL16

4TB Corsair MP600 CORE Gen4 NVME SSD (installed using the included ASUS NVME add-in card)

ASUS ROG STRIX RTX 3090 GAMING OC

 

If anyone can look into this for me, or needs more info, I will monitor this thread regularly and provide any further details as required.

Many thanks in advance

Daniel

0 Likes
11 Replies
Gwillakers
Challenger

 

Get yourself back to square one: go into the BIOS and load Optimized Defaults.

First of all, your machine check is a hardware error, so stop messing with software. The only software setting you should look into is the Windows power management policy: make sure that Maximum processor state and Minimum processor state are normal (Max 100%, Min 10-15%).
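
For anyone who prefers the command line to the Power Options dialog, here is a hedged Python sketch that only builds (and prints) the `powercfg` commands for setting the minimum and maximum processor state; it does not run them. It assumes the aliases `SCHEME_CURRENT`, `SUB_PROCESSOR`, `PROCTHROTTLEMIN` and `PROCTHROTTLEMAX` are available on your system (check with `powercfg /aliases`). The helper name `processor_state_commands` is made up for this example.

```python
def processor_state_commands(min_percent, max_percent):
    """Build powercfg command lines for min/max processor state.

    Returns a list of argument lists; nothing is executed here.
    The alias names are assumptions: confirm them with `powercfg /aliases`.
    """
    if not (0 <= min_percent <= max_percent <= 100):
        raise ValueError("need 0 <= min_percent <= max_percent <= 100")

    commands = []
    # Apply the same values on AC (plugged in) and DC (battery) power.
    for setter in ("/setacvalueindex", "/setdcvalueindex"):
        commands.append(["powercfg", setter, "SCHEME_CURRENT",
                         "SUB_PROCESSOR", "PROCTHROTTLEMIN", str(min_percent)])
        commands.append(["powercfg", setter, "SCHEME_CURRENT",
                         "SUB_PROCESSOR", "PROCTHROTTLEMAX", str(max_percent)])
    # Re-activate the current scheme so the new values take effect.
    commands.append(["powercfg", "/setactive", "SCHEME_CURRENT"])
    return commands


# Values from the suggestion above: minimum 10%, maximum 100%.
for cmd in processor_state_commands(10, 100):
    print(" ".join(cmd))
```

The printed lines can be pasted into an elevated Command Prompt after you have reviewed them.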

Machine check is a phrase that has been used for decades in data processing. It does not mean that your processor chip is broken. It does mean that somehow it was told to do something it could not. For example: divide by zero; execute the next instruction when the data at that location is garbage and doesn't form an instruction; grab a piece of data at a location this task does not own; store some data at location 84 billion when there is no RAM covering that address; add the number 458 to the number "Cow".

The voltage range, along with the corresponding frequencies, is the normal culprit for these crashes. Having these specified incorrectly leads to data corruption. If you look at all the examples of machine checks that I laid out, you will notice that they are, in one form or another, data corruption. Most of the time, machine checks come from data corruption. Why people don't use error-correcting memory is beyond me; I guess they would rather run fast than reliable.

So stop overclocking your DIMMs along with the internal memory controller. Your memory should be brought down to 3200MHz.

Yes, your memory might be able to support that speed, but the internal memory controller in the Ryzen processor is only spec'd for 3200MHz (in the best case!). (I see that you are running DTX, so it's good you are not trying to cram in 4 sticks.)

The faster you run your memory, the less time it has to store and retrieve the information. The fewer wait states you give your memory, the less time it has to store and retrieve the information. The faster you run the cores, the less time the memory subsystem has. This is very much like the game "hide and go seek". You remember the phrase in that game? After the countdown it's "ready or not, here I come!" Two parties (working somewhat independently) have an agreed amount of time to complete a task; if the memory isn't ready in time, we all lose.
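
To put rough numbers on "less time", here is a small Python sketch of the standard DDR timing arithmetic: the memory clock is half the rated transfer rate, and one cycle lasts 1000 / clock_MHz nanoseconds. The function names are illustrative, and keeping CL16 at 3200 is an assumption made only for comparison with the 3600 CL16 kit in the original post.

```python
def cycle_time_ns(data_rate_mts):
    """Length of one memory clock cycle in nanoseconds.

    DDR transfers data twice per clock, so a "3600" kit runs an 1800 MHz clock.
    """
    clock_mhz = data_rate_mts / 2
    return 1000.0 / clock_mhz


def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency converted from clock cycles to nanoseconds."""
    return cas_cycles * cycle_time_ns(data_rate_mts)


# Compare the rated XMP speed with the 3200 setting suggested above.
# CL16 at 3200 is assumed here purely for illustration.
for rate in (3200, 3600):
    print(f"DDR4-{rate} CL16: cycle {cycle_time_ns(rate):.3f} ns, "
          f"CAS {cas_latency_ns(rate, 16):.2f} ns")
```

At the higher rate every cycle is shorter, so the memory and the controller each get less time per step, which is the point of the "ready or not" analogy.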

People just overlook the clue that one of these types of errors specifically mentions "Cache Hierarchy", and the other error results from some sort of corruption.

=================    Recommendations:

You mentioned that your system crashes while idle after some load. When the processor goes idle, it reduces its frequency and also lowers its own voltage. However, this minute amount of voltage might not be enough when the silicon is hot. Remember, it takes more voltage to push current through a hot chip than a cold one.

For this case, let's increase the voltage to the cores and the internal memory controller by the smallest amount possible. Change VCore from Auto to Normal. A displacement field should open up; set it to +0.006. Likewise, change VSoc from Auto to Normal. Another displacement field should open up; set the displacement to +0.006. A little goes a long way.

Doing the above keeps the same voltage curve as Auto, but raises both the low end and the high end of the curve by 0.006 V.
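
As a quick illustration of what a fixed displacement does to a curve, here is a tiny Python sketch. The three voltage points are made-up placeholders, not real 5950X values, and `apply_offset` is a hypothetical helper; the only thing it demonstrates is that every point moves up by the same 0.006 V while the shape of the curve stays the same.

```python
def apply_offset(curve_volts, offset_volts):
    """Shift every point of a voltage curve by the same fixed offset."""
    return [round(v + offset_volts, 3) for v in curve_volts]


# Placeholder idle / mid / boost voltages, invented for this example only.
auto_curve = [0.900, 1.100, 1.350]
shifted = apply_offset(auto_curve, 0.006)

for before, after in zip(auto_curve, shifted):
    print(f"{before:.3f} V -> {after:.3f} V")
```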

We were not looking to increase the high end, as the high voltages entice the system to use higher frequencies that the internal memory controller cannot handle.

So: we will cut back on the power limit (PPT) to discourage the system from using the highest frequencies.

Set PBO to advanced.

Set Limits to Manual

Set PPT down from 142 to 120  (watts)

==========================

Some people also have problems with temperature. Don't mess with voltages to try to get temps under control.

It is better to have Ryzen control the Temps.   

(The following is optional)

Set Thermal Limit to Manual.

In the new field that opens up ("Thermal Limit"), set it to what you consider a decent value in Celsius (I like 70).

=======================

Making the changes above should result in a system that raises the floor just a bit, pushes the ceiling down a bit, and runs more reliably and cooler.

 

Hi! Thanks for this, really helpful. 

 

I will be implementing your suggested changes in the BIOS shortly and I’ll get back to you with the results. 


0 Likes

did you solve it?

0 Likes

Way late to this, but I have been dealing with this exact issue with a very similar build, and this was the closest set of symptoms I could find online, so I tried everything mentioned here. I updated all drivers, rolled back drivers, bought a new PSU, and adjusted voltages. I was convinced it was not a hardware issue because I have a way overspecced build for what I do. It turned out to be a memory module.

Get the free version of MemTest86. It requires creating a bootable USB drive, but it's super easy. Right off the bat it indicated that memory was causing errors. I ended up putting each of my four modules in and testing them one at a time until one (of course the fourth) showed errors. I tested each module in its original slot, and once the fourth one failed, I put it in the first slot to rule out anything else, and it failed again. Diagnosis: bad memory module. No problems since removing it.

0 Likes

Hi, I know this is an old post but since I have the same issues I read it and I noticed you said  Minimum processor state should be 10-15%. It was at 5%, so I moved it to 10% before doing anything else. I read everywhere only 5% is normal though. Are you sure 10% is fine?

I only just changed this, so I don't know whether it fixes anything. I have stock settings, by the way, with no overclock.

Thanks!

0 Likes
USN-DC2-RET
Adept I

This is sort of out of my pay grade currently, but I can confirm AMD told me on the phone, presale, that the 5950X supports up to 3200MHz RAM. Hope you get it straight and working as desired. As for overclocking, it seems like a fun hobby within the hobby, but I don't do it. I did some back in the day, but not anymore.

AMD 5950X, ASUS Dark Hero, Corsair 128G DP 3600 CL18, Corsair H170i 420mm AIO, ASUS RTX 3090 24G OC, Corsair AX 1600i, Samsung 980 Pro 2T X 2, Corsair 7000D Airflow, Samsung 32 @ 4K.
0 Likes

You didn't post the make and model of your PSU.

Go into the BIOS and see if you have a PSU setting called “Power supply idle control”. Most of the time it is set to "Auto" or "Low". If it is, change it to "TYPICAL". See if that fixes your issue.

This website explains about the above setting in more detail: https://techysmag.com/856/power-supply-idle-control/

2. Compatibility Issues: Some systems may not be compatible with specific Power supply idle control settings, resulting in instability or crashes.

Some PSUs, normally older or low-quality ones, can't support this feature and will crash or shut down your PC when the processor is at or near idle.

 

Koyote7667
Challenger

,,,,

0 Likes
lose311
Journeyman III

I had a similar problem, though mine was crashing mostly under load, playing fullscreen games (Ryzen 5900X, MSI MPG X570 Gaming Plus, RTX3080, full specs here). Tried fresh Windows 11/10, multiple software/driver/BIOS configs, memtest, sfc checks, etc. Nothing worked.

The following settings resolved it for me. All in the MSI BIOS:

All other BIOS settings default, overclocking disabled (this is how I've always run my PCs, nothing new on that front for me). I also got a new CPU cooler (Noctua NH-U14S) and case fans since this CPU seems to run hot but I'm pretty sure it was the BIOS settings above that actually fixed it for me.

I realize with these settings I'm losing out on some performance, so I still want to determine if any of them can be set back to default. But I'd rather have a stable PC, so I'm leaving them for now. If I can find a better balance I'll update this post.

Thanks to everyone in this thread (and many others). This was the one that finally gave me the pieces I needed. 

0 Likes
IceGlider
Adept I

Although this thread is dated, I can see that it remains a current problem.

To my understanding, there are three likely causes.

1) Bad BIOS settings forcing data loss;

2) CPU overheating problem (cheap thermal paste?);

3) Cheap or failing power supply. Some PSUs cannot handle voltage spikes under load, so a proper power bar or UPS (with a sine wave regulator) could give your PC components better protection from an unstable power grid.

Hoping this will help, good luck to you!

0 Likes
misterj
Big Boss

Teary, please post screenshots of the Details tab for several of your crashes from the Event Viewer. Thanks and enjoy, John.

0 Likes