
Processors

Teary
Journeyman III

Absolute despair. Constant Processor Core fatal hardware errors causing BSOD. Please help!

Hi all, 

I have a brand new PC, built myself (experienced builder) and I am experiencing seemingly random BSODs when the PC is idle, after high load and any other time it decides. 

Windows event logger is consistently showing the following before/after each BSOD: 

EVENT ID: 18 - A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

OR

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
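
For anyone triaging the same thing, here is a minimal sketch of tallying these entries by error type and APIC ID, so it is easy to see whether the errors always land on the same core. It assumes the event text has been copied out of Event Viewer as plain text in the layout quoted above; the function name `tally_whea_errors` is made up for illustration and is not part of any Windows tool.

```python
import re
from collections import Counter

# Sample text copied from the two Event ID 18 entries quoted above.
SAMPLE_LOG = """\
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
"""


def tally_whea_errors(text):
    """Count (error type, APIC ID) pairs in pasted Event ID 18 text."""
    counts = Counter()
    error_type = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"Error Type:\s*(.+)", line)
        if match:
            # Remember the type until the matching APIC ID line shows up.
            error_type = match.group(1)
            continue
        match = re.match(r"Processor APIC ID:\s*(\d+)", line)
        if match and error_type is not None:
            counts[(error_type, int(match.group(1)))] += 1
            error_type = None
    return counts


if __name__ == "__main__":
    for (etype, apic), n in sorted(tally_whea_errors(SAMPLE_LOG).items()):
        print(f"APIC ID {apic}: {etype} x{n}")
```

If every entry points at the same APIC ID, that is a hint (not proof) that one core or its cache is involved rather than the whole chip.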

I have seen others with similar issues and tried similar resolutions but to no avail. 

I have tried the following with crashes still happening after each: 

  • Fresh install of Windows (and installed all latest updates)
  • update chipset drivers
  • reset BIOS to optimized defaults and ran as normal
  • manually set DRAM speed and FCLK to XMP speed
  • enabled/disabled PBO in BIOS
  • update other system drivers using DriverBooster 8.1
  • ran sfc /scannow (which found errors and repaired them)
  • ran DISM /Online /Cleanup-Image /RestoreHealth
  • ran CHKDSK /R /F
  • completed MemTest86 4 hour run with no errors
  • completed OCCT full system load test with no errors

Running HWInfo64 whilst idle and during load shows no overheating, no strange voltages (not an expert), and no throttling or weird behaviour that I can see any different to my other similar PCs that work fine.

As I'm writing this the PC actually just BSOD'd again... same errors as above... thank god for draft saving! Anyway...

I am absolutely stuck and I cannot have this PC running into errors every 30 minutes out of nowhere; it's almost unusable. Do I need to take the whole system apart and rebuild it? Could it be the CPU cooler is too tight? Could it be a hardware fault with the MOBO or CPU? Could it be the proprietary included ASUS NVME add-in card?

Full system specs:

AMD Ryzen 9 5950x

ASUS CROSSHAIR VIII IMPACT (X570, Mini DTX)

2x 16GB Corsair Dominator Platinum 3600MHz CL16

4TB Corsair MP600 CORE Gen4 NVME SSD (installed using the included ASUS NVME add-in card)

ASUS ROG STRIX RTX 3090 GAMING OC

 

If anyone can look into this for me or requires more info I will monitor this regularly and provide any further details as required.

Many thanks in advance

Daniel

4 Replies
Gwillakers
Challenger

 

Get yourself back to square 1. Go into BIOS and load Optimized defaults.

First of all, your Machine check is a Hardware error, so stop messing with Software. The only thing you should look into on the Software side is the Windows Power Management policy: make sure that Maximum processor state and Minimum processor state are set to normal values (Max 100%, Min 10-15%).

Machine check is a phrase that has been used for decades in Data processing. It does Not mean that your processor chip is broken. It does mean that somehow it was told to do something it could not. For example: divide by zero; execute the next instruction when the data at that location is garbage and doesn't form an instruction; grab a piece of data at a location this task does not own; store some data at location 84 billion when there is no RAM covering that address; add the number 458 to the number "Cow".

The voltage range, along with the corresponding frequencies, is the normal culprit for these crashes. Having these specified incorrectly leads to data corruption. If you look at all the examples of Machine check that I laid out, you will notice that they are, in one form or another, data corruption. Most times machine checks come from Data Corruption. Why people don't use Error Correcting memory is beyond me; I guess they would rather run fast than reliable.
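
A rough sketch of how those two power-plan values could be set from the command line instead of the Control Panel. It only builds and prints the `powercfg` commands; it does not run them. The aliases used (`SCHEME_CURRENT`, `SUB_PROCESSOR`, `PROCTHROTTLEMIN`, `PROCTHROTTLEMAX`) are assumed to match what `powercfg /aliases` reports on your machine, so check that first. The helper name `processor_state_commands` is just for illustration.

```python
def processor_state_commands(min_percent=10, max_percent=100):
    """Build powercfg commands for min/max processor state (AC and DC)."""
    if not 0 <= min_percent <= max_percent <= 100:
        raise ValueError("need 0 <= min_percent <= max_percent <= 100")
    commands = []
    # /setacvalueindex applies when plugged in, /setdcvalueindex on battery.
    for flag in ("/setacvalueindex", "/setdcvalueindex"):
        commands.append(["powercfg", flag, "SCHEME_CURRENT",
                         "SUB_PROCESSOR", "PROCTHROTTLEMIN", str(min_percent)])
        commands.append(["powercfg", flag, "SCHEME_CURRENT",
                         "SUB_PROCESSOR", "PROCTHROTTLEMAX", str(max_percent)])
    # Re-activating the current scheme makes the new values take effect.
    commands.append(["powercfg", "/setactive", "SCHEME_CURRENT"])
    return commands


if __name__ == "__main__":
    # Print the commands for review; paste them into an elevated prompt.
    for cmd in processor_state_commands(min_percent=10, max_percent=100):
        print(" ".join(cmd))
```

Printing rather than executing is deliberate: you can read the commands over before pasting them into an elevated prompt.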

So Stop overclocking your DIMMs along with the Internal Memory controller. Your Memory should be brought down to 3200MHz.  

Yes, your Memory might be able to support that speed, but the internal Memory controller in the Ryzen Processor is only spec'd for 3200MHz (in the best case!). (I see that you are running DTX, so it's good you are not trying to cram in 4 sticks.)

The faster you run your Memory, the less time it has to store and retrieve the information. The fewer wait states that you give your memory, the less time it has to store and retrieve the information. The faster you run the cores, the less time the memory subsystem has. This is very much like the game "hide and go seek". You remember the phrase in that game? After counting down it's "ready or not, here I come"! Two parties (working somewhat independently) have an agreed amount of time to complete a task; if the memory isn't ready in time, we all lose.

People just overlook the clue that one of these types of errors specifically mentions "Cache Hierarchy", and the other error is the result of some sort of corruption.
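
To put rough numbers on "less time", here is a small sketch of the standard DDR timing arithmetic: the memory clock is half the data rate, one clock period is 2000 / data rate (in ns), and CAS latency in ns is CL times that period. The 3600 CL16 figures come from the kit listed in the original post; keeping CL16 at 3200 is an assumption for comparison only, since the actual timings at 3200 depend on the BIOS.

```python
def clock_period_ns(data_rate_mts):
    """One memory clock period in ns (DDR: clock MHz = data rate / 2)."""
    return 2000.0 / data_rate_mts


def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency in ns: CL cycles times the clock period."""
    return cas_cycles * clock_period_ns(data_rate_mts)


if __name__ == "__main__":
    # (data rate in MT/s, CL); the second row assumes CL stays at 16.
    for rate, cl in ((3600, 16), (3200, 16)):
        print(f"DDR4-{rate} CL{cl}: "
              f"period {clock_period_ns(rate):.3f} ns, "
              f"CAS {cas_latency_ns(rate, cl):.3f} ns")
    # DDR4-3600 CL16: period 0.556 ns, CAS 8.889 ns
    # DDR4-3200 CL16: period 0.625 ns, CAS 10.000 ns
```

So at the same CL, dropping from 3600 to 3200 gives every memory operation about 12.5% more time per clock, which is the extra margin being argued for here.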

=================    Recommendations:

You mentioned that your system crashes while idle after some load. When the processor goes idle, it will reduce its frequency and also lower its own voltage. However, this minute amount of voltage might not be enough when the silicon is hot. Remember, it takes more voltage to push current through a hot chip than a cold one.

For this case, let's increase the voltage to the cores and the internal memory controller by the smallest amount possible. Change VCore from Auto to Normal. A displacement field should open up; set it to +0.006. Likewise change VSoc from Auto to Normal. Another displacement field should open up. Set the displacement to +0.006. A little goes a long way.

Doing the above keeps the same Voltage curve as Auto, but it increases the Low end and the High end of the curve by 0.006 V
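
A quick sketch of what "same curve, shifted up" means. The three voltage points below are made-up placeholders, not readings from a 5950X; only the +0.006 V offset comes from the advice above.

```python
OFFSET_V = 0.006  # the offset suggested above

# Hypothetical Auto curve points in volts (illustrative values only).
hypothetical_auto_curve = {
    "idle": 0.900,
    "light load": 1.200,
    "boost": 1.400,
}


def apply_offset(curve, offset_v):
    """Shift every point of a voltage curve by the same offset."""
    return {state: round(volts + offset_v, 3) for state, volts in curve.items()}


if __name__ == "__main__":
    shifted = apply_offset(hypothetical_auto_curve, OFFSET_V)
    for state, volts in hypothetical_auto_curve.items():
        print(f"{state}: {volts:.3f} V -> {shifted[state]:.3f} V")
```

Every point moves by the same 0.006 V, so the gap between the idle and boost ends of the curve stays the same; the whole curve just sits slightly higher.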

We were not looking to increase the high end, as the higher voltages entice the system to use higher frequencies that the internal memory controller cannot handle.

So:  We will cut back on the Power(PPT) to discourage the system from using the Highest frequencies.    

Set PBO to advanced.

Set Limits to Manual

Set PPT down from 142 to 120  (watts)
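
For a sense of scale, a small sketch of the arithmetic behind that PPT change. The 142 W and 120 W figures are from the steps above; the 105 W TDP and the 1.35x TDP-to-PPT relationship are assumptions based on the commonly quoted AM4 stock limits, so treat them as such.

```python
ASSUMED_TDP_W = 105        # assumed rated TDP for the 5950X
ASSUMED_PPT_FACTOR = 1.35  # assumed stock PPT = 1.35 x TDP on AM4

STOCK_PPT_W = 142          # stock limit quoted above
NEW_PPT_W = 120            # suggested manual limit


def ppt_from_tdp(tdp_w, factor=ASSUMED_PPT_FACTOR):
    """Stock package power limit implied by a TDP figure."""
    return tdp_w * factor


def reduction_percent(old_w, new_w):
    """How much lower the new limit is, as a percentage of the old one."""
    return 100.0 * (old_w - new_w) / old_w


if __name__ == "__main__":
    print(f"Implied stock PPT: {ppt_from_tdp(ASSUMED_TDP_W):.2f} W")
    print(f"Reduction: {STOCK_PPT_W - NEW_PPT_W} W "
          f"({reduction_percent(STOCK_PPT_W, NEW_PPT_W):.1f}%)")
```

That works out to a cut of 22 W, roughly 15% of the stock package power budget.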

==========================

Some people also have problems with Temperature. Don't mess with Voltages to try to get temps under control.

It is better to have Ryzen control the Temps.   

(The following is optional)

Set Thermal Limit to Manual.

In the new field that opens up ("Thermal Limit"), set it to what you consider a decent value in Celsius (I like 70).

=======================

Making the changes above should result in a system that raises the floor just a bit, pushes the ceiling down a bit, and runs more reliably and cooler.

 

Hi! Thanks for this, really helpful. 

 

I will be implementing your suggested changes in the BIOS shortly and I’ll get back to you with the results. 



Did you solve it?

USN-DC2-RET
Adept I

This is sort of above my pay grade currently, but I can confirm AMD told me on the phone, presale, that the 5950X supports up to 3200MHz RAM. Hope you get it straight and working as desired. As for overclocking, it seems like a fun hobby within the hobby, but I don't do it; I did some back in the day, but not anymore.

AMD 5950X, ASUS Dark Hero, Corsair 128G DP 3600 CL18, Corsair H170i 420mm AIO, ASUS RTX 3090 24G OC, Corsair AX 1600i, Samsung 980 Pro 2T X 2, Corsair 7000D Airflow, Samsung 32 @ 4K.