Marcus2012
Adept II

5600x WHEA error reboots at idle CPU? RAM? MOBO? BIOS?

Hi everyone

This is my first AMD build and I'm having an issue. I was hoping someone could help, please? My build is:

ASUS B550-F Gaming Wi-Fi
BIOS 2423
AMD 5600X
EVGA RTX 2070 ultra XC
EVGA 750GQ PSU
Crucial Ballistix BL2K8G36C16U4B 3600 MHz, DDR4, DRAM, Desktop Gaming Memory Kit, 16GB (8GB x2), CL16

 

The problem I have is that I get reboots (no BSOD) when idling. The most apparent example of this is playing CSGO (FS windowed), then tabbing to the second screen and walking away. At this point CSGO runs at a lower frame rate and lower CPU usage. Then upon reboot I get the following WHEA error.

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0

This means very little to me, but I understand it is more than common and can potentially be caused by a few things. Most people tend to RMA their CPU and this solves their problem (I can do this if need be, but I can still return it to the store).
Before I do that though I wanted to be sure that it wasn't anything else. e.g.

RAM speed. 3600MHz is technically out of spec (too fast?)
Motherboard BIOS settings (all set to DOCP settings)

As there was no BSOD at all I was confused so I began to see that maybe it was Power Supply Idle Control which others had mentioned. I understand this is because at low loads the PSU thinks the computer is sleeping and kills power? I have now set it to typical and can idle with CSGO...for now.

I have the option to change my CPU in store (they have been very accommodating) but do I need to? Should I test longer? The thing is, I've never known how to "simulate" a proper idle short of just walking away for hours, and this doesn't seem to trigger it. It's almost like it has to go from really high load to nothing for it to happen.

If anyone can offer me some advice it would be very much appreciated.

Thanks in advance everyone

33 Replies
ryzen_type_r
Challenger

A couple of things:

- by default, Windows reboots after a blue screen.  Sometimes you can get a blue screen and it reboots so quickly that you don't see the blue screen, but it was actually there.  If you disable the auto reboot you'll know for sure.

- the EVGA GQ series of power supplies aren't particularly good - poor ripple control and transient response (when the PSU encounters a sudden change in load).  It may or may not be the cause of your problem, but in this case I would suspect the PSU before the CPU.  The higher end (more $$$) EVGA power supplies are a lot better (e.g., G3 or P2, I know EVGA's alphabet soup naming is confusing).  Or just get a Corsair RMx.

- 3200MHz is the max officially supported memory speed for these Ryzens. Anything higher is overclocking, and if you overclock you'd better know what all the various clocks and timings do. And Ryzen is REALLY fussy about memory.

If you know that tabbing out of CSGO can cause the crash, then that's the best way to test it.  When you get the WHEA errors, is it always a bus interconnect error?  If so, you can try bumping up the SOC voltage very slightly to see if it helps.  But out of the pieces of hardware, I'd suspect the PSU first.

Edit:  it may be that setting the power supply idle to typical might be enough.  Still, if that PSU is still in the return window I'd return it and pay more for a better unit.

 

Hi

 

Thanks for replying, I have disabled auto reboot now. I will usually get a Kernel-Power critical error after the WHEA: "The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly."

 

-I have been testing my PSU voltages, but only with a DMM, which doesn't have the sample rate to pick much up tbh, but it seems ok. It is coming up to 3 years old now and has worked very well in my previous, power hungry, 7700K system, and that didn't use its full capacity. But it is not a new design so it probably doesn't meet spec for lower power applications. I could only RMA it at this point if needed.

-I was always concerned about the RAM speed as I know SO little about timings. It was suggested to me by a sales rep and I knew I should have gone for the max officially supported. So now I am. I'm getting the 3200MHz Crucial equivalent but with 4x8GB to give me 4 ranks.

 

-Yes, the WHEA errors are always bus/interconnect. I'm not sure what it means by "bus/interconnect" but it is always this one. It seems really common when I google it, along with other people getting "Cache Hierarchy Error", but I never got those.

 

0 Likes

First let me address something that I often find mentioned in these posts.

It seems that people are particularly worried that their systems crash even while idle.

Rhetorical question:   Do you know when the processor will select the highest voltage and the maximum frequency?

Answer:   It is right when it is coming out of idle, and it is lightly loaded.   It is precisely when it only has 1 or two things to do, that it will ratchet up the speeds and run the cores and memory controller faster than it was built for.

I'm no electrician, but in life I've noticed that electric cars run faster and lights burn brighter when you up the voltage.

I've also noticed that the more Christmas lights you put on a tree the more current it consumes.

So using the basic equation: A × V = W

We can rewrite it to be :    A =  W/V

So suppose I have a Program with 5 Threads and each thread will consume 10 Amps.

Further suppose that my system can only handle 100 Watts(PPT) before burning up.

So the question becomes :   How high can the system raise Voltage, and still supply the necessary current to the cores?

or          50 Amps =  100 Watts / ?V 

If V becomes greater than 2, then 100 Watts / V on the right side drops below the 50 Amps needed on the left.

So running 5 threads requiring 50 Amps, with a power limit of 100W, means you can't raise the voltage above 2 volts.

 

Now let's take a look at what happens when the load on the system doubles. You now need to run 10 threads each requiring 10 Amps, for a total workload of 100 Amps. Your system still can only handle a power loading of 100W.

Now how much  voltage can your system apply and still supply the 100Amps?

100Amps = 100W/ ?V

Answer:  In order to handle the increased workload the system must drop the voltage to no higher than 1 V.

Understand that the voltage will later determine what frequency the cores will run at.

So we have shown, that the higher the load on the system, the more current(A) will be required, and the corresponding Voltage(V) will have to be lowered if the Power(PPT(W)) remains constant.

You can run more things concurrently, but you have to run them at a lower voltage and thus more slowly (lower frequency)

Warning:  There are overclockers out there that would have you just raise PPT watts.    However, they are totally ignoring TDP.

And I see the hands going up in the back of the class. Oh Professor, Intel and AMD say you can run their processors with a higher wattage than TDP. True, but that is only temporary. You have to understand that TDP represents the amount of heat that thermodynamically can pass through the materials of the processor itself (silicon die, TIM/solder, metal heat spreader). Understand that the heat is generated in the die(s); it then has to flow through the thermal interface material and get into the heat spreader. It is only after it gets into the heat spreader that it can pass through your TIM, enter your heatsink and then get carried away by air or water currents. There is nothing that you can do external to the processor that can improve on its TDP. You cannot extract heat that has not yet made its way through the internal materials.

The TDP  for the 5900X and 5950X is 105 Watts.  The TDP for 5600X is 65W.  That is an internal bottleneck, like it or not.   Your great cooling solution, doesn't matter here.

There was quite a ruckus a number of years ago over whether Intel was using a cheap TIM or solder for its internal materials.

===========

One more thing, run your memory no faster than 3200MHz. That is the maximum officially supported speed of the internal memory controller. It doesn't matter that the memory you bought is specified for 3600MHz, just run it without DOCP at 3200MHz.

People tend to ignore memory speed. They feel that just because they plunked down a chunk of change for their sticks, they should be allowed to ignore the specifications of the processor. They fail to understand that the memory subsystem needs time to complete its loads and stores. If not given enough time, memory corruption occurs. The vast majority of these crashes are due to the memory subsystem not having enough time. It is a shame that more people don't use error correcting memory. If they were to utilize it more, they would see the memory errors more quickly, often before the system was brought down. They want to run fast and care less about being reliable. That's a shame because with error correcting memory, one can see exactly how far one can push his/her sticks. They don't have to guess like the people using non-ECC.

====

Back to the topic:

If you have a system that boots and stays up for minutes at a time, then you are pretty close to stable.  Only the correct minor adjustments need be applied.

It is possible that at low frequencies you need a bit more voltage. I would run the normal curve but with the slightest addition to voltage. So set VCore from Auto to Normal. Set the differential field to the lowest positive setting, +0.006V

Also set VSoc from Auto to Normal.  Set the differential field to the lowest positive setting +0.006V

That should boost the voltage curve at both the low end and the High end.   However we did not want to boost it at the high end. 

High voltage is enticing the system to boost the frequencies of the cores beyond what the memory controller can handle. So to discourage the system from boosting to higher frequencies, we will lower the power a bit.

Set PBO to Advanced.    Set Limits to Manual.   Set PPT down on your 5600x from 88W to 77W

Don't worry 77W is still plenty high since your processor is only rated to dissipate 65W (TDP) of heat.

By lowering PPT, the system must keep the voltage lower to assure ample current. With the lower voltage, the system will select lower frequencies and you should not get the memory corruption. (But make sure you run memory at spec, 3200MHz.)

Not OP, but this seems to have solved it for me. Was getting 50-60ºC just browsing the Internet and over 85ºC when gaming and crashing after a few minutes. Now holding in low 70s when playing CP2077, which is quite demanding and the BSOD seems to have stopped.

My issue was also entirely unrelated to memory, as I never in my life have felt the need to OC RAM.

0 Likes

I once got a random reboot when OCing memory, was at 3667MHz 16-18-18-don't_remember.

Now I'm running 3200@14-16-16-36-56 and it's perfectly fine. 

So yeah, set your RAM to 3200, fclk to 1600 and tighten the timings instead of going for frequency. 

The guy up above has some good ideas, but since it's a huge sheet of copypaste I wouldn't blindly trust every word without testing. 

Just happened again, Kernel-power 41 followed closely by WHEA error 18 (bus/interconnect) just when left after a csgo game.  Getting some 3200MHz tomorrow and returning the 3600MHz just to be safe.  What does bus/interconnect refer to?  My IMC and/or RAM?

0 Likes

Most likely your Infinity Fabric? By default it runs off the RAM clock, e.g. 1600MHz FCLK for DDR4-3200 RAM, 1800MHz for DDR4-3600. But the latter is over spec, so it may be problematic. It might also be faulty RAM by itself, but that's very unlikely.

There are known instances of WHEA being caused by faulty CPUs too, though, so it's not like anyone can guarantee it's a RAM-only problem.
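
For reference, the FCLK figures quoted above follow from the memory clock being half the DDR transfer rate. A minimal sketch, assuming the default 1:1 coupling between FCLK and memory clock (the function name is just for illustration, not a real tool or API):

```python
def fclk_for_ddr_rate(ddr_rate_mts: int) -> float:
    """FCLK in MHz, assuming it runs 1:1 with the memory clock.

    DDR memory transfers twice per clock, so the memory clock
    is half the advertised transfer rate.
    """
    return ddr_rate_mts / 2


for rate in (3200, 3600):
    print(f"DDR4-{rate}: FCLK {fclk_for_ddr_rate(rate):.0f} MHz")
```

So moving from a 3600 kit to a 3200 kit drops the fabric clock from 1800MHz to 1600MHz along with the memory speed, unless FCLK has been set manually in the BIOS.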

Thanks, I knew I shouldn't have gone near OCing RAM. It's not my thing lol. The FCLK is 1800MHz atm with the RAM so it could be that. I'll replace the RAM and keep testing, otherwise I guess it's CPU or PSU. I tried to understand that long post but a lot of it confused me, though that might be because it's late. I swear I've read that before somewhere though. I can replace the CPU if necessary, the retailer will let me.

0 Likes

Quick question guys.  If I use 3200MHz do I still have to use a DOCP or will it automatically register it?

Depends. There are JEDEC specifications for 3200MHz RAM, but 99% chance is that ordinary consumer RAM would be DOCP 3200, JEDEC 2666. Not big deal. Try defaults first, even if slow, see if you get WHEA, then go for DOCP. 

Actually, if it works fine without errors, I strongly recommend manual timings. XMP/DOCP tend to use rather loose timings, like 16-18-18 for DDR4-3200, which is often much slower than the memory can do. I have a Crucial Ballistix 2x32GB kit, DDR4-3200 default (DOCP), 16-18-18-36-72, and at those timings it could've done 3667MHz if not for WHEA, maybe even 3800. At 3200 it's a-ok with 14-16-16-36-56. But! Before you try anything, first of all, make sure you fixed the problem.
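
To put rough numbers on "tighten the timings instead of going for frequency", here is a small sketch that converts CAS latency from clock cycles to nanoseconds. It only looks at first-word CAS latency and ignores bandwidth and every other timing, so treat it as a rule of thumb rather than a benchmark; the function name is made up for this example.

```python
def cas_latency_ns(ddr_rate_mts: int, cl: int) -> float:
    """CAS latency in nanoseconds.

    The memory clock is half the DDR transfer rate, so one clock
    cycle lasts 2000 / ddr_rate_mts nanoseconds.
    """
    return cl * 2000 / ddr_rate_mts


for rate, cl in ((3200, 16), (3600, 16), (3200, 14)):
    print(f"DDR4-{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
```

By this measure 3200 at CL14 comes out slightly ahead of 3600 at CL16 on latency, while staying within the officially supported memory speed.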

So I did not get my RAM today but tomorrow now.  Well that's not entirely true as they delivered some RAM and it was the right ones but....they forced it through my letterbox and trashed it.  I have a parcel box.... Oh well.  Anyway, interesting developments.

- After disabling auto-restart it still just rebooted, so definitely no BSOD.

- Left it to idle for a few hours with nothing running...it did not crash.

It only crashes after a few competitive CSGO matches and then I go idle on the second screen.

0 Likes

Just as a point of reference, I'm running a 5800X on a B550-F non-Wifi (so almost the same mobo as yours) BIOS 2407, EVGA Supernova 750 G3, and 2x16GB Corsair LPX 3200 C16 (dual rank), and the system is rock solid.

Have you tried disconnecting the second screen and see if you can get it to crash?  Or running the game in FS Exclusive? Just to rule out any weird driver behavior, altho in my experience video driver issues usually cause blue screens and not sudden resets.

While you're waiting for the new memory, try bumping up SOC voltage in small increments and see if it stabilizes.  Don't go more than 1.1V total though.

You mentioned your previous 7700K system was fine with the same PSU, but did that system also have the 2070?  In my mind, I'm still suspicious that it might be a PSU transient response issue.

If that were my system I'd try something like running Furmark and Prime95 at the same time to max out power draw, then kill both processes at the same time and see if the problem manifests.

Does the 2070 have 2 PCIe power plugs, and if so, are you running both off the same cable?  If so, try separate cables to the PSU.

 

Well it is good to know the motherboard isn't systemically faulty or anything hehe.  Not unless they really messed up adding that wi-fi and bluetooth

I have checked and on auto the SoC voltage is already set to 1.1V, which I thought was a bit high, but as it was auto I left it. I've had the 2070 for almost 3 years now and it was solid with the 7700K, and I've always run two separate PCI-E power cables for the two connectors, no horrible y-connectors. I will try Furmark but I have been playing other games (Doom Eternal, Spyro Reignited) and these can top out my GPU usage and cause no problems at all. None of these came close to using max CPU draw though, but I've played games while doing Handbrake encodes with no issues. I have had PSU failures before but previously they all failed at high loads, not low ones. And it's not instant either, it can be up to 2-3 minutes from tabbing out and as low as 10 seconds. I'm reluctant to think GPU issue either as, like you say, this is usually a BSOD (driver failure and recovered etc.). I didn't use the extra CPU 4-pin on this board as I'm not overclocking the CPU, just the standard 8-pin, but it wouldn't be that, would it?

0 Likes

I'd plug the 4-pin ATX in as a 'best practice' and to eliminate that as a possible cause.

So you're only getting the restarts after tabbing out of CSGO?  You can tab out of any other game and it's fine?  The restarts never happen any other time?

Shot in the dark: check the Nvidia release notes for your Geforce driver version to see if there's a known issue with CSGO (hey, you never know).  Also you can try playing around with the Nvidia control panel 3D settings for CSGO.  There's a power management mode setting, by default it's Optimal Power which basically drops GPU and RAM clocks when the GPU isn't that busy - try setting it to Prefer Maximum Performance, see if that helps.  You can also try putting a frame rate limit on the game (just thinking about that Amazon game that supposedly is frying RTX's because it's running at 800fps at the menu screen lol).

I'd also try using HWinfo for logging, then go thru the log files and see if anything acts wonky just before the restarts happen.
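
Going through a long sensor log by eye is tedious, so here is a rough sketch of one way to do it in Python: compare the last few rows before the restart against the earlier readings and flag any column that moved noticeably. Everything here is an assumption for illustration: the column names in the sample are hypothetical, the 10% threshold is arbitrary, and a real HWiNFO CSV export may need its encoding or any trailing non-data rows cleaned up first.

```python
import csv
import io
import statistics


def tail_deviations(csv_text, tail_rows=5, threshold=0.10):
    """Flag numeric columns whose last readings differ from the earlier baseline.

    Returns {column: (baseline_median, tail_mean)} for every column whose mean
    over the last `tail_rows` rows differs from the median of all earlier rows
    by more than `threshold`, expressed as a fraction of the baseline.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    flagged = {}
    if len(rows) <= tail_rows:
        return flagged  # not enough data to form a baseline
    for column in rows[0].keys():
        try:
            values = [float(row[column]) for row in rows]
        except (KeyError, TypeError, ValueError):
            continue  # skip non-numeric columns such as date/time
        baseline = statistics.median(values[:-tail_rows])
        tail = statistics.mean(values[-tail_rows:])
        if baseline != 0 and abs(tail - baseline) / abs(baseline) > threshold:
            flagged[column] = (baseline, tail)
    return flagged


# Made-up sample log: Vcore sags in the last two rows, package power does not.
lines = ["Time,Vcore [V],CPU Package Power [W]"]
lines += [f"12:00:{i:02d},1.20,40.0" for i in range(10)]
lines += ["12:00:10,0.95,40.0", "12:00:11,0.95,40.0"]
sample = "\n".join(lines)

print(tail_deviations(sample, tail_rows=2))
```

A log with nothing flagged doesn't prove the hardware is fine, since a fast transient can easily fall between polling intervals, but a flagged column gives you somewhere to start looking.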

 

 

just to add my two cents of experience.

my system specs are:

5600x

2x 8gb 3600mhz cl14 ram

asus strix x570 motherboard

6900xt gpu

when i play BF5 or rogue company, my 300w gpu runs in a low power state only drawing around 75-100w and a clock speed of only 500-800mhz. this cause a gpu driver crash that i have seen reboot my system with no visible BSOD. 

as you was talking about CS:GO and the issue arising when closing the game or alt + tabbing to second monitor. 

i dunno if it is even possible that this is a connected issue with power states though.

I have it on max performance atm and the screens are on their native 1080p 165Hz. Yeah, it's really weird that it only happens in CSGO while transitioning to idle, but CSGO does something the others do not. When I select FS windowed in my other games they just render as normal when I am working on other screens. However, CSGO in FS windowed seems to drop fps to 19 when tabbed out, whether in a game or the menu (it has always done this going back to launch, it never seemed an issue, and my CPU/GPU usage goes down). I only have Doom Eternal, Spyro, SWTOR and CSGO on my system atm and only CSGO crashes it, but only CSGO has that fps drop when in another window. CSGO has never particularly taxed my GPU though, it tends to put more pressure on my CPU, but SWTOR does this too and it's notoriously CPU bound, no crashes. I do have to say though that this doesn't happen when just idling with CSGO, it's only noticeable after playing a game or two of competitive. It's a high to low transition, so the things you mention fit, such as PSU transients, clock timings and FCLK not being stable at 1800MHz.

 

My GPU draws approx. 35-40W at an idle 1410MHz according to afterburner (HWiNFO matches this), so it's low but it's always transitioned these power states fine in the past

0 Likes

i was meaning more  your AMD CPU having power issues and my AMD GPU having power issues somehow being related.

I don't know too much about over clocking but for BF5 i overclocked my GPU to run at what it say on the box for boost clock speed and that brought my GPU power draw up to 170-250w and stopped all my crashes.

is there some way to lock clock speed on your CPU to run the 4.6ghz boost speed without fluctuating? 

0 Likes

Well it just crashed again and of course I didn't have HWiNFO running. argh!!  But...it wasn't at idle, I was in firefox to read the forums while csgo was on the other screen at the menu. Hadn't played for over 10 mins either.

0 Likes

Well it just did it again and not really at idle, I did have CSGO in the background again and had literally just stood up. HWiNFO was logging this time and there are no sudden/strange changes I can see from the logs, all voltages were stable (didn't have DRAM voltage there). The cores were in low power states but had been for some time. Shall I add the csv file to the thread? Actually, can I attach files at all?

0 Likes

Yeah this one's a head scratcher.  I mean it's only happening under a very very specific set of circumstances, tough to pin this one down.

Things to try if you haven't already: running just a single screen (assuming CSGO can be minimized), bumping mem voltage to 1.4V, updating audio and network drivers (network drivers on the Asus site aren't the latest, d/l from Intel's site instead).

Maybe ask around on a CSGO forum to see if anybody else has experienced this?

 

0 Likes

Well it happens at 3200MHz too so I'm guessing not RAM. So that leaves CPU, mobo, GPU and PSU. I do have a second 5600X now to try but I'm not sure if it's the issue. CSGO forums mention restarts but from what I read it's when they are playing heavily and the culprit is the PSU. One thing I did notice is my BIOS doesn't seem to monitor my DRAM voltage on the right side hardware monitor (middle bit) and I cannot monitor it anywhere in Windows either. Is this normal? Is it a Crucial thing and should I have gone Corsair?

I think windows update updated all of my ethernet drivers and I update my realtek from ROG forums.

0 Likes

Set Fixed Clock for CPU, or, replace it.

The ZEN 3 Architecture is broken. Start RMA asap or ask for a replacement at the store you got the combo and move to Intel.

Ryzen 5000 Crashes (WHEA Errors) will get a "Silen... - AMD Community

 

0 Likes

I do hope it's not systemic.  But now I am getting irql not less or equal, failed bugcheck and disconnected ethernet driver.

0 Likes

My system's the same - the BIOS doesn't display measured DRAM voltage, and HWinfo doesn't have a reading for it either.  Auto should be fine in BIOS if running at JEDEC speeds, but if you enable DOCP it should detect 1.35V and should display that voltage in yellow.  Or you can always key in the voltage manually if you want to be sure.  That's the problem with mobo Auto settings, they often don't tell you what setting is actually chosen so you have no way of quickly verifying.

So now you're getting blue screens?  Was this after a Windows Update?  If WU updated your NIC drivers and now you're having problems with your NIC, then there's your answer LOL.  Uninstall the update.

I disable WU driver updates using Group Policy in all my Windows installs and use manufacturer drivers instead.

gpedit.msc -> Computer Configuration -> Windows Components -> Windows Update -> 'do not include drivers with Windows Update' -> set to Enabled

The latest Intel drivers for I225-V work fine.

 

0 Likes

Thanks, good to know it's not just my board that doesn't monitor voltages. Phew. The bluescreen was super random, I was using VLC at the time. Wasn't after an update afaik. The ONLY other time I had an "irql_not_less_or_equal" it turned out to be the GPU. But that was on an Intel. I'm on 4 sticks of Crucial 8GB 3200MHz now, 4x1R. Should the timings be identical between each channel? Because my B channel has 2 secondary timings that differ slightly (tRDWR and tWRRD). Also I got the Windows update KB5005565 today but that was after the crash, and I have replaced my main drivers with the manufacturers' versions, they do seem more up to date. Also disabled Windows driver updates. Can I ask what RAM kit you're using please? I've had trouble finding Corsair 2x16GB that I could be sure were 2R so I assumed it would be ok if I used 4 DIMMs for 2 ranks per channel. I might have to try this second 5600X I have.

 

0 Likes

Try a utility called BlueScreenView, it can help steer you towards what driver might have caused the blue screen.

I've got that same glitch with slightly different timings between channels with tRDWR, but not with tWRRD.  My system is stable though so I don't worry about it.

This is the kit I'm running:

https://www.corsair.com/us/en/Categories/Products/Memory/VENGEANCE-LPX/p/CMK32GX4M2E3200C16

I didn't care about ranks when I bought the kit, they just turned out to be dual rank (shown in CPU-Z) when I was checking things.

 

BlueScreenView says it was linked to ntoskrnl.exe which as I understand it is part of the NT Kernel. No system resource violations so I assume some other driver was using this windows process.  I was using VLC at the time.

0 Likes

I think this was all me. At least I hope it was. I shanked it and applied too much pressure to the heatsink. It did not look pretty. Whoops.

0 Likes

Well it crashed with IRQL when it reached the stress test part of the diagnostic in CTR 2.1. Then it boots up and does a stability test with said CTR 2.1 and passes...with the stress test. Anyway, vast improvement because it crashed constantly with this last night with WHEAs (cache and bus). Just IRQL now, hopefully a driver.

0 Likes

wait have you got your ram in ports next to each other? my pc wont boot unless i use B_1 and B_2 (2nd away from cpu socket and 4th away)

0 Likes

I have all four populated at the moment. They seem to be ok (fingers crossed) so far.  If you have two DIMMS you should be using A2 and B2 afaik.  If they only work on one channel then that is an issue with the mobo, cpu or RAM itself I imagine.

0 Likes

ah ok not to worry about that then. just i get like a bios error telling me to remove and replace the sticks if i have them in the other 2 ports, specifically saying it isn't "optimal"

 

IMG_20210919_120105.jpg

0 Likes

This is interesting but I have bus/interconnect errors and not the cache hierarchy ones. Does it matter? Is it the same?

0 Likes