This is my first AMD build and I was having an issue. I was hoping someone could help please? My build is
ASUS B-550F Gaming Wi-Fi
EVGA RTX 2070 ultra XC
EVGA 750GQ PSU
Crucial Ballistix BL2K8G36C16U4B 3600 MHz, DDR4, DRAM, Desktop Gaming Memory Kit, 16GB (8GB x2), CL16
The problem I have is that I get reboots (no BSOD) when idling. The most apparent example of this is playing CSGO (FS windowed) and then tabbing to the second screen and walking away. At this point CSGO is running at a lower frame rate and lower CPU usage. Then upon reboot I get the following WHEA error.
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
This means very little to me, but I understand this is more than common and can potentially be caused by a few things. Most people tend to RMA their CPU and this solves their problem (I can do this if need be, and I can still return it to the store).
Before I do that though I wanted to be sure that it wasn't anything else. e.g.
RAM speed. 3600MHz is technically out of spec (too fast?)
Motherboard BIOS settings (all set to DOCP settings)
As there was no BSOD at all I was confused so I began to see that maybe it was Power Supply Idle Control which others had mentioned. I understand this is because at low loads the PSU thinks the computer is sleeping and kills power? I have now set it to typical and can idle with CSGO...for now.
I have the option to change my CPU in store (they have been very accommodating) but do I need to? Should I test longer? The thing is I've never known how to "simulate" a proper idle short of just walking away for hours, and this doesn't seem to trigger it. It's almost like it has to go from real high load to nothing for it to happen.
If anyone can offer me some advice it would be very much appreciated.
Thanks in advance everyone
A couple of things:
- by default, Windows reboots after a blue screen. Sometimes you can get a blue screen and it reboots so quickly that you don't see the blue screen, but it was actually there. If you disable the auto reboot you'll know for sure.
- the EVGA GQ series of power supplies aren't particularly good - poor ripple control and transient response (when the PSU encounters a sudden change in load). It may or may not be the cause of your problem, but in this case I would suspect the PSU before the CPU. The higher end (more $$$) EVGA power supplies are a lot better (e.g., G3 or P2, I know EVGA's alphabet soup naming is confusing). Or just get a Corsair RMx.
- 3200MHz is the max officially supported memory speed for these Ryzens. Anything higher is overclocking, and if you overclock you'd better know what all the various clocks and timings do. And Ryzen is REALLY fussy about memory.
If you know that tabbing out of CSGO can cause the crash, then that's the best way to test it. When you get the WHEA errors, is it always a bus interconnect error? If so, you can try bumping up the SOC voltage very slightly to see if it helps. But out of the pieces of hardware, I'd suspect the PSU first.
Edit: it may be that setting the power supply idle to typical might be enough. Still, if that PSU is still in the return window I'd return it and pay more for a better unit.
Thanks for replying, I have disabled auto reboot now. I will usually get a Kernel-Power critical error after the WHEA, "The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly."
-I have been testing my PSU voltages but only with a DMM, which doesn't have the sample rate to pick much up tbh, but it seems ok. It is coming up to 3 years old now and has worked very well in my previous, power hungry, 7700k system, and that didn't use its full capacity. But it is not a new design so probably doesn't meet spec for lower power applications. I could only RMA it at this point if needed.
-I was always concerned about the RAM speed as I know SO little about timings. It was suggested to me by a sales rep and I knew I should have gone for the max officially supported. So now I am. I'm getting the 3200MHz Crucial equivalent but with 4x8GB to give me 4 ranks.
-Yes the WHEA errors are always bus/interconnect. I'm not sure what it means by "bus/interconnect" but it is always this one. It seems really common when I google it, along with other people getting "Cache Hierarchy Error", but I never got those.
First let me address something that I often find mentioned in these posts.
It seems that people are particularly worried that their systems crash even while idle.
Rhetorical question: Do you know when the processor will select the highest voltage and the maximum frequency?
Answer: It is right when it is coming out of idle, and it is lightly loaded. It is precisely when it only has one or two things to do that it will ratchet up the speeds and run the cores and memory controller faster than it was built for.
I'm no electrician, but in life I've noticed that electric cars run faster and lights burn brighter when you up the voltage.
I've also noticed that the more Christmas lights you put on a tree the more current it consumes.
So using the basic equation : A X V = W
We can rewrite it to be : A = W/V
So suppose I have a Program with 5 Threads and each thread will consume 10 Amps.
Further suppose that my system can only handle 100 Watts(PPT) before burning up.
So the question becomes : How high can the system raise Voltage, and still supply the necessary current to the cores?
or 50 Amps = 100 Watts / ?V
If V becomes greater than 2, then the right side of the equation will not match 50 Amps needed on the left.
So running 5 threads requiring 50 amps, with a power limit of 100W, means you can't raise the voltage above 2 volts.
Now let's take a look at what happens when the load on the system doubles. You now need to run 10 threads each requiring 10 amps for a total workload of 100 Amps. Your system still can only handle a power loading of 100W.
Now how much voltage can your system apply and still supply the 100Amps?
100Amps = 100W/ ?V
Answer: In order to handle the increased workload the system must drop the voltage to no higher than 1 V.
Understand that the voltage will later determine what frequency the cores will run at.
So we have shown, that the higher the load on the system, the more current(A) will be required, and the corresponding Voltage(V) will have to be lowered if the Power(PPT(W)) remains constant.
You can run more things concurrently, but you have to run them at a lower voltage and thus more slowly (lower frequency)
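To make the arithmetic above easy to replay, here is a minimal Python sketch of the same simplified model (V = W / A). The function name `max_voltage` is made up for illustration, and the 10 amps per thread and 100W limit are just the example numbers from this post, not measured values. It says nothing about how a real Ryzen actually picks voltage and frequency.

```python
def max_voltage(power_limit_w: float, threads: int, amps_per_thread: float) -> float:
    """Highest voltage that still fits the power budget, from V = W / A.

    Simplified model from the post: every thread draws a fixed current
    and the only constraint is the total power limit (PPT).
    """
    total_amps = threads * amps_per_thread
    if total_amps <= 0:
        raise ValueError("total current must be positive")
    return power_limit_w / total_amps


if __name__ == "__main__":
    # Example numbers from the post: 100W limit, 10 amps per thread.
    for threads in (5, 10):
        volts = max_voltage(100, threads, 10)
        print(f"{threads} threads -> at most {volts} V")
```

Doubling the thread count halves the voltage ceiling in this model, which is the whole point of the example.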
Warning: There are overclockers out there that would have you just raise PPT watts. However, they are totally ignoring TDP.
And I see the hands going up in the back of the class. Oh Professor, Intel and AMD say you can run their processors with a higher wattage than TDP. True, but that is only temporary. You have to understand that TDP represents the amount of Heat that Thermodynamically can pass through the materials of the Processor itself (Ceramic Die, TIM/Solder, Metal Heat Spreader). Understand that the Heat is generated in the Die(s), it then has to flow through the Thermal Interface Material, and get into the Heat Spreader. It is only after it gets into the Heat Spreader that it can pass through your TIM, enter Your Heatsink and then get carried away by Air or Water currents. There is nothing that you can do external to the processor that can improve on its TDP. You cannot extract Heat that has not yet made its way through the internal materials.
The TDP for the 5900X and 5950X is 105 Watts. The TDP for 5600X is 65W. That is an internal bottleneck, like it or not. Your great cooling solution, doesn't matter here.
There was quite a ruckus a number of years ago over whether Intel was using a cheap TIM or solder for its internal materials.
One more thing, run your memory no faster than 3200MHz. That is the maximum speed of the Internal Memory controller. It doesn't matter that the memory you bought is specified for 3600MHz, just run it without DOCP at 3200MHz.
People tend to ignore memory speed. They feel that just because they plunked down a chunk of change for their sticks, they should be allowed to ignore the specifications of the processor. They fail to understand that the memory subsystem needs time to complete its loads and stores. If not given enough time, memory corruption occurs. The vast majority of these crashes are due to the Memory subsystem not having enough time. It is a shame that more people don't use Error correcting memory. If they were to utilize it more, they would see the memory errors more quickly, often before the system was brought down. They want to run fast and care less about being reliable. That's a shame because with Error correcting memory, one can see exactly how far one can push his/her sticks. They don't have to guess like the people using Non-ECC.
Back to the topic:
If you have a system that boots and stays up for minutes at a time, then you are pretty close to stable. Only the correct minor adjustments need be applied.
It is possible that at low frequencies you need a bit more Voltage. I would run the normal curve but with the slightest addition to voltage. So set VCore from Auto to Normal. Set the differential field to the lowest positive setting, +0.006V
Also set VSoc from Auto to Normal. Set the differential field to the lowest positive setting +0.006V
That should boost the voltage curve at both the low end and the High end. However we did not want to boost it at the high end.
High voltage is enticing the system to boost the frequencies of the cores beyond what the memory controller can handle. So to discourage the system from boosting to higher frequencies, we will lower the power a bit.
Set PBO to Advanced. Set Limits to Manual. Set PPT down on your 5600x from 88W to 77W
Don't worry 77W is still plenty high since your processor is only rated to dissipate 65W (TDP) of heat.
By lowering PPT, the system must keep the voltage lower to assure ample current. With the lower voltage, the system will select lower frequencies and you should not get the Memory corruption. (But make sure you run Memory at spec (3200MHz).)
I once got a random reboot when OCing memory, was at 3667MHz 16-18-18-don't_remember.
Now I'm running 3200@14-16-16-36-56 and it's perfectly fine.
So yeah, set your RAM to 3200, fclk to 1600 and tighten the timings instead of going for frequency.
The guy up above has some good ideas, but since it's a huge sheet of copypaste I wouldn't blindly trust every word without testing.
Just happened again, Kernel-power 41 followed closely by WHEA error 18 (bus/interconnect) just when left after a csgo game. Getting some 3200MHz tomorrow and returning the 3600MHz just to be safe. What does bus/interconnect refer to? My IMC and/or RAM?
Most likely your Infinity Fabric? By default it functions based off of RAM clock, e.g. 1600MHz FCLK for 3200DDR RAM, 1800MHz for 3600. But the last one is over spec, so may be problematic. Also might be faulty RAM by itself, but very unlikely.
There are known instances of WHEA being caused by faulty CPUs too, though, so it's not like anyone can guarantee it's a RAM-only problem.
Thanks, I knew I shouldn't have gone near OCing RAM. It's not my thing lol. The FCLK is 1800MHz atm with the RAM so it could be that. I'll replace the RAM and keep testing, otherwise I guess it's CPU or PSU. I tried to understand that long post but a lot of it confused me, though that might be because it's late. I swear I've read that before somewhere though. I can replace the CPU if necessary, the retailer will let me.
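If it helps to see the FCLK relationship written out, here is a rough Python sketch. It assumes the default 1:1 behaviour described above (FCLK is half the DDR transfer rate) and uses the 3200 figure quoted earlier in this thread as the officially supported maximum. The names `fclk_for_ddr` and `OFFICIAL_MAX_DDR` are made up for the example, and your BIOS may let you set FCLK independently.

```python
OFFICIAL_MAX_DDR = 3200  # max officially supported speed, as quoted in this thread


def fclk_for_ddr(ddr_rate: int) -> float:
    """FCLK in MHz when it follows the RAM 1:1 (half the DDR rate)."""
    return ddr_rate / 2


def is_over_spec(ddr_rate: int) -> bool:
    """True when the DDR rate is above the officially supported maximum."""
    return ddr_rate > OFFICIAL_MAX_DDR


if __name__ == "__main__":
    for rate in (3200, 3600):
        note = "over spec" if is_over_spec(rate) else "within spec"
        print(f"DDR4-{rate}: FCLK {fclk_for_ddr(rate):.0f} MHz ({note})")
```

So the 3600 kit on DOCP puts FCLK at 1800MHz, which matches the setting that is being questioned here.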
Depends. There are JEDEC specifications for 3200MHz RAM, but there's a 99% chance that ordinary consumer RAM would be DOCP 3200, JEDEC 2666. Not a big deal. Try defaults first, even if slow, see if you get WHEA, then go for DOCP.
Actually, if it works fine without errors, I strongly recommend manual timings. XMP/DOCP tend to use rather loose timings, like 16-18-18 for 3200DDR, which is often much slower than the memory can do. I have a Crucial Ballistix 2x32GB kit, 3200DDR default (DOCP), 16-18-18-36-72, and at those timings it could've had 3667MHz if not for WHEA, maybe even 3800. At 3200 it's a-ok with 14-16-16-36-56. But! Before you try anything, first of all, make sure you fixed the problem.
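For a rough idea of why tighter timings at 3200 can hold up against a higher frequency, here is a small Python sketch using the usual CAS latency conversion (CL cycles times 2000, divided by the DDR rate, gives nanoseconds). It only looks at the first timing (CL) and ignores the rest, so treat it as a ballpark comparison and not a benchmark. The function name `cas_latency_ns` is made up for the example.

```python
def cas_latency_ns(ddr_rate: int, cl: int) -> float:
    """CAS latency in nanoseconds.

    The memory clock is half the DDR rate, so one cycle lasts
    2000 / ddr_rate nanoseconds, and CL counts those cycles.
    """
    return cl * 2000 / ddr_rate


if __name__ == "__main__":
    # Settings mentioned in this thread: (DDR rate, CL).
    for rate, cl in ((3200, 16), (3600, 16), (3200, 14)):
        print(f"DDR4-{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
```

By this measure 3200 at CL14 comes out slightly ahead of 3600 at CL16, without pushing FCLK past 1600MHz.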