PC Processors

robertbruce · ‎09-09-2020

Hi,

I'm having the following entries pop up in my event log, mostly when idling:

As far as I can tell, it started yesterday for no apparent reason.

My specs are:

Ryzen 3900X @ Stock

ASUS RoG Strix x570-E

4x8 GB G.Skill F4-3600C16-8GVKC (multiple configurations tested)

RTX 2070 Super on slot 1

Got 1 NVMe SSD, an old SATA one and 2 SATA hard drives.

Built the whole thing sometime in late June, so it's pretty recent too. The PC is not powered on 24/7 (far from it), and I left it at stock except for XMP RAM profiles.

I'm not getting any crashes, all of these errors are "corrected hardware errors" and just get logged while not affecting anything (my benchmark scores are unchanged too).

The person here seemed to have a similar issue but no one answered: WHEA: Cache Hierarchy Error

The most helpful advice I've seen is to actually RMA the CPU. That would be a huge bummer, seeing how recent it is, how careful I've been with it, and how much fun I was having with the build until now. It's my very first Ryzen build, by the way.

I tried:

Stock RAM settings @ 2132 MHz -> No effect
Manual RAM settings @ 2400 MHz with very loose timings and 1.35 V -> No effect
SoC voltage at 1.15 V instead of 1.1 V -> No effect
VDDG at 1000 mV with SoC voltage at 1.15 V -> No effect
CPU VRM load line calibration set to level 3 -> No effect
Mild positive CPU voltage offset (+0.03 V) -> No effect
Unplugging all my USB devices -> No effect
Update to the latest BIOS -> No effect, except my idle voltages seem to be even lower now
Use USB Flashback to install BIOS 1408 -> No effect (had high hopes for that one)

I ruled out everything that could be Windows-related (Ryzen power plans, chipset & GPU drivers, ...), because booting from an old Kali Linux USB drive I had around yields the same L1 cache error in syslog. Not as many as the Windows logs get, but still. And that's what worries me the most: I was hoping for a Windows issue.

It also always seems to be CPU 6 and 18 under Linux (which could be the same physical core?).
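Whether two logical CPUs such as 6 and 18 share a physical core can be checked from Linux itself. The sketch below reads the standard sysfs file `thread_siblings_list`, which lists the logical CPUs that sit on the same core. The function names (`parse_cpu_list`, `thread_siblings`) are made up for illustration, and nothing here confirms what the result would be on the machine described above.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list such as "6,18" or "0-3,12" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # Ranges are inclusive on both ends, e.g. "0-3" means 0, 1, 2, 3.
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_siblings(cpu: int, sysfs_root: str = "/sys/devices/system/cpu") -> list[int]:
    """Return the logical CPUs that share a physical core with `cpu` (Linux only)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    # Only query sysfs when the file exists (Linux with at least 7 logical CPUs).
    probe = Path("/sys/devices/system/cpu/cpu6/topology/thread_siblings_list")
    if probe.exists():
        print("CPU 6 shares a core with:", thread_siblings(6))
```

If the output for CPU 6 includes 18, both log entries point at the same physical core, which would fit the idea of a single marginal core rather than two unrelated ones.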

BTW my temps are fine: the hottest Prime95 workload only takes me to about 75°C, and it yields 0 errors.

Anyone having the same entries in event log?

Thanks a lot,

robertbruce · ‎09-09-2020

I was reluctant to try this because it could be considered overclocking if you stretch the definition, but I can solve the issue by setting a fixed CPU vCore of 1.3 V (haven't tried lower, which could also work) and a fixed CPU multiplier of 38.

My conclusion would be that one of the cores/CCX or whatnot is unstable with the idle voltages.

I could apply a higher positive vCore offset, but that has me worried. Also, I did try +0.065 V with no result.

I tried every memory configuration in the world, with 1 to 4 sticks present. My RAM even works fine with a SoC downclock, and I found out I could tighten the timings much more if I wanted (I left them very conservative for now).

The L1 cache errors definitely seem to be vCore-related only.

I don't know if the error is fine or if this is considered a bad unit and I should RMA. I don't think the error shows up when core voltage is at normal load levels.

It's also kind of sad that the only way to safely raise idle vCore is to use some kind of static overclock.

What's also really weird is that I had the errors with c-states AND boost both disabled.

UPDATE: It doesn't only happen at idle. I ran an OCCT stability test and sat at 77°C and 100% CPU for about 5 minutes with no errors reported by the test, but quite a few of these corrected cache hierarchy errors were logged.

I don't know what's up with that.

rumple · ‎12-06-2020

Yeah, sounds like a BIOS issue, probably the AGESA code. I'm copying and pasting something from another thread, and I'm curious if this helps. Sorry if you already tried all this:

Go into BIOS. Disable CPB (Core Performance Boost), save, reboot. Go back into BIOS.

Set the performance enhancer to level 3.

Disable “Re-Size BAR Support” Near “Above 4G Decoding”

RAM voltage 1.41 - 1.47 (make sure your RAM can do it. I believe only LPX won't push 1.45+)

Open the advanced tab and open AMD overclocking.

Select PBO here

set PBO to advanced and the limit to motherboard.

(The main thing) EDC - Set Spike VRM current limit to 200A. (later we can back this down, probably AGESA 1.2)

(Just in case) PPT - Set the socket power limit to 130W. (depends on mobo)

(Just in case) TDC - Set the vrm thermal limit to 85. (depends on mobo and proc)

EDC is a temporary massive increase, PPT here a decrease. TDC depends on your cooling of VRMS and such.

Leave at zeros all the rest in that menu

set PBO Scalar down to 1x or 2x.

Within the PBO curve optimizer in the BIOS, set the voltage magnitude adjustment for all cores to a positive value offset between 7 - 10.

Set Idle Voltage to Typical

Set Global C-states control to Disable

MAKE SURE THAT ECO MODE IS OFF.

NOW you can reboot, go into BIOS and set Core Performance Boost back to On, everything should work.

----------------- last guy used:

SoC voltage of 1.15 V, CPU voltage - auto (offset targeting around 1.37 V), VDDP - 960 mV, VDDG IOD - 1060 mV, VDDG CCD - 940 mV

mstfbsrn980 · ‎09-09-2020

"CPU VRM load line calibration set to level 3 -> No effect"
Try level 7. Or ...

Download CPU-Z, run it and start the stabilization test from it.

Read the CPU core voltage value from it.

Go to the BIOS and set a manual core voltage higher than your reading.

Save the BIOS settings and restart...

A random one of your CPU cores stops responding momentarily. This is the problem.

robertbruce · ‎09-09-2020

My board maxes to level 5, which I did try. Same errors.

You mean the stress test from CPU-Z? It gives about 1.3 V all-core, which effectively makes the error disappear.

You're probably right about one core being annoying (I actually say so in a response above, which is probably still awaiting moderation). I'm not sure if that's a valid reason for RMAing the CPU though. It's possible that I've had this issue since the beginning or since some earlier BIOS update.

mstfbsrn980 · ‎09-09-2020

CPU failure is unlikely. Before you RMA, I suggest you try 1.40 V (or 1.45 V) for vCore, just to find out what's wrong, with the VRM load line calibration at its default level.

fyrel · ‎09-09-2020

The source of WHEA errors can be really difficult to track down, especially if the system isn't actually crashing.

The fact you are getting errors in Linux too indicates it probably is hardware related.

Did you perform any updates prior to the issue starting?

Do you have any previous versions of Windows from before the events started that you could roll back to and try?

robertbruce · ‎09-09-2020

There was some .NET update but nothing major. Also, as you say, since the issue is there under Linux I think there's simply one of the cores or CCX that is very slightly unstable at idle voltages.

Maybe this is considered OK by AMD? I've still got plenty of warranty time, so maybe I should just live with it (or do a static OC, I guess).

black_zion · ‎09-09-2020

If you're not experiencing any issues, then ignore it. It could be hardware related, but it could also be BIOS or Windows related as well, and if you're not crashing or having programs result in errors, then it's not a problem you should worry about. The event viewer is full of things you shouldn't worry about, which is why it is often used by scammers to convince people they have an actual problem.

jef83 · ‎12-06-2020

https://community.amd.com/t5/drivers-software/cache-hierarchy-error/m-p/427324/highlight/false#M1348...

I solved it by disconnecting the external USB drives.

loc · ‎12-06-2020

Which BIOS version are you using? If you are running anything above 2802, I suggest you downgrade to it. Newer versions are beta for a reason and are known to cause WHEA errors.

nsargentum · ‎12-06-2020

Check out my recent post here:

https://community.amd.com/t5/processors/ryzen-5600x-system-constantly-crashing-restarting-whea-logge...

I haven't gotten into manually tuning the CPU yet, but it would be a shame if it can't run stable out of the box.

Also, I have a 3900X that has never had those issues, so it seems like some of the silicon just comes out slightly more defective and unable to function at stock values without manual tweaking.

fasa333 · ‎02-19-2021

Oh man, yeah, same problem here. It happens in any video software, and it has been driving me insane for as long as I've had it. I can't seem to find any solution, even though I've tried everything I could find. I just hope this is solvable...

WHEA / Machine check cache hierarchy error