Processors

Leindurstit · ‎12-26-2020

Greetings. Been having an issue with a new build and am close to pulling my hair out.

With the components listed below, I have been having severe stability issues if I load the CPU and GPU 100% at the same time. It can be reproduced very consistently running Furmark at 1280x720 in a window, and 7-Zip’s benchmark running with all 24 threads loaded (or at least 22.)

The system will abruptly (usually within a few seconds of starting the 7-zip benchmark) shut down completely. The motherboard still has power (the on-board power button LED remains lit—NOT the front-panel LED,) however the system cannot be powered back on or reset, without cycling the power supply switch.

Most interestingly, it seems that the problem manifests itself not based on the actual clock speed or even number of cores or threads, but SPECIFICALLY if the CPU is “100%” loaded or not. In an attempt to determine if it was a power draw issue, I tried turning SMT off (running only 12 cores,) turning a CCD off (running only 6 cores,) setting the clock speed to a static 3.7GHz, and keeping Precision Boost Overdrive off the whole while.

The system will remain completely stable so long as the CPU usage does not touch 100% (while the GPU is also loaded 100%.) CPU temperature never exceeds 80C, and GPU temperature stays around 69C (after letting it soak a bit with a lower total load.) Oftentimes, the very second I try to change 7-zip to benchmarking 24 threads (up from a previous lower number,) the shutdown will occur immediately.

I also tried turning the target TDP down on the graphics card by about 15%, and a crash would still occur when the CPU was loaded 100%. Finally, I removed half of the DDR4 DIMMs, as well as trying the system with A-XMP turned off. No help.

I have updated the BIOS to MSI’s current release (I think I was getting WHEA errors on the second-most current release, too, but that seems to have subsided for now.) Also tried changing the GPU driver to a 457.xx release, instead of the current 460.xx (the system the GPU was in worked perfectly fine with the 457.xx driver.)

The only items I’m considering at this point would be a BIOS-related issue, a bad motherboard, a bad CPU, or a dying PSU. Has anyone else experienced something similar?

Below are the specs of the system:

Ryzen 9 5900X CPU
MSI X570 Creation motherboard (with current BIOS)
Windows 10 20H2 (fresh install)
Kingston KHX2933C17D4/16G RAM, DDR4-2933, 16GB x 4 (transferred from a stable previous build)
EVGA GeForce RTX 3090 FTW3 graphics card (transferred from a stable previous build)
SeaSonic Prime Ultra 850W PSU (transferred from a stable previous build)
Noctua D15S HSF

benman2785 · ‎12-26-2020

sounds to me like an "ungood" BIOS

either use an older BIOS that supports your CPU
OR
1. clear the CMOS
2. load default settings
3. reboot
4. flash the current BIOS again
5. shut down and clear the CMOS again
6. load default settings
7. test again
if everything works = apply your OC ;)

PC: R7 2700X @PBO + RX 580 4G (1500MHz/2000MHz CL16) + 32G DDR4-3200CL14 + 144hz 1ms FS P + 75hz 1ms FS
Laptop: R5 2500U @30W + RX 560X (1400MHz/1500MHz) + 16G DDR4-2400CL16 + 120Hz 3ms FS

Leindurstit · ‎12-27-2020

Thank you for the suggestion. Unfortunately, the worst of my two current issues still appears to persist (the complete system shutdown under fully-loaded CPU and GPU.)

Interestingly enough, I was fixating so hard on that scenario that I hadn't really tried playing some video game programs to see if I would receive WHEA errors / BSODs still. Turns out I was (typically fewer than five minutes into a game like BeamNG.Drive, something where there is about 20% CPU usage total at worst, and modest graphics card use at best.)

One suggestion noted by some other folks having issues with X570 boards and Zen 3 was to turn on "Game Boost" mode in the BIOS (while leaving PBO off.)

This appears to do two things: it sets a static "overclock" and voltage on the CPU, but it also apparently may have a hand in preventing any of the cores from ever being allowed to be idle. I haven't gone back yet to see if it is "correctly" adjusting any of the other voltages (SOC, etc.) to a more-stable value compared to what the system was running before, but I am now apparently able to play video game programs for extended lengths of time without encountering any stability issues whatsoever.

The only downsides to this are the extra 30 watts of idle power consumption and the cores being stuck at only 4175 MHz, but at least the machine appears to be stable. It will still shut down under the 100%/100% CPU/GPU load scenario, but at this point I can at least use the system for something more than just web browsing. A single 100% load (prime95 OR furmark, not at the same time) is still completely stable.

Looks like some new AGESA versions are starting to trickle out to MSI boards (one dropped for an MSI B550 model on the 23rd.) Hopefully they roll this out to the X570 models as well, as this is very frustrating.

benman2785 · ‎12-28-2020

mh, it is strange that simultaneous Prime and Furmark are ok but gaming isn't
wait for the new AGESA and maybe it is fixed then

Leindurstit · ‎12-28-2020

Prime and Furmark on their own (one or the other) are fine. When I run both at the same time, that's when I'll get a full system power-off. That's what is vexing me most: if I turn off half the cores, the same thing happens when all six cores are loaded. But if I have all 12 cores enabled, I can be stressing, for instance, 11 out of 12 of them, and it'll still be fine.

Gaming seems to have been addressed with the "Game Boost" mode temporarily, until [hopefully] a new BIOS addresses that problem (or, ideally, both problems.)

benman2785 · ‎12-28-2020

mh, actually I thought it's not your PSU, as the SeaSonic isn't bad...

did you enable XMP?

Leindurstit · ‎12-28-2020

My thought exactly (or at least, my hope.) My UPS shows the PC (including the monitor, which is maybe 20-30 watts) draws just north of 700 watts under as high a synthetic load as I can possibly generate without it crashing.

As for XMP, I tried that both ways. Would still shut down under full 100% load on the GPU and CPU. At this point I have XMP on and it seems to be fine for standard video game programs (Observer Redux, Metro Exodus, BeamNG.Drive,) during which time the system draws total about 600 watts.

I'd like to rule out the PSU SOMEHOW not delivering enough wattage, but I don't want to bother buying the 1000-watt version if, at this point, the system is at least usable for typical workloads.
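For a rough sanity check on those UPS numbers, here is a minimal sketch of the headroom arithmetic. The 90% efficiency figure is an assumption for illustration (the real value depends on the exact PSU model and load), and 25 W is just the midpoint of the 20-30 watts quoted for the monitor. Note that a UPS display reports an averaged draw, so this says nothing about short transient spikes.

```python
def estimate_dc_load(wall_watts, monitor_watts, efficiency):
    """Estimate the DC load on the PSU from a wall-side (UPS) reading.

    efficiency is the assumed AC-to-DC conversion efficiency (0..1).
    """
    pc_wall_watts = wall_watts - monitor_watts  # remove the monitor's share
    return pc_wall_watts * efficiency


def headroom(psu_rating_watts, dc_load_watts):
    """Remaining capacity before the PSU's rated continuous output."""
    return psu_rating_watts - dc_load_watts


if __name__ == "__main__":
    # 700 W at the UPS, ~25 W monitor, 90% efficiency (assumed).
    dc_load = estimate_dc_load(700, 25, 0.90)
    print(f"Estimated DC load: {dc_load:.1f} W")
    print(f"Headroom on an 850 W unit: {headroom(850, dc_load):.1f} W")
```

On these assumptions the average load sits well under the 850 W rating, which is why a transient or protection-trip explanation would be the thing left to rule out rather than sustained wattage.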

elstaci · ‎12-28-2020

I noticed that your RAM is not listed as being compatible in either MSI QVL List for Vermeer or Kingston RAM FINDER (2933 Mhz) for your motherboard: https://www.kingston.com/unitedstates/us/memory/search?model=100028&devicetype=7&mfr=msi&line=mother...

In fact, there are no Kingston RAM modules listed, only Kingston's HyperX.

First I would reset the BIOS back to its factory defaults.

Then I would try using just one RAM stick to see if it continues to crash under 100% CPU load.

It is possible that your motherboard doesn't support, or isn't compatible with, all 4 DIMM slots being populated by that specific RAM. So I would start with one stick and then go to a maximum of 2 RAM sticks to see if it crashes while the CPU is at 100% load.

I recommend you use OCCT CPU Test first with Large and Medium Packet test and then with Small Packet test which is the best to check stability in the CPU. Also run the PSU Test which runs both the CPU and GPU Tests at the same time putting the maximum demand on the PSU.

At the top left corner where OCCT Settings is, put the Global Temperature at 96C. This will be one C over the Maximum Operating temperature of the processor 95C. That way OCCT will stop the test once it reaches 96C.

Keep a close eye on Temperatures and PSU Outputs and Fan speeds during the tests.

If the RAM changes don't fix the issue and you are sure the CPU crashes when it reaches 100% load, you can always lower the Maximum processor state in the Windows power plan settings from 100% to 99% or lower and see if it still crashes.

If you have done this already then I am sorry and I must have missed it in this thread.
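The maximum processor state change mentioned above can also be scripted instead of clicked through. This is only a sketch: it assumes the standard powercfg aliases SCHEME_CURRENT, SUB_PROCESSOR and PROCTHROTTLEMAX, only touches the plugged-in (AC) value, and may need an elevated prompt. The helper names are made up for illustration.

```python
import platform
import subprocess


def build_powercfg_commands(max_percent):
    """Return the powercfg invocations that cap the maximum processor state.

    Only the AC (plugged-in) value of the active power plan is changed.
    """
    if not 1 <= max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    return [
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT",
         "SUB_PROCESSOR", "PROCTHROTTLEMAX", str(max_percent)],
        # Re-activate the current scheme so the new value takes effect.
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


def apply_max_processor_state(max_percent):
    """Run the commands; refuses to run anywhere but Windows."""
    if platform.system() != "Windows":
        raise RuntimeError("powercfg is only available on Windows")
    for command in build_powercfg_commands(max_percent):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_powercfg_commands(99):
        print(" ".join(command))
```

Calling apply_max_processor_state(100) afterwards puts the setting back to its usual default.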

Kirsebaer · ‎12-29-2020

It might be your Seasonic

https://www.jonnyguru.com/forums/showthread.php?17974-SeaSonic-shutdown-issue-hot-in-Korea

Leindurstit · ‎12-29-2020

Haven't tried OCCT before. The large dataset CPU test seemed to be fine. Small shot the power consumption up to ludicrous amounts and an extreme temperature nearly immediately (240+ watts and over 100C.) Killed that right quick. I don't think I can really test with the small dataset while running the system in its sort-of-overclocked state, in which it otherwise currently appears to be quite stable, with the exception of the 100% CPU / 100% GPU synthetic load still causing a shutdown. (I guess I shouldn't be calling it "100%," though, as 7-zip's CPU test, when run on its own, never peaks the CPU beyond maybe 150 watts when running all 12 cores.)

As for the RAM--and as an aside, I guess I didn't realize Kingston had separated HyperX out as its own brand, or something--it seems to be operating normally at present, again likely related to whatever parameters the motherboard's "Game Mode" and overclock are imposing. For what it's worth, I never took those qualified/compatibility lists as an implication of "If your memory isn't on this list, it will NOT be compatible," but rather just for what it is: guaranteed compatibility, or at least "we tested this, it works with 1, 2, or 4 sticks, as indicated."

Finally, it would be quite unfortunate if Seasonic was truly at fault here. I'm still not entirely convinced, although I do have a thirstier system being powered by it now. The previous build, with an Intel i7 6850K, but with the same exact GPU, never gave me a problem like this.

pokester · ‎12-29-2020

Also test without the UPS. Plug directly into the wall. I have had them go bad before and not deliver correct power. It would be good to eliminate that variable and see whether the system still does it when plugged into the wall.

Also, you mention the system being overclocked. For the moment, reset to BIOS defaults and don't enable PBO or XMP, and especially not Game Mode, as it adds an all-core OC. Test with them off: does this problem still exist?

If it does, I would still suspect you have a power supply issue. When you run OCCT, is the voltage dropping below 12V? You can use HWiNFO, for instance, to watch the voltage in real time as it is testing.

If you have a local retailer that sells power supplies and will take a return if you don't need it, you could pick one up to test with.
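To put a number on "dropping below 12V": the ATX specification allows the +12V rail to vary by ±5%, so anything from 11.40 V to 12.60 V is nominally in spec. Below is a small sketch that checks a list of logged readings against that window. The sample readings are made up for illustration, and software sensor readings are coarse and slowly polled, so a very brief dip may never show up in a log like this.

```python
def rail_limits(nominal=12.0, tolerance=0.05):
    """Return (low, high) voltage limits for a rail with +/- tolerance."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def out_of_spec(samples, nominal=12.0, tolerance=0.05):
    """Return the readings that fall outside the allowed window."""
    low, high = rail_limits(nominal, tolerance)
    return [v for v in samples if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical readings copied from a sensor log during a stress test.
    readings = [12.25, 12.30, 12.20, 12.22, 11.30]
    low, high = rail_limits()
    print(f"+12V window: {low:.2f} V to {high:.2f} V")
    print("Out-of-spec readings:", out_of_spec(readings))
```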

Leindurstit · ‎12-29-2020

Yes, forgot to mention, I ran a few test cycles off the UPS (direct from the wall) and received the same result.

So at this point the baseline repeatable failure test with XMP off, PBO off, and even Core Boost off (so it would be stuck at the max of 3.7GHz) also led to power-off failure during a 7-zip + Furmark test. The problem especially was that, in that configuration, I would encounter WHEA BSODs in games. I can try a few more combinations of OCCT, but it looks like that can generate MORE load than what I need to kill the system.

My current situation of XMP on, PBO off, "Game Mode" on with its static overclock, has led to what feels like stability in real-world applications/games. It hasn't given me a WHEA BSOD since.

And yes, I totally would like to "borrow" a 1000W+ PSU to be able to confidently rule that out, but local generous return-policy stores seem to be in a bit of a supply shortage of such items presently (saying nothing of the three days of camping out in front of a Microcenter at 2:30 AM to finally even get the 5900X to begin with...)

For what it's worth, when testing a slow ramp-up of the number of threads tested in 7-zip, while running furmark on the GPU, I was watching the 12V input to the GPU. It never dropped below 12 (usually stayed around 12.2-12.3.) It was just crossing the 22 to 24 thread barrier that just arbitrarily kills the system dead, or at best after a few seconds (furmark goes visibly unstable at that point, with framerate dropping from 500+ to a stuttery mess.) I could actually save it a few times if I dialed back the number of tested threads quickly enough, but sometimes it would just flatline immediately. This is in spite of the CPU's power draw at 20 threads loaded PROBABLY being effectively similar to 24.

And going back to previous examples, I would encounter similar results testing with SMT off, or with only 6 cores active. It feels like an arbitrary switch. Anything less than the total number of cores/threads loaded is fine. As soon as it tests all cores, it dies. So I could run 11 out of 12 cores and be fine. But 6 out of 6 would fail. Hell, I think I'll try running with only four cores enabled when I get home today just for the fun of it.
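The "fine at N-1, dead at N" ramp described above can be reproduced in a controlled way with a small load generator. This is a stand-in sketch, not the 7-zip benchmark: a plain busy loop draws far less power than 7-zip or Prime95, and the OS scheduler decides which cores the workers land on. The function names are made up for illustration.

```python
import multiprocessing as mp
import os
import time


def burn(seconds):
    """Keep one logical CPU busy with a counting loop until the deadline."""
    deadline = time.perf_counter() + seconds
    count = 0
    while time.perf_counter() < deadline:
        count += 1
    return count


def ramp_steps(logical_cpus, start=1):
    """Worker counts to test, ending with every logical CPU loaded."""
    return list(range(start, logical_cpus + 1))


def run_step(workers, seconds):
    """Load `workers` logical CPUs at once for roughly `seconds` seconds."""
    with mp.Pool(processes=workers) as pool:
        pool.map(burn, [seconds] * workers)


if __name__ == "__main__":
    total = os.cpu_count() or 1
    # Start a few workers short of the total, then step up one at a time.
    for workers in ramp_steps(total, start=max(1, total - 4)):
        print(f"loading {workers} of {total} logical CPUs", flush=True)
        run_step(workers, seconds=30)
```

Run alongside a GPU load, the last printed line before a shutdown shows which step was active, which helps separate "all logical CPUs busy" from "a certain wattage reached."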

elstaci · ‎12-29-2020

I would suggest you open an AMD Service Request (Official AMD SUPPORT) under "Warranty" and see if they believe your CPU needs to be RMAed to be checked or replaced, from here: https://www.amd.com/en/support/contact-email-form

It is possible you have a defective new Processor.

pokester · ‎12-29-2020

You are hardly alone in having issues with these new CPUs; there are far too many complaints for my liking. However, ever since the microcode leading up to support for the new generation began showing up in BIOS releases last summer, even those of us that had no issues with Zen 2 started getting the WHEA issues. I had to roll back my BIOS, and then I was fine again. That, however, is not a possibility for the new CPUs. To me it is pretty obvious AMD has a microcode issue, and I'm not sure why it is taking so long to get it resolved.

I think @elstaci is correct, probably time to talk to AMD about an RMA. FYI, don't mention any of the OC, PBO or XMP usage as it all voids warranty.

Good Luck!

Leindurstit · ‎12-29-2020

That's very unfortunate, certainly not encouraging for my first AMD build since Socket 939! Definitely no shortage of WHEA errors when searching for info on this problem, with Zen 2 as well.

Also regarding PBO, it seemed to be one of those "Auto, Enabled, Advanced, Disabled" settings in the BIOS. Odd that a warranty-contingent setting would garner an "Auto" setting, implying it would be on by default potentially. Odder still, even when forced to be disabled when doing some early testing, the CPU's clock speed and voltage behavior seemed to be the same, so given that I've only jumped between Auto and Disabled, I wonder if Auto = Off.

Honestly at this point I'm content with the relative stability for regular applications. It's just so unfortunate that it requires this level of cracking my head against the wall with compromises. I intentionally (since it's my first modern system that out-of-the-box happily cruises into 4+ GHz anyway) avoided touching ANY of the dang speed settings since I wanted to just enjoy what it should be doing on its own, but nope. Heck.

pokester · ‎12-29-2020

I am sorry it is an issue for you. I can honestly say that Zen 2 was a great experience, and I was hesitant about it as I had tried Zen and returned it because of problems. Now with Zen 3, AMD seems to have some issues again, and whatever they did seems to now affect the Zen 2 processors as well, when they had worked great. Especially with so few of the 5xxx chips having been sold so far, the complaints IMHO are scary HIGH. The same thing seems to be happening for a lot of folks on RDNA2: a lot of complaints so far when so very few have even been sold. AMD really has the opportunity to increase market share on all fronts, but instability isn't going to do that. Myself, I would gladly buy something a tad slower if it means it just works. AMD needs to really focus on fixing their firmware, drivers and software features.

I think there is a really good shot that the WHEA errors will get sorted, although whether you want to deal with it until that happens is another story that only you can decide. I returned my 8-core 1800X because it just constantly restarted, my 30-day return period was almost up, and two updates in that time frame fixed nothing. So I returned it for an i7-7700K that worked and still does work with no issues. However, it is not my main rig, as since then I have gotten two 3600s and a 3700X setup. Those are fine, but if I load the latest BIOS I get the WHEA errors.

Leindurstit · ‎12-29-2020

Right, and being still within the return period for everything, the prospect of "settling" with another Intel has crossed my mind, even more so now if trying to score a "temporary" Zen 2 part runs the risk of having similar issues.

Of course, the first meaningful generational IPC increase in the industry since Sandy Bridge and it's dying on the vine like this. Plus, with such a low volume actually making it into users' hands instead of eBay listings, it'll be even more difficult for a high enough volume of addressable problems to surface.

xesap58662 · ‎12-30-2020

I'm guessing the WHEA errors are caused by your voltages being too high or, more likely, too low.

Check your VDDP, VDDG, and SOC voltages with HWiNFO and Ryzen Master.

Leindurstit · ‎12-30-2020

Updated to MSI's implementation of AGESA 1.1.9.0 that just dropped. Going for broke trying for WHEA errors: all default fresh-BIOS-reset settings, XMP enabled, and so far zero WHEA errors in places where they would show up repeatably and often before. Promising so far.

Will still need to try my "other" problem of the system shutdown under synthetic full loading of the CPU and GPU, but at least it might be otherwise stable at this point.

Ryzen 5900X Shutdown at 100% Load