CPU: Ryzen 9 5900x
CPU Cooler: Noctua NH-D15S
RAM: G.Skill Ripjaws V 32 GB (2 x 16 GB) DDR4-3600 (F4-3600C16D-32GVKC)
Power Supply: Seasonic Focus FM Series 750w (SSR-750FM) 80 Plus Gold
Motherboard: MPG B550 GAMING PLUS (BIOS 7C56v17)
GPU: XFX R9 290X Black Edition (Only Used part in the system - Confirmed working fine in old build each time I had to send an RMA in)
It persists across every 5900x I've put in this system - Crashing while idle. It only seems to be stable when running a game or Blender. It does this no matter what BIOS settings I change around. Positive curve offsets, Power Supply Idle Control set to Typical, PBO and PB2 disabled, C-States disabled, PSS/Cool n Quiet Disabled. It does this while the ram is at stock 2133 with no XMP/DOCP or manual overclocks. The RAM has also passed several MemTest86 USB boot tests, the ones that take like 4 hours and cycle through a whole battery of tests 4 times. G.Skill also claims that these specific sticks are compatible with the B550 gaming plus.
I've RMA'd this CPU as my troubleshooting solution so many times because of a few things: The first is that I've mostly seen resolutions with this specific issue when people RMA the CPU. Sometimes they can tweak some BIOS settings, but largely I see "Sent the CPU off to AMD, replacement working perfectly". In fact, I've seen a few stories where people bought a new PSU and Motherboard and still had the issue, which is why I've held off on that move for so long.
The second major reason is that AMD has approved the past 3 RMA requests. They said they tested them and found them faulty each time. Why should I assume the other parts in my system are to blame if AMD says they found a problem in the CPU? The main reason I'm giving up and asking for outside advice is this Corporate doublespeak in the "RMA Passed" email.
"Disclaimer: The content of this email is provided for informational purposes only. AMD makes no representation or warranties with respect to the accuracy of the content or of the information provided and reserves the right to change such information at any time without notice."
Which makes me feel like maybe it's more cost effective for them to send me a replacement without testing it. I really don't know at this point. I called my local trusted PC repair store to see if they had an AM4 CPU to test the system with but they don't. I'm considering buying a new motherboard and asking if they have other parts to test the system with. If I can't get a different CPU then surely a swap around of board, RAM, and PSU will determine if something's wrong. It's just such a hard problem to test in a reasonable amount of time. It took 2 days of off-and-on idle use for the problem to appear on this 4th attempt. How can I tell a repair shop "Yeah just browse youtube for 8 hours to see if it crashes"?
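To put a rough number on why a short shop test is unlikely to catch this, here is a minimal sketch. It assumes crashes arrive at a constant average rate (a Poisson process) and takes the "about 2 days" figure above as the mean time between crashes. Both are assumptions for illustration, not measurements, and the function names are made up.

```python
import math


def chance_of_no_crash(hours, mean_hours_between_crashes):
    """Probability that a still-faulty system survives `hours` without crashing.

    Assumes a constant average crash rate, so P = exp(-hours / mean).
    """
    return math.exp(-hours / mean_hours_between_crashes)


def crash_free_hours_needed(mean_hours_between_crashes, confidence=0.95):
    """Crash-free hours needed before 'it didn't crash' means much."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1 - confidence) * mean_hours_between_crashes


mean = 48.0  # assumed: one crash per ~2 days of idle use
print(round(chance_of_no_crash(8, mean), 2))   # 0.85
print(round(crash_free_hours_needed(mean)))    # 144
```

Under those assumptions, an 8-hour test passes about 85% of the time even if nothing has been fixed, and it takes roughly 6 crash-free days before a part swap can be called a fix with 95% confidence.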
If you're going to suggest changing voltages, I've already tried positive curve offsets and honestly don't want to pump more electricity into a high-performance chip that people are undervolting for longevity's sake.
Hello Bird.
Reading this seems so unbelievable to me. 3X RMA 4X Faulty? This is like **bleep**ing insane. Aren't they checking the **bleep**ing RMA processors or something? I'm thinking of 2 causes of this...
I'm also dealing with kinda the same problem.
This **bleep** is driving me insane, and reading your post here makes me feel really bad for you because you even RMA'd.
I really don't think they're checking the processors. I think just like how Apple finds it more cost effective to send you a new device rather than repair it, it's somehow cheaper for them to send you a new CPU without testing it after you've proven you jumped the hoops in the RMA process.
If they are checking the processors, 4 faulty ones in a row is inexcusable. I sent them a request for some sort of official documentation or statement that they did in fact test the CPUs. We'll see where that goes. I'm so sorry you're going through a similar problem. It's nauseating how frustrating this is because it's a problem at idle. So much harder to test for.
Maybe you have one of the troublesome Seasonic PSU models that were reported; check with their support.
Do you have any more information about this? I see some results that are a year or two old about some Seasonic PSUs having issue on newer cards under load. This is an older card with problems at idle, I'm not sure if that applies if it's what you're referencing.
Something to do with sudden voltage input. Who knows if that could also interact with the way the new Ryzen chips can jump to high voltage for milliseconds.
Only thing I recall is 'some pre-2018 model PSU' and that they posted a list of affected model/part numbers.
I may be wrong, but I think it was a different issue from the newer card one.
Thank you very much for this suggestion! I've contacted Seasonic, maybe they'll have some input on my situation. I wish AMD would make this easier.
Hello, I have a similar problem with the famous WHEA-18. I contacted AMD Support and, after trying to change components of my computer, the last thing I have left is to send my processor in under warranty. Unfortunately, in my case I cannot trigger the error on demand (I also have a post explaining this).
I bought new parts, RAM, etc., but I still have the same problem. Worst of all, I doubt the store where I bought the components can help me, since it is such a random problem that I will not be able to do anything.
My suggestion, sorry if it sounds too daunting, is to send the processor to AMD and have them respond.
I've had similar issues with a Ryzen 5 5600X. While scouring the internet I came across this. It's for the 5900 series, so it might fit the bill for you. Also, I have already replaced my PSU (it was a Seasonic variant) and that didn't solve my issue. Check out this link, it might help you.
I appreciate the link. It suggests updating the BIOS, but I'm already running the latest BIOS put out by MSI for my motherboard.
AMD got back to me after I requested proof of them testing my returns. It looks like it was written by someone on their smartphone. "We apologize for the incontinence caused, We see multiple RMA authorized even then Your facing the issue."
Seriously? 3 RMAs and this is the support I get? Is there a way I can get this seen by someone not following a checklist (poorly at that)?
Today I took my troubleshooting to the next level. I switched the MSI Gaming edge X570 for an Asus tuf 570 wifi plus... Wanna know the strange part? It's been running with the 5900x for 28h straight without a crash. I'm doing further testing, but maybe some MSI BIOS update bricked the board, leading to this issue. I don't want to speak too soon...
Update on me... Swapping the mainboard didn't help; it crashed with kinda the same error again.
I'm sorry to hear that, this is insane. What in the world is going on here?
It pains me to say that cases like that are why I've been so hesitant in replacing other parts of the system.
Yeah, idk, I kinda tried everything except swapping RAM and PSU...
Still no reply from AMD customer support on this. All I want to know is if they actually tested even one CPU I sent them.
Look at the first response I got when I asked for information on my 3 RMAs.
"Thank You for Your Email.
We apologize for the incontinence caused, We see multiple RMA authorized even then Your facing the issue. Please provide us the following details such as Windows build, Dxdiag report, facing the issue on one Application or all the Applications, Ryzen master in idle and load, provide Screenshots and Any popup information.
In order to update this service request, please respond without deleting or modifying the service request reference number in the email subject or in the email correspondence below."
It reads like a spam email. What do I have to do in order to talk to someone who can actually read and address what I'm asking them? Do I have to be someone with 100k followers on twitter? Do I need to ask Linus Tech Tips to champion behind the WHEA-18 plague? Do I need to contact my province's Consumer Protection office? I'm seriously considering that third option here.
I wouldn't bother going to youtubers.. especially the likes of Linus (ie not a shill)
I would just get a refund on your purchase and go for something else. AMD don't really give a crap.. they are concerned about profit and aren't too interested in you after they get your money which is evident in the level (or lack of) support they actually offer.
That email is pretty standard from the cheap support they hire for the roles; they copy-paste a bog-standard email and take zero initiative in looking into your RMA issues, etc.
They will want to have you jump through troubleshooting hoops until you just give up and stop contacting them. This is literally the support system so many companies in the gaming/pc world utilize now.
Further update. I told them flat out - I am not asking for troubleshooting information, I am asking AMD to tell me if they tested the CPUs I sent them, and if they think it's acceptable that 3 RMAs in a row were approved.
This is what I got in reply.
"We apologize for the incontinence caused, We see multiple RMA authorized even then Your facing the issue. Please try with different motherboards."
I'm getting the Consumer Protection Board for my province involved.
These Windows hardware errors are generally caused by running your CPU or memory too fast.
Now first cut all this nonsense.
1. Go into BIOS and Load Optimized defaults.
2. Disable Core Performance Boost (it is an Overclock)
3. Disable PBO (It too is an Overclock)
4. Do Not enable XMP or DOCP for your memory sticks. (maybe later, but not right now)
Yes your system will run a bit slower (but it will still be plenty fast)
However my belief is that it will run for a week without causing you trouble.
Most people don't know how to overclock their systems correctly and wind up blaming the hardware
It is only when you have a system running within specifications and then have a problem, that you really have a problem.
Good Luck
Hello. Maybe you could read the first post?
"It does this no matter what BIOS settings I change around. Positive curve offsets, Power Supply Idle Control set to Typical, PBO and PB2 disabled, C-States disabled, PSS/Cool n Quiet Disabled. It does this while the ram is at stock 2133 with no XMP/DOCP or manual overclocks. The RAM has also passed several MemTest86 USB boot tests, the ones that take like 4 hours and cycle through a whole battery of tests 4 times. G.Skill also claims that these specific sticks are compatible with the B550 gaming plus."
In 1998, I purchased an AMD K6 233 Socket 7 CPU. Its behavior was quite similar to your chip's. When I under-clocked the CPU to ~200 MHz, it ran fine. Later, I exchanged it for an Intel PII 233. The tech at the online store stated that: "They had issued many RMAs, due to the same failure.". As a test, have you tried under-clocking your processor? The technical term for the malfunction is that the CPU is overloaded. Probably, AMD ran-off a bad batch of 5900Xs.
I haven't yet. I'd believe that 100% as some folks have said they've fixed the issue by setting a manual speed of 3500 MHz vs the standard up-to value of 3800 MHz. The absurd luck of getting 4 overloaded CPUs in a row is almost impossible, but given I see WHEA-18/sudden crash threads pop up here almost every day it's not completely impossible.
With 4 RMAs I also suspect the motherboard, but I've been pursuing the CPU as the culprit because AMD said they tested my prior returns. I wish I could get someone who actually speaks English to discuss that fact with me.
So, if you are not overclocking at all.
The next step, with a non-overclocked system would be to make the most gentle Voltage boost possible.
Boost VCore VSOC and VRam
Change their setting from Auto to Normal, and then the displacement to the slightest positive value.
Understand that running long bouts of MemTest64 and Prime95 will not flush out all errors. Often memory hardware errors occur after the test is over. When the system is going idle, it will trim back multiplier frequencies and corresponding voltages. However, now the system is running with chips that are hot, and the idle voltage that worked before when the system was cool does not work on hot chips. Crashing while idle is a symptom of too much undervolting.
How hot are your Dimms getting?
I appreciate the reply, but I'm not all that comfortable boosting the voltage. Forgive me for asking, but what's the difference between "Auto" and "Normal" in this context? I imagine Normal will allow me to set the offset, but without any offset is there any change in how the voltages operate between the two settings? At what point would I need to stop raising them if I keep crashing?
The 5000 series has been out for almost a year, this is not an early production CPU either. If I as the end user need to increase the voltages to a CPU that's already known for running hot, there's a problem. Hardcore users already undervolt the 5900x for longevity, and I'm supposed to increase the voltage on mine? That's not right.
I can't readily find any temperature readouts for my DIMMs, it seems.
We are not overclocking anything, so your CPU should be nice and cool. No more than 60C at this point running some load.
Like you said the Normal setting is just like Auto but allows you to add a differential.
If you enter Normal, and then go into the differential field, I believe you will see a drop down.
Most Motherboards give you a selection in increments of .006 volts
Start with a differential of .006
Then proceed to .012 if necessary etc.
Now you may have to do this with VCore , VSoc and/or Voltage for your Dimms
Keep an eye on your voltages when you run. Vcore can range from 0.2 to 1.4 V.
Most of the time Vcore under load will run about 1.2 but if pushing only one/two cores Ryzen might boost this to 1.4ish
Monitor with Ryzen Master. Watch the VCore and Average Vcore when you have the activity bars displayed for each core.
I would not like to see 1.4ish if the system is boosting many cores, but short spikes for 1 or 2 cores should be fine.
VSoc - should be around 1.1 V (again add the smallest displacement to this)
Voltage for your Dimms should range from 1.2 to 1.35
*************************************
I don't expect you to have to bump Vcore by more than .012V and/or VSoc by .006V and you should run fine.
I doubt you should have to bump the Dimm voltage, running at default speeds,
These should be safe voltages for anyone. The increments that I suggest are minor compared to the adjustment that the CPU will do when adjusting frequencies.
*************************************
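As a rough illustration of the ranges listed above, here is a minimal sketch that flags monitored readings falling outside them. The ranges are this poster's rules of thumb, not official AMD or JEDEC limits; the ±0.05 V band around the 1.1 V VSoc figure, the sample readings, and the function name are all assumptions made up for the example.

```python
# Ranges in volts, taken from the post above (rules of thumb only).
# The VSoc band is an assumed +/-0.05 V around the "about 1.1 V" figure.
EXPECTED_RANGES = {
    "vcore": (0.2, 1.4),
    "vsoc": (1.05, 1.15),
    "dimm": (1.2, 1.35),
}


def out_of_range(readings, ranges=EXPECTED_RANGES):
    """Return {rail: (value, low, high)} for each reading outside its range."""
    flagged = {}
    for rail, value in readings.items():
        low, high = ranges[rail]
        if not low <= value <= high:
            flagged[rail] = (value, low, high)
    return flagged


# Hypothetical readings, typed in by hand from a monitoring tool.
sample = {"vcore": 1.45, "vsoc": 1.1125, "dimm": 1.19}
for rail, (value, low, high) in out_of_range(sample).items():
    print(f"{rail}: {value} V is outside {low}-{high} V")
```

With the made-up sample, vcore and dimm are flagged and vsoc is not. Sensor readouts are approximate, so a value a few millivolts outside a range is a prompt to look closer rather than proof of a fault.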
Thank you so much for the instructions, I really appreciate this guidance! I'll do this over the next few days and report back.
You're Welcome. Good Luck.
Update
I decided to send it in for RMA. Friendly Support helped me. It was indeed a faulty unit and I'm receiving a new one.
Sadly, my attempts weren't successful. I first decided to RMA the motherboard. The replacement they sent me did not solve the crashing at idle issue.
I then followed the instructions you so kindly provided. My MSI board only let me step things in 0.0125 values. Positive offsets of both 0.0125 and 0.0250 on the Vcore and VSoC still had my system crashing.
Having tried both offsets and replacing the motherboard, I'm going to move on to trying to replace the power supply. Call a local shop, see if I can swap mine for one they have temporarily.
Can you post a HWMonitor64 jpg of the Sensors?
How fast are you running the memory?
Can you post a pic of ZenTimings ? (you can download it from web, nice compact shows settings)
Same as the first post, stock 2133 on my memory. I had to turn the offsets back on, here they are with both the Vcore and vSoC at a positive 0.0125.
Thanks for the info, it is probably going to help. Some surprises here.
First your VSoc looks dead on good, min and max.
Now your VCore at max is okay, but you have to ask yourself how often is it at max and how many cores are being dispatched when it is at max. You can get a feel for this in Ryzen Master if you expand the two CCDs.
Now let's talk about your GSkill memory sticks. I understand that they currently are running at SPD and not overclocked. However, most memory sticks running at SPD require 1.2V and yours is getting 1.196V at max. Now that's close. However, when overclocked I believe GSkill allows you to push 1.35V through these sticks. So I think that while you are still running at SPD it would not hurt to push say 1.24V through the Dimms.
I have to hand it to GSkill though. The balls of them. Take some memory with an SPD rating of 1.2V, 2133MHz, tell the user to juice it up to 1.35V and expect it to run @ 3600MHz.
For some contrast I normally run Kingston ECC @ 1.2V - Their 2666MHz sticks I can bump up to 3200MHz with no additional voltage, and their 3200MHz sticks I can bump up to 3600MHz with no extra voltage. That is like 2 extra steps for each.
GSkill is asking you to push the voltage to 1.35V, so you can push 2133MHz right past 2400, 2666, 3000, 3200, 3400 and go to 3600! Like I said Balls. Take someone else's product and do what the user could have done himself. Oh, I'm sure they binned and cherry picked the best. LOL
Again for the time being keep the GSKill at SPD, you don't need the additional headache while trying to get stable.
Now your Title says the problem happens at idle. However, now looking at the latest figures you posted, it appears that I may have been looking at this all wrong. You are getting a tremendous boost. This is a product of the higher voltage. Do your minimum clocks ever fall below base? They should.
Ryzen is very funny, it can ramp up in a couple of milliseconds. It is very possible that your problem is not at idle but at boost frequencies. (I am tempted to tell you to take back the VCore +0.012 increment. Keep that on for now, but I think that particular adjustment wasn't necessary in this case.) Your VSoc is fine with the .012 adjustment.
Let's trim you back from boosting to those highest frequencies.
PBO to advanced.
Limits to Manual
PPT change from 142W to 125W (As the system approaches 125W, PPT will discourage higher clocks)
Of course keep Thermal Throttle temp to whatever you set it before (I think I told you but I forget)
Oh, yeah. I found that a little odd, but I also use HWiNFO64 instead of HWMonitor. It shows the "Effective Clock" rate down quite low. I've never had to get this down and dirty with monitoring PC performance and details, but I assume this is it doing so? I also am aware of the HWiNFO64 crashes from earlier this year. I make sure mine is up to date and I don't have it run at startup, I run a stand-alone .exe and only run it when I'm actively checking something.
Here's a picture of it showing the effective clocking down. I've also just this morning set my DRAM voltage to 1.24, and set that cap on the PBO PPT to 125. I'll let you know if anything happens. Again, I appreciate the effort you're putting into helping me and others here. Way more than AMD's outsourced e-mail system has done.
I actually use HWInfo64 more than monitor now. I must have recommended Monitor just from habit.
Yeah, run with those a while. See if all clears up.
I do this for fun. Most of the time, I am typing the same recommendations over and over to different users with the same problems. I just have to trim my own sarcasm, when people get on and rag about how bad hardware is, when 99% of the time it is some setting that they themselves have set. The users don't deserve the sarcasm, they just feel bad because their expensive hardware is not functioning the way it should.
Another crash. Though I suppose I should specify, both this and the last one were slightly atypical. The screen cut to black like normal, but rather than restart it simply kept the black screen and actually kept playing the audio of the video I was watching. I had to hold down the power button to shut it off.
Couldn't control it though, spacebar to pause or alt+f4 to close did nothing. Very weird.
Never worry about what a system does after first showing signs of failure.
It would be unpredictable then, so it could show anything.
Did your event log show anything?
Nothing except the usual WHEA-18 error. Actually, that was while I was out. I must have forgotten to set something onscreen to keep it stable.
The two that didn't automatically reboot don't have a corresponding WHEA-18 error. Other than that, everything else is the typical mess of Event Viewer information that doesn't seem particularly relevant.
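For anyone trying to sort crashes the same way, here is a minimal sketch that pairs unexpected-reboot events with WHEA-18 events by timestamp. It assumes the events have already been copied out of Event Viewer into (time, source, event ID) tuples, that unexpected reboots show up as Kernel-Power event 41, and that a related WHEA-Logger event 18 lands within a few minutes of it. The sample data, the 5-minute window, and the function name are all made up for illustration.

```python
from datetime import datetime, timedelta


def unexplained_reboots(events, window=timedelta(minutes=5)):
    """Return times of Kernel-Power 41 events with no WHEA-Logger 18 nearby."""
    whea_times = [t for t, source, event_id in events
                  if source == "WHEA-Logger" and event_id == 18]
    unexplained = []
    for t, source, event_id in events:
        if source == "Kernel-Power" and event_id == 41:
            # Keep the reboot only if no WHEA-18 falls inside the window.
            if not any(abs(t - w) <= window for w in whea_times):
                unexplained.append(t)
    return unexplained


# Hypothetical log entries, not taken from this system.
events = [
    (datetime(2021, 9, 1, 14, 2), "Kernel-Power", 41),
    (datetime(2021, 9, 1, 14, 2), "WHEA-Logger", 18),
    (datetime(2021, 9, 3, 9, 30), "Kernel-Power", 41),
]
for t in unexplained_reboots(events):
    print("No WHEA-18 near the reboot logged at", t)
```

In the sample, only the second reboot is reported, which is the pattern described above for the black-screen hangs that needed a manual power-off.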
I've been having the exact same issues. This is inexcusable.