New build (10 days old): Ryzen 9 5900X + Noctua NH-D15S on an ASUS ROG B550-F Gaming Wifi mobo, 32 GB (2x16 GB) DDR4 3600 MHz Trident Z RGB, 1 TB PCIe x4 SSD + 4 TB WD SATA HDD, ASUS TUF RX 6800 XT GPU. No overclock, besides A-XMP for memory and PBO on default in BIOS, all that in a Phanteks P400A case (3 fans on the front as intake, 1 fan on the rear panel as exhaust). As I said in a previous post, I am experiencing high temps with that build, but it seems to be more or less the norm with those components. PSU is a Fractal Design Ion+ 860W 80+ Platinum.
What worries me is that when gaming, I sometimes experience random reboots: no BSOD, just a few "clicks" in my headphones and a simple reboot. It happens once or twice per day max, sometimes after 2-3 hours of gaming, sometimes sooner. For the last two days, Event Viewer has been reporting a chain of three WHEA-Logger Event ID 18 entries when that happens (before those last two days, I could not find anything in Event Viewer, even when the reboots were happening). When the machine reboots, the RX 6800 XT seems to be "offline": even if I unplug and replug the DisplayPort and HDMI cables between my GPU and my monitors, no signal gets to the monitors. I have to restart the PC manually a second time (by pressing the power switch for 8 seconds) before the GPU sends a signal to the monitors again...
As I am still in the warranty period for all my components, I am trying to understand what is happening with this machine: faulty GPU ? faulty CPU ?
We can get the high temps under control. That is the easy part.
The graphics card seeming to be offline will be more difficult. I am a little confused when you say it goes away when you boot it yourself. If it crashed and then came back up by itself, there is no telling what state it left your devices in. So let's ignore that for now. I believe the graphics card seeming to be offline is just a victim.
This is what I tend to advise most people, and I believe it applies here too:
I would set VSoc from Auto to Normal
Then add a .006v positive differential to VSoc
Secondarily, I would set VCore from Auto to Normal
Then add a .006v positive differential to VCore
The above boosts voltage by the smallest amount (which your processor needs).
The amount is less than what the CPU would boost when it changes frequencies.
A very slight boost of voltage for VCore and VSoc is what I recommend for most crashes. (A little goes a long way.)
Turn Core Performance boost Enabled
Turn PBO from Auto to Advanced
Set PBO Limits to Manual
Turn PPT from 142w down to something like 120w
When the processor sees that it is approaching the new lower power limit, it will be discouraged from boosting to an even higher frequency.
Keep TDC at 95A, and keep EDC at 140A.
Set "Platform Thermal Throttle Limit" to Manual. Then, in the next field that pops up, set the limit down from 90 to 75, or whatever temperature in Celsius you feel comfortable with.
Try this, this should help, and if it doesn't totally cure it, use .012 for VSoc and Vcore differential. (still a VERY minor increment)
*****
Yes, Core Performance Boost can make the system run very hot. Not so much PBO. If you turn them off you lose lots of performance. But if you enable them and let them run with a more limited power budget (you don't really want to hammer that CPU, do you?) and a tighter limit on temperature, you get much of the same boost as before, sometimes even greater, but without the negative stuff. Try it. I think you will like it.
Thanks for the advice and educative bits @Gwillakers ! It is my first high-end Ryzen CPU, so I am def a noob when it comes to all those "advanced" controls. I will set that up in BIOS tonight and let you know the results in this thread.
Quite a few findings tonight @Gwillakers... Basically, I think ASUS is VERY aggressive when it comes to default clocking on that board (ROG Strix B550-F Gaming Wifi): before I changed anything according to what you recommended, I reached 21150 in Cinebench R23 multi-core. I did what you suggested, even though Core Performance Boost was nested deep and was limited to Auto or Disabled (I left it on Auto), and the Cinebench score dropped to 18100, roughly 3000 points less. Right now, I sit at 142W for PPT and 150A for EDC (with a temp limit of 90C), and I can reach 20100 points. So I do not know what the "Default/Auto" options of the ASUS BIOS entail in terms of optimization, but they sure seem to be very aggressive.
Even with those new settings, I barely reach 75C max on Tdie CCD1, whereas before I was reaching 84/85C regularly. I will continue to experiment and test the stability too. I may do as you suggested and put an offset of 0.012V on those two voltages you mentioned instead of the 0.006V I have right now.
I will continue to update this thread, as it may be useful to some others.
Frankly I was surprised that your performance dropped 14%.
I had a much better experience.
But tell me though, did you crash with my settings? (I thought stability was your main goal not a raised Cinebench score)
If you were disappointed in the new CB score, I would have gently raised PPT from my suggestion.
You've gone back to Full Power (PPT=142W), Original Temp (allow it to go to 90C) and raised Current to 150A.
If the new settings are stable, I would refrain from boosting Voltage for the Core and Internal Memory Controller (VSOC), any higher than the now incremented .006
Well, you can continue to experiment. That's what makes this so fun.
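For reference, the 14% figure follows from the two Cinebench R23 scores quoted earlier in the thread (21150 at defaults, 18100 with the lowered limits), and the later 20100 result works out to roughly a 5% drop. A quick check in Python; the function and variable names are just for illustration:

```python
def percent_drop(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


stock = 21150    # Cinebench R23 multi-core at BIOS defaults (from the thread)
limited = 18100  # with the suggested lower limits (from the thread)
retuned = 20100  # PPT 142W, EDC 150A, 90C limit (from the thread)

print(f"limited: {percent_drop(stock, limited):.1f}% lower")
print(f"retuned: {percent_drop(stock, retuned):.1f}% lower")
```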
So far so good, no reboot yet. but at the same time I did not game much yesterday evening, I mostly experimented with those settings. Hopefully I will have a bit more time today to test stability.
As for going back to default settings, even pushing them a bit (EDC), the CPU temps seem to be much more under control. So I am wondering what settings ASUS is using on this board when everything is left on "Default" rather than set manually...
Anyway thanks once again for your help @Gwillakers, will update the thread with the stability results and mark it as solved if I do not experience any more reboots.
If you set a lower platform thermal limit, the CPU will respect it. So if you had lowered it to 75C, you'll never see temps above that as the CPU will turn down its performance to meet that limit. The same goes for the power limits. The CPU will boost up as high as it can go until it hits either its frequency, power, or thermal limits. Whichever one it hits first, that is the limit of the CPU's performance as any more would exceed it.
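The "whichever limit it hits first" idea above can be sketched as a toy loop. This is not how the real boost algorithm works internally; the linear power and temperature curves below are invented purely for illustration, and only the 3700 MHz base clock, 4800 MHz max boost, and the 142W/120W/90C/75C limits come from the 5900X spec and this thread.

```python
def boost_until_limited(limits, power_at, temp_at, base_mhz=3700, step_mhz=25):
    """Raise frequency step by step until the next step would break a limit.

    Returns (final_frequency_mhz, name_of_the_limit_that_stopped_us).
    """
    freq = base_mhz
    while True:
        nxt = freq + step_mhz
        if nxt > limits["freq_mhz"]:
            return freq, "frequency"
        if power_at(nxt) > limits["ppt_w"]:
            return freq, "power"
        if temp_at(nxt) > limits["temp_c"]:
            return freq, "thermal"
        freq = nxt


# Invented toy curves (assumptions, not measurements):
def toy_power(f_mhz):
    return f_mhz * 3 / 100          # watts


def toy_temp(f_mhz):
    return 30 + toy_power(f_mhz) * 2 / 5   # degrees C


stock = {"freq_mhz": 4800, "ppt_w": 142, "temp_c": 90}
cooler = {"freq_mhz": 4800, "ppt_w": 142, "temp_c": 75}

print(boost_until_limited(stock, toy_power, toy_temp))
print(boost_until_limited(cooler, toy_power, toy_temp))
```

With these made-up curves the stock limits stop on power, while the 75C setting stops on temperature much earlier, which is the trade-off described above.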
Motherboard 'Auto' settings can sometimes be sketchy. They're basically settings the mobo manufacturer thinks are a good starting point and sometimes these can be cranked up a notch. This isn't unusual as they like to see their mobo's perform well in comparisons and so this type of 'optimization' has been happening for many years. For example, my old Intel Haswell system (also an Asus m/b coincidentally) reboots by itself soon after getting into the Windows desktop with all default settings. After manually setting the turbo core multipliers to what Intel says they should be instead of Auto, or adding a bit of voltage, Windows is perfectly stable. So Asus was doing something sketchy with the Auto settings - prolly running all cores at the max multiplier all the time (fairly common among m/b's back then), and my particular CPU couldn't do that at stock voltage.
TLDR - it's very possible, maybe likely, that the combination of slightly aggressive default mobo settings and the silicon lottery can require a slight bit of extra voltage to be stable for some users. As I understand it, AMD bins these Ryzens very tightly so there's not too much headroom on most CPU samples.
For sure, keying in the settings manually instead of leaving them Auto is the best way to be sure you're running at the settings you expect. But it's a lot of settings...
I'm really glad to hear that you are stable.
Ryzen checks the Parameters (Throttle Temp, PPT, TDC and EDC) a thousand times a second, and with that it has the options to change (the frequencies the cores run at, How many cores get dispatched, Will two threads be scheduled in a core.. voltage for a core). So as I said in the first thread, controlling the temp is the easy part. Heck, even if you took the fan off your heatsink, Ryzen would dutifully keep the Individual core temps down.
As far as what "ASUS" is doing with the Auto settings: nowadays, it seems more under the control of AMD. I believe AMD supplies the AGESA subroutine to the motherboard manufacturers to incorporate in their respective BIOSes. I am not sure of everything that AGESA is responsible for, but one thing seems to be the training of the memory subsystem. AGESA helps "train" the memory when one boots up. This is the setting of the memory timings by seeing how many slots are occupied, how many ranks and banks, etc. It uses SPD info from the modules, as well as XMP and DOCP profiles.
It is because vendors now use the Agesa as a subroutine, that I, who am more familiar with Gigabyte boards, was able to explain the parameters to you using an ASUS board. The individual bios screens look so similar today.
As to what is happening at Default, with Core Performance Boost and PBO enabled, I think it is a matter of having a good understanding of the parameters (thermal throttle, PPT power, TDC current and EDC current). Remember that CPB and PBO are performance-boosting settings that motherboard manufacturers turn on by default (strictly speaking, CPB is AMD's stock boost and PBO is the one AMD classes as overclocking). They know that their boards will be compared with their competition and they want to shine. So they ship with very liberal settings. Their prime mission is to run fast for the reviewers, for they know the reviewers will be able to make them stable. They don't care about your stability, your processor's longevity, or your processor's warranty, which AMD treats as void once PBO is used, just as long as their board gets the best Cinebench, 3DMark, or Time Spy score.
This is in direct contrast with the conservative path normally taken by the computing industry. You see, normally when an item ships from the manufacturer, it comes with settings so that the item will at least work, but not necessarily work best for all situations. Operating systems like Windows, games, and graphics cards come shipped with settings that ensure that they will at least run on the cheapest piece of garbage you may own. Thus the settings do not take advantage of the best your hardware has to offer.
Another example: the pre-applied thermal paste, or pads, are often way too much. Thermal paste is meant to fill the "microscopic" pores between heatsink and CPU. The thermal conductivity of aluminum is about 200W/mK, copper about 400W/mK, thermal paste only about 6W/mK. Using too much thermal paste is actually detrimental to heat transfer. So why do they ship with so much? They can't take the risk that their heatsink is poorly machined or the CPU is warped. They prepare for the worst case. I am of the opinion that the industry-wide "pea size" application of thermal paste is way too much. One should put not much more than a greasy fingerprint between the two. However, if you are going to limit your application to a "greasy fingerprint" you must first apply, then remove the heatsink to verify that the paste has transferred. The eye test is not good enough. Too many times two perfectly smooth pieces of metal do not make contact when laid one on another.
Other settings such as Volume Cluster size, buffer allocation, Above 4G decoding are often tailored for the smallest systems with fewest resources (memory, hard disk space, Processor speed)
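On the thermal paste point above, the effect of layer thickness can be estimated with the standard one-dimensional conduction formula dT = P * t / (k * A). The sketch below uses the conductivity figures quoted above; the 40 mm x 40 mm contact area and the two layer thicknesses are assumptions picked for illustration, and it pretends all 142W flows evenly through the layer, which a real CPU does not do.

```python
def conduction_delta_t(power_w, thickness_m, k_w_per_mk, area_m2):
    """Temperature drop across a flat layer: dT = P * t / (k * A)."""
    return power_w * thickness_m / (k_w_per_mk * area_m2)


AREA = 0.04 * 0.04  # assumed 40 mm x 40 mm contact patch, in m^2
POWER = 142         # W, the stock PPT mentioned in this thread
K_PASTE = 6         # W/mK, figure quoted above
K_COPPER = 400      # W/mK, figure quoted above

thin = conduction_delta_t(POWER, 0.00005, K_PASTE, AREA)    # assumed 0.05 mm layer
thick = conduction_delta_t(POWER, 0.0005, K_PASTE, AREA)    # assumed 0.5 mm layer
copper = conduction_delta_t(POWER, 0.0005, K_COPPER, AREA)  # same 0.5 mm, but copper

print(f"thin paste:  {thin:.2f} C")
print(f"thick paste: {thick:.2f} C")
print(f"copper:      {copper:.2f} C")
```

Under these assumptions a layer ten times thicker costs ten times the temperature drop, which is the argument for keeping the paste layer as thin as contact allows.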
======= But I digress... Let me get back to the topic.
The parameters chosen by motherboard manufacturers are like applying the whip to the horse as it leaves the barn, or redlining the Jag as it leaves the auto sales lot. Not typically how you would want to treat your animal or machine.
If you have a system that is stable enough to get you into bios, or windows for even a short while, then you have a system that is very close to being stable. It only requires a minor tweak here or there to get you fully stable.
I felt that an additional .006V to VCore and/or Vsoc would have been enough to put you stable.
You chose to add .100V, which I won't say is dangerous, but I believe you are at the limit there.
(You now risk overvolting). I will say that the voltage you chose enabled you to select a -25 unit PBO offset.
My understanding, is that the -25 offset will not use less voltage, but will enable greater frequencies at the voltage selected (Check me on this, not 100% sure)
With your additional experience, I would go back and only add .020V which would only allow you a reduced negative offset in PBO. But You do You, because everyone wants different things from their systems. You may want the bigger Cinebench scores.
Here's how I view the more important parameters: (Remember, I am not an electrician. I come from a programming and systems tuning background. Most of my PC experience, which goes back decades, has been gathered as a hobbyist. So take what I say here with a grain of salt. Anyone who spots an error, please correct me.)
Thermal Throttle Limit: This one is easy. Ryzen obeys this very well. Remember, Ryzen makes adjustments a thousand times a second. Your monitoring program is probably updating its values only every second. I think this limit is applied to each and every core. I know HWMonitor64 reports a separate value for each core. Once a given core reaches the limit, the CPU will lower the frequency, choose not to dispatch a 2nd task on the core, or even offload the work to a cooler core.
PPT is power. It directly relates to how fast you can run things, like core multipliers and memory frequencies.
Now why do things run faster with more voltage? Well, you might have heard of DDR RAM. Double Data Rate.
You might have seen engineering diagrams where the information is transferred on the rising edge and falling edge of a signal. (Now I could be hallucinating here, but this is how I see it.) I believe if you could look at the signals in super slow motion, a memory signal would not go from 0 to 1.2 volts instantly. It would transition from 0 to .1, then .2, then .3, then .4, all the way till it got to 1.2V.
Think of this signal as a wave in the ocean. The more height the wave has, the faster it can lift you up and let you down.
EDC is the current that you can dish out to the various cores. I look at EDC as the total package, and each core will take what it can get. Whereas Voltage was needed to run things fast, more current is needed to run many cores.
EDC is the total amount to give out when things are running cool.
TDC is typically a smaller package of current than EDC. When one of your cores gets to the thermal limit that you set, the processor will stop dishing out EDC amps and will only dish out TDC amps. This will reduce the amount of work being done, but it also allows you to run cooler.
One thing quoted to me, that I took on faith, is that every rise of 10 degrees Celsius halves the life of a chip.
So if you knock down the temps of your chip from 90C to 70C, you effectively extend its life fourfold.
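Taking that rule of thumb at face value (it is a rough heuristic the poster says he took on faith, not a datasheet figure), the fourfold number is just two halvings undone: 2^((90-70)/10) = 4. As a tiny sketch, with a made-up function name:

```python
def lifetime_factor(temp_from_c, temp_to_c, halving_step_c=10):
    """Lifetime multiplier when running cooler, assuming life halves
    for every `halving_step_c` degrees of temperature rise."""
    return 2 ** ((temp_from_c - temp_to_c) / halving_step_c)


print(lifetime_factor(90, 70))  # 90C down to 70C
print(lifetime_factor(90, 75))  # 90C down to the 75C limit suggested earlier
```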
My opinion:
Safe VCore voltage .2 to 1.3ish V
1.4 is pushing it a bit. However, if you watch Ryzen Master, it seems it only runs 1.4ish when a single core or maybe two cores are running.
Safe VSoc 1.0 to 1.1V. Used for APU graphics and the internal memory controller.
For trouble with WHEA errors when the system is idle, you should look into adding slight voltage to VCore or VSoc.
Running hot: should be controlled with the thermal throttle limit.
Crashes while overclocking (any OC over 3.6GHz): control by limiting PPT.
Good Luck
Thanks a lot for those very detailed explanations @Gwillakers and @ryzen_type_r : I learnt a lot on the Ryzen CPU ecosystem in two days thanks to you.
So far so good: with the 0.006V increase, the system seems to be stable. I did manage to game for a good 3 hours yesterday evening without experiencing a reboot, so I will not push it to a 0.012V offset. I may tweak the parameters (PPT, TDC, EDC and throttle temps) a bit more, but honestly the CPU is powerful enough for what I am doing with it right now (mainly gaming).
I will mark the thread as solved in a few days if I do not experience any more reboots.
Thanks once again, really appreciated the help !
You're Welcome.
The system should hold up for months without giving you problems. Typically for me problems appear during severe weather changes. Like the room is too hot, or too cold (frigid dry air, think static electricity).
Enjoy.
Marking the thread as solved, as I did not experience any more reboots. I think the CPU was just missing a bit of juice.
Re-opening this thread, as the problem seems to be still present. I applied everything that was recommended, upping the VCore and VSoc voltages by a positive offset of .012(5) V, and manually controlling the Precision Boost Overdrive settings (PPT 130W, TDC 95A and EDC 140A respectively).
Still, in graphically intensive games, I experience random machine reboots (every 5-10 mins in Horizon ZD, every 30-45 mins in Cyberpunk 2077): no BSOD, nothing logged in the Event logs (except that the reboot was not scheduled - duh Windows!), just a black screen, a series of audio "clicks" and a reboot with the video outputs dead until I manually shut down and restart the machine...
At this point, I am thinking faulty Motherboard or PSU, but I am at a loss about what to try to pinpoint the problem...
At this point I'm more inclined to suspect the video card or PSU, since you say the reboots only happen under graphically intensive gaming.
But first, stress test the CPU and memory to rule them out. Run Small FFT Prime95 and Testmem5 separately for a coupla hours (I usually do 48-72 hours on new builds), if you don't get any errors or reboots in either, they're not likely to be the cause of the problem.
Your PSU is a good quality one, but make sure you're running two separate PCI-E power cables from the PSU to the video card. Do not use one cable and its pigtail to connect both PCI-E connectors on the video card side.
If you can get your hands on another video card, swap it out and see if you get the same problem. Otherwise, I would try to log GPU temps to see if it's a temperature related shutdown. You can also try downclocking/downpowering the video card to see if it becomes more stable.
Hey @ryzen_type_r, thanks for the answer.
The GPU is on two separate power cables from the PSU, as I remembered that putting the two 8-pin connectors on the same cable could cause power instability.
I ran Small FFT in Prime 95 all night (9 hrs straight) with 0 errors and 0 warnings, machine was super stable this morning. I ran one full day of Windows memtest a few weeks back, and had no problem at all.
I began running HWiNFO64 in logging mode yesterday and experienced one crash in Horizon ZD. I am no expert, but I saw in the CSV that on the GPU side, the GPU allocated memory had reached 13.5 GB (the card has 16 GB of GDDR6 RAM), and that the other indicators seemed to come down three lines before the end of the file, meaning (I think) that the game crashed 6 seconds before the machine rebooted itself (HWiNFO64 logs one record every 2 secs by default, I think)... Could it be an unrecoverable GPU driver fault? And if so, why in some games and not others? I think I will initiate a support call with ASUS for my TUF 6800 XT OC 16 GB edition to see what their engineers make of that info. What do you think?
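For anyone who wants to look at the last few records of a sensor log like this without scrolling through the whole CSV, here is a minimal Python sketch. The column names in the sample are made up; real HWiNFO64 headers differ per system, so copy the exact names from the first line of your own log.

```python
import csv
import io
from collections import deque


def tail_rows(lines, columns, n=5):
    """Return the last `n` rows of a CSV log, keeping only `columns`.

    `lines` is any iterable of text lines, such as an open file.
    Columns missing from the log come back as empty strings.
    """
    reader = csv.DictReader(lines)
    last = deque(reader, maxlen=n)  # keeps only the final n records
    return [{c: row.get(c, "") for c in columns} for row in last]


# Made-up sample in the general shape of a sensor log (not real data).
sample = io.StringIO(
    "Time,GPU Memory Allocated [MB],GPU Junction Temp [C]\n"
    "20:01:00,13500,85\n"
    "20:01:02,13500,86\n"
    "20:01:04,900,60\n"
)
for row in tail_rows(sample, ["Time", "GPU Junction Temp [C]"], n=2):
    print(row)
```

With a real log you would pass an open file instead of the sample, for example open(path, newline=""), possibly with an explicit encoding if the header contains degree symbols. At one record every 2 seconds, the last three rows cover roughly the final 6 seconds before the reboot.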
I think it does sound like it's a GPU issue. Whether hardware or software, difficult to say. What did the GPU temperature look like before crashing?
Just before the crash (i.e. 6 seconds before the end of file), GPU junction temperature is at 85C, which seems normal for this card, as it oscillates between 83C and 88C during the whole gaming session.
On a side note, I initiated a support call with ASUS, and yesterday they were debating whether it is a software issue on their end or a hardware problem, hinting at a possible RMA. I will post more news as it unfolds.