For the year I've owned this system, I've been plagued by the notorious AMD crash-when-idle issue. The issue is simple: my system crashes when idling, but is stable when in use and under any kind of load.
About once a week, I leave my system idling for a while, my monitors go to sleep, and when I get back to it, I find it unresponsive.
A bit of googling turns up reports that AMD processors and motherboards (I'm not sure where the fault lies) are unstable with C/P states, and a fix that seems to have worked for people is setting:
Zen Common Options > Power Supply Idle Control to "Typical current idle".
This seems to be the recommended solution.
However, in my case, this has only made my system more stable than before: I used to have crashes as frequently as once every 2-3 days, and with the power supply setting I have one every 1-2 weeks. It has not made the system reliable.
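For anyone who wants to check what the kernel actually sees after changing the BIOS setting, here is a minimal sketch. It assumes a Linux system that exposes the standard cpuidle sysfs directory; the function name `read_cpuidle_states` is just illustrative. It lists each idle state for one CPU together with how often it has been entered, which may help show whether deep C-states are still in use.

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return one dict per idle state the kernel exposes for a single CPU.

    Assumes the usual cpuidle sysfs layout (state0, state1, ...), each with
    'name', 'usage', 'time' and 'disable' files. Missing files become None.
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []  # cpuidle not available (e.g. not Linux, or driver not loaded)

    # Keep only directories named state<N> and sort them numerically.
    state_dirs = [p for p in base.glob("state*") if p.name[5:].isdigit()]
    state_dirs.sort(key=lambda p: int(p.name[5:]))

    states = []
    for state_dir in state_dirs:
        info = {"id": state_dir.name}
        for field in ("name", "usage", "time", "disable"):
            f = state_dir / field
            info[field] = f.read_text().strip() if f.is_file() else None
        states.append(info)
    return states


if __name__ == "__main__":
    for s in read_cpuidle_states():
        print(f"{s['id']}: name={s['name']} entered={s['usage']} "
              f"time_us={s['time']} disabled={s['disable']}")
```

If the deeper states still show a growing `usage` count after the BIOS change, the setting is probably not restricting idle states in the way one might expect, though that alone does not prove it is the cause of the crashes.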
I've tried everything at this point:
- Trying out a higher-power PSU (750 W -> 1000 W)
- Trying out various Linux kernels and the rcu_nocbs kernel parameter
And I'm honestly on the verge of giving up.
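Since it is easy to lose track of which boot parameters are actually active, here is a small sketch that pulls the idle-related ones out of the kernel command line. It assumes Linux and reads `/proc/cmdline`; besides `rcu_nocbs`, the keys `idle` and `processor.max_cstate` are my own additions to the default list as other commonly suggested parameters, and the helper names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict.

    Flags without '=' map to True. Quoted values containing spaces are
    not handled; this is a simplification for illustration.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def idle_related_params(cmdline,
                        keys=("rcu_nocbs", "idle", "processor.max_cstate")):
    """Return only the parameters in `keys` that are present on the command line."""
    parsed = parse_cmdline(cmdline)
    return {k: parsed[k] for k in keys if k in parsed}


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    with open("/proc/cmdline") as fh:
        print(idle_related_params(fh.read()))
```

An empty result would mean none of these parameters made it onto the running kernel's command line, for example because the bootloader configuration was edited but never regenerated.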
As a last resort, I've written to AMD and requested an RMA, which they have approved. I live in Belgium and the processor needs to be shipped back and forth to the Netherlands, a process that might take up to 3 weeks (going by estimates I've seen on the web). This means serious downtime for this machine.
At the risk of this being yet another post about AMD processors and power-related crashes, I'd like to ask the community what my options are.
- Is an RMA going to be worth it, or likely to fix the issue? This might be the case if the issues were related to some batches of processors. I have heard about the segfault bugs being fixed in later iterations of the same processor, for example.
- Do I have other routes I can take to try and fix this?
- Is there anything from AMD explaining what the issues with these processors are?
- The AMD support person I spoke to asked me to try out the power settings workaround (and disabling C-states), which didn't solve my issue. When I told them in a follow-up reply that the workaround didn't help, they immediately suggested I request an RMA, which suggests to me that AMD is aware of issues like these?
- Have other AMD users RMA'd their processor, and if so, what was their experience?
My hardware configuration:
- Processor cooler: Noctua (x1) NH-U14S TR4-SP3
- Cooler for case: Noctua (x4) NF-A14 PWM
- Memory: Corsair Vengeance LPX (4x 8GB)
- PSU: CORSAIR RMx Series RM750x - 2018 Edition
- Extra processor cooler: Noctua NF-A15 PWM (x1)
- SSD: Samsung 970 EVO Plus 500GB
- Motherboard: ASRock X399M TAICHI
- GPU: Gigabyte Radeon RX 580 GAMING 4GB
- Case: Fractal Design Define C TG (mid-tower model)