CPU: AMD Ryzen 9 9950X
Mobo: ASRock X670E Taichi Carrara, BIOS version v3.0.8 2024/09/20
RAM: Corsair CMK192GX5M4B5200C38 4x48=192GB RAM set
RAM stress testing was done with a Seasonic PX-1600 PSU.
For this particular test I picked Debian 12 with Linux kernel v6.1.0-25, but I was also able to reproduce the same issue with Ubuntu 24.04.1 LTS.
I run my tests with default BIOS settings; no overclocking is applied to RAM or CPU.
After approximately 3 hours of running RAM tests (memtester 46G on 4 CPU threads) I am able to reproduce the green screen crash shown in the video above.
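For anyone who wants to repeat the stress run described above, here is a minimal sketch of how it could be scripted. It assumes that "memtester 46G on 4 CPU threads" means four parallel memtester instances of 46G each; that reading, the helper names `build_memtester_commands` and `launch`, and the optional loop count are illustrative assumptions, not the original poster's exact procedure.

```python
import shlex
import subprocess


def build_memtester_commands(size="46G", instances=4, loops=None):
    """Return one argv list per memtester instance.

    memtester is invoked as: memtester <mem>[B|K|M|G] [loops]
    Omitting loops lets each instance run until it is interrupted.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    argv = ["memtester", size]
    if loops is not None:
        argv.append(str(loops))
    return [list(argv) for _ in range(instances)]


def launch(commands):
    """Start all instances in parallel and wait for them (hypothetical helper).

    memtester normally needs root so that it can lock the tested memory.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Only print the commands here; call launch() explicitly to start the run.
    for cmd in build_memtester_commands():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to confirm that the total size leaves enough free RAM for the OS before starting a multi-hour run.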
During tests I am also connected to the PC via several ssh sessions to monitor all logs in /var/log (I have enabled old-school text logs such as syslog, kern.log and daemon.log).
During the crash I do not see a single suspicious entry in any log file. However, when the system crashes everything goes down (including sshd) and I am stuck with a green screen of death (no reboot).
The system is properly cooled and monitoring shows 60 degrees Celsius max during the crash.
I am not able to reproduce any stability issue when running the exact same setup with an old discrete AMD GPU (Hawaii chipset) and the iGPU explicitly disabled in BIOS.
I do not have much time for these tests, and I am not really bothered by the iGPU issue since I plan to run my PC with dGPU cards and the iGPU disabled. I also do not plan to spend time investigating this in Windows.
I am not sure whether this is a hardware, software, or firmware issue with the iGPU of my 9950X CPU.
Since BSD systems basically use the Linux GPU drivers, the only alternative on the software side would be to stress test with Windows, but I feel this is a waste of my time. I do not plan to use Windows on this PC.
TLDR: I hope AMD can fix this annoying issue in firmware or in the Linux kernel drivers. I hope this is not some bizarre hardware problem with my particular CPU/Mobo/RAM combo: with the iGPU disabled in BIOS and an external dGPU, my system is very stable and runs the same stress tests for more than 24h without any issues at all.
I can spend some time installing any Linux kernel and iGPU driver version you choose for testing purposes, if this helps to pinpoint the problem.
igpu, this is a user forum and we seldom see AMD people here. Contact AMD support here.
First, your video does not play, so we have almost nothing to go on. There is some Linux knowledge here but not a lot. I know very little Linux. Post all your parts and any error messages or Event Log entries (or equivalent). It would be very helpful if you could reproduce the problem on W11. John.
Thank you for the reply.
I was not able to reproduce the problem with Windows 10 LTSC 2019. The problem is specific to Linux and, since I do not plan to use the iGPU, not really relevant for me.
I also did not experience the issue on Windows.
Based on past threads, graphics issues that users experience with the IGP are often determined to be memory related.
The Corsair Memory Specs state the 4x48 kit is XMP 3.0 ready for the Intel 700 series chipset. They don't mention AMD compatibility or EXPO. It also seems this kit uses dual-rank modules, which typically cannot attain as high a speed, particularly in a quad configuration.
The ASRock Memory QVL for Granite Ridge also does not list this memory kit, nor does it appear that many quad kits have been tested on this board, and none of this density (4x48GB).
The AMD 9950X Specs support DDR5-5600 in dual kits, or DDR5-3600 in quad configurations. Most chips' IMCs are capable of exceeding these speeds, but those are the officially supported ones.
Have you tested the memory at SPD (4800) or XMP (5200)?
Have you tried manually setting the speed lower?
Have you tested with only 2 sticks of memory installed?
Thank you for clarifying that DDR5-3600 is supported in quad configuration.
You state the part is not in the QVL, but I picked this particular RAM precisely because the ASRock QVL actually lists this memory kit:
DDR5 | Corsair | 5200 | 5200 | 48GB | CMK192GX5M4B5200C38 ver 3.53.02 | SpecTek B-die | DS | 2 | 3.06 |
Albeit in 2 DIMM configuration.
There seems to be a contradiction here. Why list CMK192... (192 in the kit name stands for 192GB RAM) and use only 2 DIMMs from a 4 DIMM set?
Anyhow, I also installed Windows 10 LTSC 2019, ran a 24-hour test, and my system was stable with the iGPU.
I even ran the Windows memory testing tool and it passed with flying colors. I deleted LTSC from disk as it had no other purpose than to confirm there is no hardware issue with my system.
DDR5-3600 in quad configuration is very stable and I have no complaints.
I am running my setup with an external dGPU and the iGPU disabled in BIOS; I have no problems and I am happy with my benchmark and actual workload performance. Here is the output from the dmidecode tool:
Memory Device
Array Handle: 0x0010
Error Information Handle: 0x001B
Total Width: 64 bits
Data Width: 64 bits
Size: 48 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR5
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 4800 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMK192GX5M4B5200C38
Rank: 2
Configured Memory Speed: 3600 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 48 GB
Cache Size: None
Logical Size: None
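To compare the rated and configured speed of each module without reading the whole dump, a small parser can help. This is a rough sketch that only assumes the `dmidecode --type 17` text layout shown above; the function names (`parse_memory_devices`, `speed_mts`, `summarize`) are made up for illustration, and the sample text reuses a few fields from the dump above.

```python
import re

# A few fields copied from the dump above, used as sample input.
SAMPLE = """\
Memory Device
    Size: 48 GB
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL B
    Speed: 4800 MT/s
    Part Number: CMK192GX5M4B5200C38
    Configured Memory Speed: 3600 MT/s
"""


def parse_memory_devices(text):
    """Split dmidecode type 17 output into one dict per 'Memory Device'."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def speed_mts(value):
    """Convert a string such as '4800 MT/s' to an int, or None if absent."""
    match = re.match(r"(\d+)\s*MT/s", value or "")
    return int(match.group(1)) if match else None


def summarize(devices):
    """Return slot, part number, rated and configured speed per populated slot."""
    rows = []
    for dev in devices:
        if dev.get("Size", "").startswith("No Module"):
            continue  # empty slot
        rows.append({
            "slot": f'{dev.get("Bank Locator", "?")} / {dev.get("Locator", "?")}',
            "part": dev.get("Part Number"),
            "rated": speed_mts(dev.get("Speed")),
            "configured": speed_mts(dev.get("Configured Memory Speed")),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(parse_memory_devices(SAMPLE)):
        print(row)
```

On a live system the same functions can be fed the text captured from `dmidecode --type 17` (run as root) instead of `SAMPLE`.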
I provided my hardware information to let anyone reading this know that this particular hardware setup is good to go (in my opinion), but you may encounter issues if running without a dGPU. I suspect GDM (GNOME Display Manager) is a possible culprit, but I did not spend enough time to confirm it. (I switched to lightdm and was not able to reproduce the bug reported in my initial post, but afterwards I disabled the iGPU completely in BIOS and did not have a chance to do extended testing with the iGPU enabled.)
I should point out that my RAM kit version is v5.53.13, and this particular version is *not* listed in the QVL, but I am not sure how much a minor version change matters. I was able to boot with all 4 sticks only with BIOS v3.0.8 (v3.0.6 booted only with 2 DIMMs).
Unfortunately, when you order RAM on Amazon you do not see the kit hardware version number.
Hi! I'm experiencing the same issue with my 9950X on Linux. I have not yet found the time to dig deeper. I'm using different memory (clocked at 5600).
maxammann, memory specifications for the 9950X:
Here are my full specs:
Memory | KF560C30BBEK2-64 2 DIMMS with 32GB each |
CPU | 9950X |
Mainboard | B650 Steel Legend WiFi |
BIOS version | 3.08 |
Linux | Linux 6.6.59 #1-NixOS SMP PREEMPT_DYNAMIC Fri Nov 1 00:58:34 UTC 2024 x86_64 GNU/Linux |
Mesa | Mesa 24.0.7 |
Gnome | 46.2 |
My BIOS settings currently overclock the memory to 5600 via EXPO. The memory timings are set to Aggressive.
The kit KF560C30BBEK2 is in ASRock's QVL: https://www.asrock.com/MB/AMD/B650%20Steel%20Legend%20WiFi/index.asp#MemoryRAP
Linux dmidecode --type 17:
# dmidecode 3.6
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.
Handle 0x0013, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0010
Error Information Handle: 0x0012
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: Unknown
Type Detail: Unknown
Handle 0x0015, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0010
Error Information Handle: 0x0014
Total Width: 64 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR5
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 4800 MT/s
Manufacturer: Kingston
Serial Number: 08136DCF
Asset Tag: Not Specified
Part Number: KF560C30-32
Rank: 2
Configured Memory Speed: 5600 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
Handle 0x0018, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0010
Error Information Handle: 0x0017
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: Unknown
Type Detail: Unknown
Handle 0x001A, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0010
Error Information Handle: 0x0019
Total Width: 64 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR5
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 4800 MT/s
Manufacturer: Kingston
Serial Number: D4036AE0
Asset Tag: Not Specified
Part Number: KF560C30-32
Rank: 2
Configured Memory Speed: 5600 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
Thanks, maxammann. I will take a look at the details in time. Have you tried running at the SPD default memory settings? If not, please do.
This reply got lost this morning, sorry. John.
Thanks! Testing with default SPD right now.
Hi misterj, the green screen and system crash just occurred with default memory speed settings. Apparently it just becomes less likely.
When I overclock to 5600 it happens daily. With 4800 I experience the crash only about weekly, I suppose.
ASRock confirmed memory and CPU compatibility. This seems to be a hardware defect that is very common among these AMD GPUs.
Actually, just while writing the above I got another green screen with default BIOS settings.
My old 7950X/iGPU had screen freezing when using two monitors (HDMI and DP). The problem can be avoided for some time by changing the iGPU's dynamic memory management mode to static; I had to rely on this because I don't have a dGPU.
The main cause of instability has been the refresh timings of DDR5 memory. There is also a notable difference in IMC capability between a cooler processor (liquid cooling) and a hotter one (air cooling).
Another problem is the processors' default mode, which can be addressed by learning to use PBO with the Curve Optimizer. When tuning these and other per-core parameters, prefer the core-by-core configuration: the all-core option generally doesn't work well, because each core can tolerate very different offsets, up or down.
I believe I already observed the crash with only a single monitor.
The other points you mentioned seem to indicate a bug in AMD processors to me (at least if such tuning is required for stability at default UEFI settings).
I'm experiencing seemingly the same issue without stress testing - system just freezes with a green screen after a few hours at moderate load.
For me this issue also appeared with normal load.
Can you post your full system specs as I did above? Which BIOS settings do you have enabled?
My issue got resolved with an upgrade to linux-6.11
Thanks! Going to try that if it's already compatible with ZFS.
All you posters with the "same" problem, please open a new thread posting all parts and OS. I doubt you really have the same problem. Johnb.
So far with Linux 6.11 I did not get the green freeze again. However, today the amdgpu kernel driver crashed but recovered. Maybe in 6.11 they fixed the hard crash but it still crashes softly? I'm back at 5600 RAM speed, so this situation is still way better than before.
Here is the kernel crash log:
```
[ 6214.221420] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32786)
[ 6214.221428] amdgpu 0000:0d:00.0: amdgpu: in process code pid 42263 thread code:cs0 pid 42275
[ 6214.221429] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000f8f93f800000 from client 0x1b (UTCL2)
[ 6214.221431] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701430
[ 6214.221432] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[ 6214.221433] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 6214.221433] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 6214.221434] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 6214.221434] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 6214.221435] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 6224.332702] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
```
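To track how often these soft recoveries happen across many boots, the relevant lines can be pulled out of the kernel log automatically. The following is a minimal sketch that assumes only the message layout visible in the log above; the regular expressions and the `summarize_amdgpu_log` name are illustrative, and other kernel versions may word these messages differently.

```python
import re

# Three lines copied from the log above, used as sample input.
SAMPLE = """\
[ 6214.221420] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32786)
[ 6214.221428] amdgpu 0000:0d:00.0: amdgpu:  in process code pid 42263 thread code:cs0 pid 42275
[ 6224.332702] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
"""

LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<dev>[0-9a-f:.]+):\s+amdgpu:\s+(?P<msg>.*)$"
)
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)\)"
)
PROC_RE = re.compile(r"in process (?P<proc>\S+) pid (?P<pid>\d+)")
TIMEOUT_RE = re.compile(r"ring (?P<ring_name>\S+) timeout")


def summarize_amdgpu_log(text):
    """Return a list of page-fault and ring-timeout events found in the text."""
    events = []
    fault = None  # most recent page fault, so the process line can be attached
    for raw in text.splitlines():
        line = LINE_RE.match(raw.strip())
        if not line:
            continue
        ts, msg = float(line.group("ts")), line.group("msg")
        found = FAULT_RE.search(msg)
        if found:
            fault = {
                "type": "page_fault",
                "time": ts,
                "device": line.group("dev"),
                "hub": found.group("hub"),
                "vmid": int(found.group("vmid")),
            }
            events.append(fault)
            continue
        found = PROC_RE.search(msg)
        if found and fault is not None:
            fault["process"] = found.group("proc")
            fault["pid"] = int(found.group("pid"))
            continue
        found = TIMEOUT_RE.search(msg)
        if found:
            events.append({
                "type": "ring_timeout",
                "time": ts,
                "device": line.group("dev"),
                "ring": found.group("ring_name"),
                "soft_recovered": "soft recovered" in msg,
            })
    return events


if __name__ == "__main__":
    for event in summarize_amdgpu_log(SAMPLE):
        print(event)
```

Applied to the log above, this pairs the page fault with the process named in the log (`code`, pid 42263) and shows that the ring timeout with soft recovery was logged roughly ten seconds later.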
It's the same problem: the memory may have some wrong timing, so that the memory controller lost or did not find the information requested by the iGPU.
Look again at the default settings of your memory. If I read it correctly, for 5600 MT/s you need to set the voltage to 1.25 V; the higher the clock, the higher the voltage required.
Hi! On which source are you basing your assumptions?
I'm using EXPO profiles which set the voltage correctly.
I'm still suspecting a kernel issue because the CPU is working correctly on Windows with a dedicated GPU. Upgrading to 6.11 reduced the frequency of crashes dramatically (from once every day to once a month) so far. Also, the kernel no longer crashes completely; only the iGPU resets.
All of my hardware is tested to work well together, on Windows at least. So I'm really hoping this is a kernel issue, and I'm also fairly confident of it by now.