igpu
Adept I

9950X iGPU crashing Linux with green screen of death during stress testing

 

CPU: AMD Ryzen 9 9950X

Mobo: ASRock X670E Taichi Carrara, BIOS version v3.0.8 (2024/09/20)

RAM: Corsair CMK192GX5M4B5200C38 4x48=192GB RAM set

RAM Stress Testing was done with Seasonic PX-1600 PSU.

 

For this particular test I picked Debian 12 with Linux kernel v6.1.0-25, but I was also able to reproduce the same issue with Ubuntu 24.04.1 LTS.

 

I ran my tests with default BIOS settings; neither the RAM nor the CPU is overclocked.

 

After approximately 3 hours of running RAM tests (memtester 46G on 4 CPU threads), I can reproduce the green screen crash shown in the video above.
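For anyone who wants to repeat the load, here is a minimal Python sketch of one way to read "memtester 46G on 4 CPU threads": four parallel memtester instances of 46G each (4 x 46G = 184G of the 192G installed). That reading is an assumption, and so is the helper name `memtester_commands`; the post does not show the exact command line. The sketch only builds and prints the commands, it does not start them.

```python
import shlex
from typing import List, Optional


def memtester_commands(instances: int = 4, size: str = "46G",
                       loops: Optional[int] = None) -> List[List[str]]:
    """Build one memtester command line per parallel instance.

    memtester takes the amount of memory to test and, optionally,
    a loop count; without a loop count it runs until it is stopped.
    """
    commands = []
    for _ in range(instances):
        command = ["memtester", size]
        if loops is not None:
            command.append(str(loops))
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Print the commands only. Each one would be started in its own shell,
    # typically as root so that memtester can lock the memory it tests.
    for command in memtester_commands():
        print(shlex.join(command))
```

Leaving a few GB untested, as the 4 x 46G split does, keeps room for the kernel, the desktop and the ssh sessions used for monitoring.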

During the tests I am also connected to the PC via several ssh sessions to monitor all logs in /var/log (I have enabled old-school text logs such as syslog, kern.log and daemon.log).

 

During the crash I do not see a single suspicious entry in any log file. However, when the system crashes everything goes down (including sshd) and I am stuck with a green screen of death (no reboot).

 

The system is properly cooled, and monitoring shows a maximum of 60 degrees Celsius during the crash.

I am not able to reproduce any stability issue when running the exact same setup with an old discrete AMD GPU (Hawaii chipset) and the iGPU explicitly disabled in the BIOS.

 

I do not have much time for these tests, and I am not really bothered by the iGPU issue since I plan to run my PC with dGPU cards and the iGPU disabled. I also do not plan to spend time investigating this in Windows.

I am not sure whether this is a hardware, software or firmware issue with the iGPU of my 9950X CPU.

Since BSD systems basically use the Linux drivers for GPUs, the only alternative on the software side would be to stress test with Windows, but I feel this is a waste of my time. I do not plan to use Windows on this PC.

 

TLDR: I hope AMD can fix this annoying issue in firmware or in the Linux kernel drivers. I hope this is not some bizarre hardware problem with my particular CPU/Mobo/RAM combo: with the iGPU disabled in the BIOS and an external dGPU, my system is very stable and runs such stress tests for more than 24h without any issues at all.

I can spend some time and install any Linux kernel and iGPU driver version you choose for testing purposes, if this helps to pinpoint the problem.

30 Replies
misterj
Big Boss

igpu, this is a user forum and we seldom see AMD people here. Contact AMD support here.

First, your video does not play, so we have almost nothing to go on. There is some Linux knowledge here but not a lot. I know very little Linux. Post all your parts and any error messages or Event Log entries (or equivalent). It would be very helpful if you could reproduce the problem on W11. John.

Thank you for the reply.

I was not able to reproduce the problem with Windows 10 LTSC 2019. The problem is specific to Linux and, since I do not plan to use the iGPU, not really relevant for me.

0 Likes

I also did not experience the issue on Windows.

0 Likes
FunkZ
Grandmaster

Based on past threads, graphics issues with the IGP often turn out to be memory related.

The Corsair memory specs state the 4x48 kit is XMP 3.0 ready for the Intel 700 series chipset. They do not mention AMD compatibility or EXPO. It also seems this kit uses dual-rank modules, which typically cannot attain as high a speed, particularly in a quad configuration.

The ASRock memory QVL for Granite Ridge also does not list this memory kit, nor does it appear that many quad kits have been tested on this board, and none of this density (4x48GB).

The AMD 9950X specs list support for DDR5-5600 with two DIMMs, or DDR5-3600 with four DIMMs. Most chips' IMCs are capable of exceeding these speeds, but those are the officially supported ones.

Have you tested the memory at SPD (4800) or XMP (5200)?

Have you tried manually setting the speed lower?

Have you tested with only 2 sticks of memory installed?

Ryzen R7 5700X | B550 Gaming X | 2x16GB G.Skill 3600 | Radeon RX 7900XT
Ryzen R7 5700G | B550 Gaming X | 2x8GB G.Skill 4000 | Radeon Vega 8 IGP
Ryzen R5 5600 | B550 Gaming Edge | 4x8GB G.Skill 3600 | Radeon RX 6800XT

Thank you for clarifying that DDR5-3600 is supported in quad configuration.

You state the part is not in the QVL, but I picked this particular RAM precisely because the ASRock QVL lists this memory kit:

DDR5 | Corsair | 5200 | 5200 | 48GB | CMK192GX5M4B5200C38 ver 3.53.02 | SpecTek B-die | DS | 2 | 3.06

Albeit in a 2 DIMM configuration.

There seems to be a contradiction here. Why list CMK192... (the 192 in the kit name stands for 192GB of RAM) and use only 2 DIMMs from a 4 DIMM set?

 

Anyhow, I also installed Windows 10 LTSC 2019 and ran a 24-hour test, and my system was stable with the iGPU.

I even ran the Windows memory testing tool and it passed with flying colors. I deleted LTSC from the disk, as it had no purpose other than to confirm there is no hardware issue with my system.

 

DDR5-3600 in quad configuration is very stable and I have no complaints.

I am running my setup with an external dGPU and the iGPU disabled in the BIOS. I have no problems and I am happy with my benchmark and actual workload performance. Here is the output from the dmidecode tool:

 

Memory Device
Array Handle: 0x0010
Error Information Handle: 0x001B
Total Width: 64 bits
Data Width: 64 bits
Size: 48 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR5
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 4800 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMK192GX5M4B5200C38
Rank: 2
Configured Memory Speed: 3600 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 48 GB
Cache Size: None
Logical Size: None

 

 

I provided my hardware information to let anyone reading this know that this particular hardware setup is good to go (in my opinion), but you may encounter issues if running without a dGPU. I suspect GDM (GNOME Display Manager) is a possible culprit, but I did not spend enough time to confirm it. (I switched to lightdm and was not able to reproduce the bug reported in my initial post, but afterwards I disabled the iGPU completely in the BIOS and did not have a chance to do extended testing with the iGPU enabled.)

0 Likes
igpu
Adept I

I have to specify that my RAM kit version is v5.53.13 and this particular version is *not* listed in the QVL, but I am not sure how much a minor version change matters. I was able to boot with all 4 sticks only with BIOS v3.0.8 (v3.0.6 booted only with 2 DIMMs).

Unfortunately, when you order RAM on Amazon you do not see the kit hardware version number.

0 Likes

Hi! I'm experiencing the same issue with my 9950X on Linux. I have not yet found the time to dig deeper. I'm using different memory (clocked at 5600).

0 Likes

 maxammann, memory specifications for the 9950X:

Max Memory Speed
2x1R DDR5-5600
2x2R DDR5-5600
4x1R DDR5-3600
4x2R DDR5-3600
If you want help, list all your parts and any relevant errors. John.
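The table above can be turned into a small lookup to check whether a configured speed is inside the officially supported range. This is only a sketch: the dictionary holds exactly the four values listed in this post, and the function name `within_official_spec` is made up for illustration.

```python
# Officially supported maximum speeds in MT/s, as listed in the post above,
# keyed by (number of DIMMs, ranks per DIMM).
MAX_SUPPORTED_MTS = {
    (2, 1): 5600,
    (2, 2): 5600,
    (4, 1): 3600,
    (4, 2): 3600,
}


def within_official_spec(dimms: int, ranks: int, configured_mts: int) -> bool:
    """Return True if the configured speed does not exceed the listed maximum."""
    limit = MAX_SUPPORTED_MTS[(dimms, ranks)]
    return configured_mts <= limit


if __name__ == "__main__":
    # Four dual-rank DIMMs configured at 3600 MT/s (the 4x48GB kit in this thread).
    print(within_official_spec(4, 2, 3600))
    # The same four DIMMs at their 5200 MT/s XMP rating.
    print(within_official_spec(4, 2, 5200))
```

A result of False does not mean the configuration cannot work, only that it is beyond what the table lists as officially supported.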
0 Likes

Here are my full specs:

Memory: KF560C30BBEK2-64 (2 DIMMs with 32GB each)
CPU: 9950X
Mainboard: B650 Steel Legend WiFi
BIOS version: 3.08
Linux: Linux 6.6.59 #1-NixOS SMP PREEMPT_DYNAMIC Fri Nov 1 00:58:34 UTC 2024 x86_64 GNU/Linux
Mesa: Mesa 24.0.7
Gnome: 46.2

 

My BIOS settings currently overclock the memory to 5600 via EXPO. The memory timings are set to Aggressive.

 

The kit KF560C30BBEK2 is in ASRock's QVL: https://www.asrock.com/MB/AMD/B650%20Steel%20Legend%20WiFi/index.asp#MemoryRAP

 

Linux dmidecode --type 17:

 

 

 

# dmidecode 3.6
Getting SMBIOS data from sysfs.
SMBIOS 3.4.0 present.

Handle 0x0013, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x0010
	Error Information Handle: 0x0012
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL A
	Type: Unknown
	Type Detail: Unknown

Handle 0x0015, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x0010
	Error Information Handle: 0x0014
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 1
	Bank Locator: P0 CHANNEL A
	Type: DDR5
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 4800 MT/s
	Manufacturer: Kingston
	Serial Number: 08136DCF
	Asset Tag: Not Specified
	Part Number: KF560C30-32                   
	Rank: 2
	Configured Memory Speed: 5600 MT/s
	Minimum Voltage: 1.1 V
	Maximum Voltage: 1.1 V
	Configured Voltage: 1.1 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 2, Hex 0x98
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0018, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x0010
	Error Information Handle: 0x0017
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL B
	Type: Unknown
	Type Detail: Unknown

Handle 0x001A, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x0010
	Error Information Handle: 0x0019
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 1
	Bank Locator: P0 CHANNEL B
	Type: DDR5
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 4800 MT/s
	Manufacturer: Kingston
	Serial Number: D4036AE0
	Asset Tag: Not Specified
	Part Number: KF560C30-32                   
	Rank: 2
	Configured Memory Speed: 5600 MT/s
	Minimum Voltage: 1.1 V
	Maximum Voltage: 1.1 V
	Configured Voltage: 1.1 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 2, Hex 0x98
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None
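When comparing several machines, it can help to reduce output like the above to the few fields that matter (slot, size, rated speed, configured speed). Below is a rough Python sketch that parses text in the `dmidecode --type 17` layout shown here. The function names and the shortened `SAMPLE` are illustrative assumptions; the parser simply splits each "Key: Value" line and has only been reasoned through against this layout.

```python
from typing import Dict, List


def parse_memory_devices(text: str) -> List[Dict[str, str]]:
    """Split dmidecode type 17 output into one dict per "Memory Device" block."""
    devices: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current block
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def populated_slots(devices: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the slots that actually hold a module."""
    return [d for d in devices
            if d.get("Size") not in (None, "No Module Installed")]


# Shortened excerpt of the output above, used as sample input.
SAMPLE = """\
Handle 0x0013, DMI type 17, 92 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM 0
\tBank Locator: P0 CHANNEL A

Handle 0x0015, DMI type 17, 92 bytes
Memory Device
\tSize: 32 GB
\tLocator: DIMM 1
\tBank Locator: P0 CHANNEL A
\tSpeed: 4800 MT/s
\tRank: 2
\tConfigured Memory Speed: 5600 MT/s
"""

if __name__ == "__main__":
    for dev in populated_slots(parse_memory_devices(SAMPLE)):
        print(dev["Bank Locator"], dev["Locator"], dev["Size"],
              "rated", dev["Speed"], "configured", dev["Configured Memory Speed"])
```

On a real system the text would come from running `dmidecode --type 17` as root and passing its output to `parse_memory_devices`.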


0 Likes

Thanks, maxammann. I will take a look at the details in time. Have you tried running at the SPD default memory settings? If not, please do.

This reply got lost this morning, sorry. John.

0 Likes

Thanks! Testing with default SPD right now.

0 Likes

Hi misterj, the green screen and system crash just occurred with default memory speed settings. Apparently it just becomes less likely.

When I overclock to 5600 it happens daily. With 4800 I experience the crash only about weekly, I suppose.

ASRock confirmed memory and CPU compatibility. This looks to me like a hardware defect that is very common among these AMD GPUs.

Actually, just while writing the above I got another green screen with these default BIOS settings.

0 Likes
roupa_de_trapo
Adept III

My old 7950X iGPU had screen freezes when using two monitors (HDMI and DP). The problem can be avoided for some time by changing the iGPU's memory management mode from dynamic to static. (I use the iGPU because I don't have a dGPU.)

The main cause of instability has been the refresh timings of DDR5 memory. There is also a notable difference in what the IMC can handle when the processor runs cooler (liquid cooling) or hotter (air cooling).

Another problem is the processors' default mode, which can be addressed by learning to use PBO with the Curve Optimizer. When tuning these and other per-core parameters, prefer the core-by-core configuration: the all-core option generally doesn't work very well, because each core can tolerate very different offsets, up or down.

0 Likes

I believe I already observed the crash with only a single monitor.

 

The other points you mentioned seem to me to indicate a bug in AMD processors (at least if such tuning is required when using default UEFI settings).

lostmsu
Journeyman III

I'm experiencing seemingly the same issue without stress testing - system just freezes with a green screen after a few hours at moderate load.

For me this issue also appeared with normal load.

 

Can you post your full system specs as I did above? Which BIOS settings do you have enabled?

0 Likes

My issue got resolved with an upgrade to linux-6.11
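For readers comparing their own machines against this report, here is a minimal Python sketch that checks whether a kernel release string is at least 6.11. The helper name `kernel_at_least` is hypothetical, and 6.11 is only the version reported in this post, not a confirmed fix.

```python
import platform
import re
from typing import Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int] = (6, 11)) -> bool:
    """Compare the major.minor part of a release string such as '6.1.0-25-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    try:
        print(release, kernel_at_least(release))
    except ValueError as err:
        print(err)
```

Comparing integer tuples avoids the trap of string comparison, where "6.6" would sort after "6.11".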

0 Likes

Thanks! Going to try that if it's already compatible with ZFS.

0 Likes
misterj
Big Boss

All you posters with the "same" problem, please open a new thread posting all parts and OS. I doubt you really have the same problem. Johnb.

0 Likes

So far with Linux 6.11 I have not had the green freeze again. However, today the amdgpu kernel driver crashed but recovered. Maybe 6.11 fixed the hard crash, but it still crashes softly? I'm back at 5600 RAM speed, so this situation is still way better than before.

Here is the kernel crash log:

```
[ 6214.221420] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32786)
[ 6214.221428] amdgpu 0000:0d:00.0: amdgpu: in process code pid 42263 thread code:cs0 pid 42275
[ 6214.221429] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000f8f93f800000 from client 0x1b (UTCL2)
[ 6214.221431] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701430
[ 6214.221432] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[ 6214.221433] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 6214.221433] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 6214.221434] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 6214.221434] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 6214.221435] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 6224.332702] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered

```
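The individual fields the driver prints under `GCVM_L2_PROTECTION_FAULT_STATUS` appear to be bit slices of the raw status word. The Python sketch below decodes them under an assumed layout: the bit positions are inferred from the values in this log (they reproduce PERMISSION_FAULTS 0x3, client ID 0xa and vmid 7 from status 0x00701430) and are not taken from AMD documentation, so treat them as a guess that happens to fit.

```python
# Assumed layout: field name -> (bit offset, width in bits).
# Inferred from the logged values, not from official register documentation.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # client ID, printed as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Extract each assumed bit field from the raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00701430).items():
        print(f"{name}: {value:#x}")
```

Decoding the raw word this way is mainly useful when only the status line survives in a truncated log.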

0 Likes


It's the same problem: the memory may have a wrong timing that causes the memory controller to lose, or fail to find, the data requested by the iGPU.

Look again at the default settings of your memory. If I read them correctly, for 5600 MT/s you need to set the voltage to 1.25 V; the higher the clock, the higher the voltage required.

0 Likes

Hi! On which source are you basing your assumptions?

 

I'm using EXPO profiles which set the voltage correctly.

 

I'm still suspecting a kernel issue because the CPU is working correctly on Windows with a dedicated GPU. Upgrading to 6.11 has reduced the frequency of crashes dramatically so far (from once every day to once a month). Also, the kernel no longer crashes completely; only the iGPU resets.

All of my hardware is tested to work well together, on Windows at least. So I'm really hoping this is a kernel issue, and by now I'm also fairly confident of that.

 

0 Likes

From my own experience. There are cases where the Infinity Fabric clock (FCLK) influences stability, such as when it exceeds 2000MHz.

Some unstable timings may be capable of corrupting the BIOS or the operating system kernel. You can try flashing the BIOS firmware again to make sure there are no problems there. On AM5 boards this procedure does not depend on memory or processor; there are videos on YouTube demonstrating it.

Sometimes the factory overclocking profile can become incompatible, and then the only solution is to claim the memory manufacturer's warranty. Contact them and describe the problem, and contact the motherboard manufacturer too, especially if the memory is in the board's support list.

There are applications that check memory stability, such as TM5:

Release TestMem5 0.13.1 · CoolCmd/TestMem5 · GitHub

0 Likes
secondtry
Journeyman III

A crash with a green screen where the only option is to remove and reconnect the power supply is extremely frustrating.

Is there a name for this phenomenon? The Green Death?

System: AMD 9700X + DeskMini X600 + Kingston 2*48GB, not as per the QVL but as advertised by Kingston configurator. OS: Debian 12

Some applications reveal errors. In the past, I have seen weird things where even adblockers caused The Green Death. Media player mpv: ditto. I saw mpv in journalctl with innocent reports recently and deleted it completely(!). It's unlikely to solve your problem but you could try it.

0 Likes

Which kernel version are you using?

0 Likes
secondtry
Journeyman III

Out of the box...

$ uname -a
Linux ***** 6.1.0-28-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.119-1 (2024-11-22) x86_64 GNU/Linux

0 Likes

9950X here, also on kernel 6.12.8 and 6.6.68. My iGPU has been doing this same thing since I built the system a few weeks ago. Confirmed on two different motherboards:

  • MSI X670E GAMING WIFI
  • MSI 870E CARBON WIFI

Running the latest xorg-server-21.1.15, and confirmed with both xf86-video-amdgpu-23.0.0 and the modesetting video driver.

 

Most of the time, there is nothing written in the kernel log, but I did get some output last night.

 

amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
systemd-timedated.service: Deactivated successfully.
amdgpu 0000:77:00.0: amdgpu: Dumping IP State
amdgpu 0000:77:00.0: amdgpu: Dumping IP State Completed
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601031
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x1
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4c000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601031
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x1
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4c000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:77:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
amdgpu 0000:77:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:77:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:77:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Process 41938 (Xorg) of user 0 terminated abnormally with signal 6/ABRT, processing...
Created slice Slice /system/systemd-coredump.
Started Process Core Dump (PID 43716/UID 0).
Process 41938 (Xorg) of user 0 dumped core.

Stack trace of thread 41941:
#0  0x00007c0e3c5153f4 n/a (libc.so.6 + 0x963f4)
#1  0x00007c0e3c4bc120 raise (libc.so.6 + 0x3d120)
#2  0x00007c0e3c4a34c3 abort (libc.so.6 + 0x244c3)
#3  0x000062ed059d943e n/a (n/a + 0x0)
#4  0x000062ed059da937 n/a (n/a + 0x0)
#5  0x000062ed059e18f8 n/a (n/a + 0x0)
#6  0x00007c0e3c4bc1d0 n/a (libc.so.6 + 0x3d1d0)
#7  0x00007c0e3c5153f4 n/a (libc.so.6 + 0x963f4)
#8  0x00007c0e3c4bc120 raise (libc.so.6 + 0x3d120)
#9  0x00007c0e3c4a34c3 abort (libc.so.6 + 0x244c3)
#10 0x00007c0e3c4a33df n/a (libc.so.6 + 0x243df)
#11 0x00007c0e3c4b4177 __assert_fail (libc.so.6 + 0x35177)
#12 0x00007c0e30aa9cb9 n/a (libglamoregl.so + 0x1dcb9)
#13 0x00007c0e30aafce1 n/a (libglamoregl.so + 0x23ce1)
#14 0x00007c0e30a95e3a glamor_set_pixmap_texture (libglamoregl.so + 0x9e3a)
#15 0x00007c0e30a9601f glamor_egl_create_textured_pixmap_from_gbm_bo (libglamoregl.so + 0xa01f)
#16 0x00007c0e3ba14529 n/a (amdgpu_drv.so + 0x18529)
#17 0x00007c0e3ba0fb31 n/a (amdgpu_drv.so + 0x13b31)
#18 0x000062ed05a11567 n/a (n/a + 0x0)
#19 0x00007c0e3bd2ec0f n/a (libglx.so + 0xcc0f)
#20 0x000062ed059fd94f n/a (n/a + 0x0)
#21 0x000062ed059da9dc n/a (n/a + 0x0)
#22 0x000062ed059e18f8 n/a (n/a + 0x0)
#23 0x00007c0e3c4bc1d0 n/a (libc.so.6 + 0x3d1d0)
#24 0x00007c0e3c5153f4 n/a (libc.so.6 + 0x963f4)
#25 0x00007c0e3c4bc120 raise (libc.so.6 + 0x3d120)
#26 0x00007c0e3c4a34c3 abort (libc.so.6 + 0x244c3)
#27 0x00007c0e399e2fa3 n/a (libgallium-24.3.3-arch1.1.so + 0x9e2fa3)
#28 0x00007c0e399e632c n/a (libgallium-24.3.3-arch1.1.so + 0x9e632c)
#29 0x00007c0e394de824 n/a (libgallium-24.3.3-arch1.1.so + 0x4de824)
#30 0x00007c0e3951369d n/a (libgallium-24.3.3-arch1.1.so + 0x51369d)
#31 0x00007c0e3c51339d n/a (libc.so.6 + 0x9439d)
#32 0x00007c0e3c59849c n/a (libc.so.6 + 0x11949c)

Stack trace of thread 41950:
#0  0x00007c0e3c50fa19 n/a (libc.so.6 + 0x90a19)
#1  0x00007c0e3c512479 pthread_cond_wait (libc.so.6 + 0x93479)
#2  0x00007c0e3951376e n/a (libgallium-24.3.3-arch1.1.so + 0x51376e)
#3  0x00007c0e394de74c n/a (libgallium-24.3.3-arch1.1.so + 0x4de74c)
#4  0x00007c0e3951369d n/a (libgallium-24.3.3-arch1.1.so + 0x51369d)
#5  0x00007c0e3c51339d n/a (libc.so.6 + 0x9439d)
#6  0x00007c0e3c59849c n/a (libc.so.6 + 0x11949c)

Stack trace of thread 42226:
#0  0x00007c0e3c50fa19 n/a (libc.so.6 + 0x90a19)
#1  0x00007c0e3c512479 pthread_cond_wait (libc.so.6 + 0x93479)
#2  0x00007c0e3951376e n/a (libgallium-24.3.3-arch1.1.so + 0x51376e)
#3  0x00007c0e394de74c n/a (libgallium-24.3.3-arch1.1.so + 0x4de74c)
#4  0x00007c0e3951369d n/a (libgallium-24.3.3-arch1.1.so + 0x51369d)
#5  0x00007c0e3c51339d n/a (libc.so.6 + 0x9439d)
#6  0x00007c0e3c59849c n/a (libc.so.6 + 0x11949c)

Stack trace of thread 41984:
#0  0x00007c0e3c50fa19 n/a (libc.so.6 + 0x90a19)
#1  0x00007c0e3c512479 pthread_cond_wait (libc.so.6 + 0x93479)
#2  0x00007c0e3951376e n/a (libgallium-24.3.3-arch1.1.so + 0x51376e)
#3  0x00007c0e394de74c n/a (libgallium-24.3.3-arch1.1.so + 0x4de74c)
#4  0x00007c0e3951369d n/a (libgallium-24.3.3-arch1.1.so + 0x51369d)
#5  0x00007c0e3c51339d n/a (libc.so.6 + 0x9439d)
#6  0x00007c0e3c59849c n/a (libc.so.6 + 0x11949c)

Stack trace of thread 41938:
#0  0x00007c0e3c5961fd syscall (libc.so.6 + 0x1171fd)
#1  0x00007c0e394d0c3b n/a (libgallium-24.3.3-arch1.1.so + 0x4d0c3b)
#2  0x00007c0e394de3e1 n/a (libgallium-24.3.3-arch1.1.so + 0x4de3e1)
#3  0x00007c0e399cacb4 n/a (libgallium-24.3.3-arch1.1.so + 0x9cacb4)
#4  0x00007c0e39700db1 n/a (libgallium-24.3.3-arch1.1.so + 0x700db1)
#5  0x00007c0e390c7f29 n/a (libgallium-24.3.3-arch1.1.so + 0xc7f29)
#6  0x00007c0e3ba06821 n/a (amdgpu_drv.so + 0xa821)
#7  0x000062ed058f9ecc n/a (n/a + 0x0)
#8  0x000062ed059df881 n/a (n/a + 0x0)
#9  0x000062ed05908a72 n/a (n/a + 0x0)
#10 0x000062ed05933d65 n/a (n/a + 0x0)
#11 0x000062ed0594aee4 n/a (n/a + 0x0)
#12 0x000062ed0594ddd2 n/a (n/a + 0x0)
#13 0x000062ed05960572 n/a (n/a + 0x0)
#14 0x000062ed0596137b n/a (n/a + 0x0)
#15 0x000062ed0596331b n/a (n/a + 0x0)
#16 0x000062ed05962411 n/a (n/a + 0x0)
#17 0x000062ed058b819d n/a (n/a + 0x0)
#18 0x00007c0e3c4a4e08 n/a (libc.so.6 + 0x25e08)
#19 0x00007c0e3c4a4ecc __libc_start_main (libc.so.6 + 0x25ecc)
#20 0x000062ed058b86d5 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@0-43716-0.service: Deactivated successfully.
systemd-coredump@0-43716-0.service: Consumed 550ms CPU time, 522.2M memory peak.
X connection to :0 broken (explicit kill or server shutdown).
X11 I/O error handler called
X11 I/O error exit handler called, preparing to tear down X11 modules
dbus-:1.11-org.a11y.atspi.Registry@6.service: Main process exited, code=exited, status=1/FAILURE
dbus-:1.11-org.a11y.atspi.Registry@6.service: Failed with result 'exit-code'.
xfce4-notifyd.service: Main process exited, code=exited, status=1/FAILURE
xfce4-notifyd.service: Failed with result 'exit-code'.
traps: xfsettingsd[43731] trap int3 ip:7de5029e7163 sp:7fffac68d5b0 error:0 in libglib-2.0.so.0.8200.4[6b163,7de50299a000+a6000]
Process 43731 (xfsettingsd) of user 1000 terminated abnormally with signal 5/TRAP, processing...
Started Process Core Dump (PID 43732/UID 0).
Process 43731 (xfsettingsd) of user 1000 dumped core.

Stack trace of thread 43731:
#0  0x00007de5029e7163 g_log_writer_default (libglib-2.0.so.0 + 0x6b163)
#1  0x00007de5029de5e8 g_log_structured_array (libglib-2.0.so.0 + 0x625e8)
#2  0x00007de5029de85f g_log_structured_standard (libglib-2.0.so.0 + 0x6285f)
#3  0x0000566684489387 n/a (n/a + 0x0)
#4  0x00007de502564e08 n/a (libc.so.6 + 0x25e08)
#5  0x00007de502564ecc __libc_start_main (libc.so.6 + 0x25ecc)
#6  0x0000566684489725 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@1-43732-0.service: Deactivated successfully.
Process 43763 (xfsettingsd) of user 1000 terminated abnormally with signal 5/TRAP, processing...
traps: xfsettingsd[43763] trap int3 ip:7e7bf594c163 sp:7ffc406a9e70 error:0 in libglib-2.0.so.0.8200.4[6b163,7e7bf58ff000+a6000]
Started Process Core Dump (PID 43765/UID 0).
Process 43763 (xfsettingsd) of user 1000 dumped core.

Stack trace of thread 43763:
#0  0x00007e7bf594c163 g_log_writer_default (libglib-2.0.so.0 + 0x6b163)
#1  0x00007e7bf59435e8 g_log_structured_array (libglib-2.0.so.0 + 0x625e8)
#2  0x00007e7bf594385f g_log_structured_standard (libglib-2.0.so.0 + 0x6285f)
#3  0x000057f1f3e1b387 n/a (n/a + 0x0)
#4  0x00007e7bf54c9e08 n/a (libc.so.6 + 0x25e08)
#5  0x00007e7bf54c9ecc __libc_start_main (libc.so.6 + 0x25ecc)
#6  0x000057f1f3e1b725 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@2-43765-0.service: Deactivated successfully.
traps: xfsettingsd[43771] trap int3 ip:7346df34c163 sp:7ffdc1b105e0 error:0 in libglib-2.0.so.0.8200.4[6b163,7346df2ff000+a6000]
Process 43771 (xfsettingsd) of user 1000 terminated abnormally with signal 5/TRAP, processing...
Started Process Core Dump (PID 43774/UID 0).
Process 43771 (xfsettingsd) of user 1000 dumped core.

Stack trace of thread 43771:
#0  0x00007346df34c163 g_log_writer_default (libglib-2.0.so.0 + 0x6b163)
#1  0x00007346df3435e8 g_log_structured_array (libglib-2.0.so.0 + 0x625e8)
#2  0x00007346df34385f g_log_structured_standard (libglib-2.0.so.0 + 0x6285f)
#3  0x00005e4dcdde1387 n/a (n/a + 0x0)
#4  0x00007346deee5e08 n/a (libc.so.6 + 0x25e08)
#5  0x00007346deee5ecc __libc_start_main (libc.so.6 + 0x25ecc)
#6  0x00005e4dcdde1725 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@3-43774-0.service: Deactivated successfully.
Process 43780 (xfsettingsd) of user 1000 terminated abnormally with signal 5/TRAP, processing...
Started Process Core Dump (PID 43783/UID 0).
Process 43780 (xfsettingsd) of user 1000 dumped core.

Stack trace of thread 43780:
#0  0x00007a17362ec163 g_log_writer_default (libglib-2.0.so.0 + 0x6b163)
#1  0x00007a17362e35e8 g_log_structured_array (libglib-2.0.so.0 + 0x625e8)
#2  0x00007a17362e385f g_log_structured_standard (libglib-2.0.so.0 + 0x6285f)
#3  0x00005da11a451387 n/a (n/a + 0x0)
#4  0x00007a1735e85e08 n/a (libc.so.6 + 0x25e08)
#5  0x00007a1735e85ecc __libc_start_main (libc.so.6 + 0x25ecc)
#6  0x00005da11a451725 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@4-43783-0.service: Deactivated successfully.
Process 43789 (xfsettingsd) of user 1000 terminated abnormally with signal 5/TRAP, processing...
Started Process Core Dump (PID 43793/UID 0).
Process 43789 (xfsettingsd) of user 1000 dumped core.

Stack trace of thread 43789:
#0  0x000072030d2ec163 g_log_writer_default (libglib-2.0.so.0 + 0x6b163)
#1  0x000072030d2e35e8 g_log_structured_array (libglib-2.0.so.0 + 0x625e8)
#2  0x000072030d2e385f g_log_structured_standard (libglib-2.0.so.0 + 0x6285f)
#3  0x0000624487af8387 n/a (n/a + 0x0)
#4  0x000072030ce85e08 n/a (libc.so.6 + 0x25e08)
#5  0x000072030ce85ecc __libc_start_main (libc.so.6 + 0x25ecc)
#6  0x0000624487af8725 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64

systemd-coredump@5-43793-0.service: Deactivated successfully.
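Long logs like this are easier to compare between reports once the repeated page faults are condensed. Here is a rough Python sketch that counts faults per process and faulting page address; the regular expressions are written against the line format shown in this log, and the function name `summarize_faults` and the three-fault `SAMPLE` excerpt are illustrative assumptions.

```python
import re
from collections import Counter

FAULT_RE = re.compile(r"\[gfxhub\] page fault \(src_id:\d+ ring:\d+ vmid:\d+ pasid:\d+\)")
PROC_RE = re.compile(r"in process (.+?) pid (\d+) thread")
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def summarize_faults(log_text: str) -> Counter:
    """Count page faults keyed by (process name, faulting page address)."""
    counts: Counter = Counter()
    process = None
    for line in log_text.splitlines():
        if FAULT_RE.search(line):
            process = None  # a new fault record starts here
            continue
        proc = PROC_RE.search(line)
        if proc:
            process = proc.group(1)
            continue
        addr = ADDR_RE.search(line)
        if addr:
            counts[(process, addr.group(1))] += 1
    return counts


# Three fault records copied from the log above, used as sample input.
SAMPLE = """\
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4b000 from client 0x1b (UTCL2)
amdgpu 0000:77:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770)
amdgpu 0000:77:00.0: amdgpu:  in process Xorg pid 41938 thread Xorg:cs0 pid 41941
amdgpu 0000:77:00.0: amdgpu:   in page starting at address 0x0000800103a4c000 from client 0x1b (UTCL2)
"""

if __name__ == "__main__":
    for (process, address), count in summarize_faults(SAMPLE).items():
        print(process, address, count)
```

The full text could be fed in from `journalctl -k` or `dmesg` output saved to a file.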

 

0 Likes

To other affected users, please try adding the following kernel boot parameter: amdgpu.vm_fault_stop=0

 

It controls whether the driver stops on virtual memory faults. The zero value tells the driver not to stop but to log the faults, which may provide some debugging clues.
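To confirm that the parameter actually reached the kernel after a reboot, something like the following Python sketch can be used. It parses a kernel command line for `amdgpu.*` options and, as an assumption, also tries to read the live value from `/sys/module/amdgpu/parameters/`, which is expected to exist only while the amdgpu module is loaded. Both function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional


def amdgpu_params_from_cmdline(cmdline: str) -> Dict[str, str]:
    """Collect amdgpu.<name>=<value> options from a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, _, value = token.partition("=")
            params[key[len("amdgpu."):]] = value
    return params


def read_live_value(name: str) -> Optional[str]:
    """Read the current module parameter from sysfs, or None if unavailable."""
    path = Path("/sys/module/amdgpu/parameters") / name  # assumed location
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("from cmdline:", amdgpu_params_from_cmdline(cmdline))
    print("live value:", read_live_value("vm_fault_stop"))
```

If the two values disagree, the boot loader configuration was probably not regenerated after the edit.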

Great, I'm going to enable this just in case it crashes again.

0 Likes

On https://www.kernel.org/doc/html/v4.20/gpu/amdgpu.html I read that the default is 0. That is a bit confusing, but that page is old.

On https://docs.kernel.org/gpu/amdgpu/debugging.html vm_fault_stop is described. Could you elaborate a bit on why you suggest this? If I search with 'journalctl' for "gfxhub", I cannot find an entry so is there a VM fault in play here?

The last link refers to UMR for AMD GPU debugging. I haven't tried that, as it is a bit out of my comfort zone, but has anyone?

0 Likes