PC Graphics

rediske
Adept I

Hardware or Software?

I run Manjaro Linux and I installed a 6700 XT in my system 2 months ago. It worked great for a while, but in the last month or so, I started getting the dreaded long beep followed by three short beeps on boot, which is the code for a GPU hardware error (at least for an ASUS STRIX x470 F GAMING motherboard). Most of the time, the system boots without the beeps, but about once a week I get the error beeps. I have also had roughly 15 crashes. Most of them I can temporarily resolve with a reboot, but there have been at least a few times where the system locked up so tight that it wouldn't even respond to ssh logins (I wanted to log in remotely to check the state of the system and reboot).

I have the latest motherboard BIOS and Manjaro is current. I'm on kernel 5.17. I turned on AMD's Smart Access Memory since my BIOS supports it and I have an AMD 5900X CPU. Could that cause this?

journalctl reports multiple errors like these:

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=619235, emitted seq=619238
kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
kernel: amdgpu 0000:0b:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d5edef800 flags=0x0010]
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

The last 2 kernel errors repeat many times before a core dump and system lockup.
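
One way to get a rough picture of how often each message shows up is to tally them by type. The following is only a minimal Python sketch, not a diagnostic tool: the category names and the `tally_amdgpu_errors` helper are made up for illustration, and the patterns are simply copied from the messages quoted above.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this post.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "fence_timeout": re.compile(r"Waiting for fences timed out"),
    "ring_timeout": re.compile(r"ring \S+ timeout"),
    "io_page_fault": re.compile(r"IO_PAGE_FAULT"),
    "parser_init_failed": re.compile(r"Failed to initialize parser"),
}

# Sample input: a few of the lines from the post.
SAMPLE = [
    "[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=619235, emitted seq=619238",
    "kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)",
    "kernel: amdgpu 0000:0b:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xf7d5edef800 flags=0x0010]",
    "kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!",
]


def tally_amdgpu_errors(lines):
    """Count how many amdgpu log lines match each known pattern."""
    counts = Counter()
    for line in lines:
        if "amdgpu" not in line:
            continue  # ignore unrelated kernel messages
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


for name, count in sorted(tally_amdgpu_errors(SAMPLE).items()):
    print(f"{name}: {count}")
```

Kernel messages from the previous boot can be dumped with `journalctl -k -b -1 --no-pager` and passed to the function line by line. A tally cannot say whether the cause is hardware or software, but it makes it easier to compare before and after a change such as reseating the card or rolling back updates.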

How can I test to see if this is a hardware or software issue?

8 Replies
BigAl01
Volunteer Moderator

The first thing I would do is to try another video card if you have one you can pull from another system.  Also, is there another PCIe slot on the motherboard that you can try the 6700 XT in?  

It might be an overheating issue too, with the CPU or GPU.  How are the temperatures?  Can you provide more airflow by taking off the side panel and blowing air directly into the case from a large fan?

Lastly, maybe your PSU isn't up to the task of running your components.  I am a firm believer in overkill when it comes to sizing a PSU, and I normally go for a 1K Watt PSU if I am running a separate video card.  Your load determines how much power will be drawn from the PSU, much like your foot determines how much gas your engine will burn when you drive.


As Albert Einstein said, "I could have done so much more with a Big Al's Computer!".
BigAl01
Volunteer Moderator

Your error messages indicate problems with the GPU, but insufficient power or too much heat can cause crashes too.

Another thing to try is an update to the video card driver, or perhaps a downgrade to a previous version.


As Albert Einstein said, "I could have done so much more with a Big Al's Computer!".

I think power and temperature should be ok. I have a large, air-cooled case with 3 fans in front, 2 on top and one in the rear. I use Mangohud and its highest recorded GPU temp was 68 degrees. I have a Corsair 850 Watt power supply, but it is getting pretty old, running for 8 years now. I'm not quite ready to blame the PSU, but I'm not opposed to calling it into question.
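
For a second opinion on temperatures outside of a game overlay, the kernel's hwmon interface can be read directly. This is a minimal Python sketch under a few assumptions: that the card's sensors appear under `/sys/class/drm/card0/device/hwmon/` (the `card0` index may differ per system), and that `read_hwmon_temps` is just a helper name invented here. The hwmon convention is that `temp*_input` files hold millidegrees Celsius.

```python
from pathlib import Path


def read_hwmon_temps(device_dir):
    """Return {label: degrees Celsius} for each temp*_input under device_dir/hwmon/hwmon*/."""
    temps = {}
    for input_file in sorted(Path(device_dir).glob("hwmon/hwmon*/temp*_input")):
        # A matching temp*_label file, when present, names the sensor.
        label_file = input_file.with_name(input_file.name.replace("_input", "_label"))
        if label_file.exists():
            label = label_file.read_text().strip()
        else:
            label = input_file.name  # fall back to the file name
        # hwmon reports temperatures in millidegrees Celsius.
        temps[label] = int(input_file.read_text().strip()) / 1000.0
    return temps
```

Calling `read_hwmon_temps("/sys/class/drm/card0/device")` in a loop while the game runs would show whether the temperature climbs just before a lockup. If the path does not exist, the function simply returns an empty dict.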

Having the day off, I decided to poke the bear a bit. I was able to reproduce a lockup three times in a row while playing World of Warcraft. When I first got the 6700 XT, I manually adjusted graphics settings to my liking. Today I decided to troubleshoot by resetting the advanced graphics settings to suggested defaults. That hard crashed my PC, so I held the power button to shut down and reboot. I went straight back into WoW to find that my changes had not been applied, so I reapplied them and hard crashed again. But this time I had taken a picture of the settings that were applied before the screen blacked out. So the third time (for science!), I applied single settings instead of clicking the recommended button. That worked, so I went to the regular graphics settings and clicked recommended settings, which hard crashed my PC again. Each of these hard crashes produced the same amdgpu errors as in my original post.

BigAl01
Volunteer Moderator

What is this "AMD's Smart Access Memory"?  You implied this was enabled prior to the crashes, and now the crashes are happening when you change to recommended settings.  Maybe there is some type of memory access issue.  I would disable AMD's Smart Access Memory for now and try again.


As Albert Einstein said, "I could have done so much more with a Big Al's Computer!".

It's a technology that extends how much GPU memory the CPU can access in one chunk. That's a good point. Now that I know how to reproduce the error, I can turn that off to test. It probably doesn't help WoW much, and that's my most played game.

https://www.amd.com/en/technologies/smart-access-memory
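
To confirm whether the setting actually took effect after toggling it in the BIOS, the size of the card's PCI memory regions can be read from sysfs. This is a minimal Python sketch: `bar_sizes` is a hypothetical helper, the PCI address `0000:0b:00.0` is taken from the log lines earlier in the thread, and it assumes the VRAM aperture is the first region listed, which may not hold on every system. Each line of the sysfs `resource` file holds a start address, end address and flags in hex.

```python
def bar_sizes(resource_text):
    """Return the size in bytes of each region listed in a PCI 'resource' file."""
    sizes = []
    for line in resource_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        start, end, _flags = (int(field, 16) for field in fields)
        # An all-zero entry means the region is unused.
        sizes.append(0 if start == 0 and end == 0 else end - start + 1)
    return sizes


def first_region_gib(resource_text):
    """Size of the first region in GiB (assumed here to be the VRAM aperture)."""
    sizes = bar_sizes(resource_text)
    return sizes[0] / 2**30 if sizes else 0.0
```

Reading `/sys/bus/pci/devices/0000:0b:00.0/resource` and passing its text to `first_region_gib` should give a value at least as large as the card's VRAM when the feature is active, and a much smaller aperture (commonly 256 MiB, so 0.25) when it is off.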

I'd be worried that the hardware is faulty, mostly because you're saying you get errors on boot. Assuming you're turning the PC off overnight and seeing the error when you boot up the next day, that rules out overheating. Assuming it's not going faulty, the 850W Corsair PSU should be plenty. Again, if you're seeing errors on boot, that pretty much rules out overloading the PSU.

Have you tried the simplest test of removing the video card and reseating it? Sounds dumb, I know, but I've seen machines where that fixed the problem, maybe because the card had vibrated a little and was just loose enough to make a bad connection at times.

Similarly, if it's been a few years since you built the PC, it might be worth pulling the CPU cooler and reapplying fresh thermal paste, and pulling and reseating the RAM.

I haven't had time to try much else, so I rolled back the ~630+ updates in Manjaro Linux. I -love- the btrfs filesystem: one click, 5 seconds to restore the snapshot, and one reboot later my OS was back at its previous state.

I only had 90 min to play WoW for testing as their servers were unstable after a patch, but I noticed right away that some of the flickering textures that were randomly popping up every 20-30 minutes did not return. I have another 12 hour work day tomorrow, so I won't be able to test much until Thursday.

I cracked the case open and re-seated the card. That and the rollback fixed the hard crashes, as far as I can tell. I still got one lone amdgpu system error today:

onyx kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3

But I'll take that over dozens of hard errors that repeat themselves. It looks like I'm ok now, but time will tell.