nischese
Journeyman III

3950x random BSOD

I have used the 3950X on a Gigabyte Aorus Pro X570 for about 11 months now. It is impressively strong in multithreaded tasks, but also quite inefficient in single-threaded work above 4.5 GHz core clock.

For the first 4-5 months it worked flawlessly, but then I started getting a random BSOD every other week or so. The indicated cause was different every time and could not be traced to any specific driver or service. The problem has gradually gotten worse, and now I'm up to 4-5 BSODs a day.

It has run at default settings since the beginning (to rule out overclocking and tweaking as causes); only in the last couple of days have I started adjusting clock settings to try to identify the problem.

All this time I've tried every fix I could think of: RAM tests, underclocking the RAM, CPU and GPU stress tests, the newest BIOS (I've gone through about 10 versions), the newest drivers, disk tests, a system file check, uninstalling unused apps, switching apps, virus and malware scans with several different tools, and replacing the PSU, network adapter, mouse and so on. Nothing helped until I noticed that if I limit the CPU to its 3.5 GHz base clock it is rock solid, but allowing turbo clocks soon leads to a BSOD.

It's stable at 100% load on all cores (at 4.1-4.2 GHz) for hours and passes the Ryzen Master stress test without problems, so it is not a general heat problem. It is water-cooled, idles at 31-33 degrees C, and I've never seen the CPU go over 70 degrees C, even in stress tests.

In all these months I have never had a BSOD while gaming; it usually happens during light desktop work, often while dragging or sweeping the mouse.

My suspicion is that some of the cores have degraded and can no longer cope with the 4.7 GHz maximum single-thread clock. One test I would like to do is to lower the maximum P0-state clock to somewhere between 4.2 GHz (where it is stable with all cores under load) and 4.6 GHz to see where the problems start, but how would I go about doing that?

As far as I can see, I can raise the P0-state maximum clock by 0-200 MHz in Ryzen Master, but I cannot lower it. So, how can I lower the P0-state maximum clock? Is it at all possible? In BIOS there is a choice between P0, P1 and P2, but once chosen you cannot do anything with it.

If I had to choose between a 4.7 GHz maximum clock with 4-5 BSODs a day, a stable system at only the 3.5 GHz base clock, and a stable system at a 4.4 GHz maximum clock, the latter would clearly win.

Any suggestions on how to solve this are much appreciated. I've been fighting this problem for 6 months now and I'm getting pretty tired of it.

Craig9080
Journeyman III

Every BSOD I have ever had on my 3950X was down to memory. I see that you have gone over your RAM in detail, but it might be worth trying another set if you can swing it. What is the BSOD error? Another thing to do is a deep dive into your Windows event logs. I would look at the 2-5 minutes before a BSOD and see if you are getting any hardware errors.


Thanks for your reply,

Yes, I have tested the RAM for days, under Windows and with DOS and Linux tools, at both standard and XMP settings. I have also swapped slots, to no avail. I guess I could try running just one memory stick at a time, alternating between the two I have.

The indicated BSOD cause is different every time (or rather, random from a large set of causes) and never even mentions an offending driver, so no clue there.

I have gone over the event logs thoroughly, and came up with nothing. The system just stops, leaving no previous trace in the logs, just the usual "The previous system shutdown... was unexpected" after reboot.

Yesterday I tried ClockTuner for Ryzen (1.1) for the first time, and it seems to give some support to my core-degradation hypothesis, as it indicates poor energy efficiency for all 4 CCXs.

As it turns out, my system cannot even run the diagnostic in CTR reliably (a characterization of the CPU to find a suitable starting point; it starts low, so even the worst cores should be able to keep up). At default core voltage I get an immediate BSOD about every other time I try. I tried several times and got a different indicated BSOD cause every time, just like usual, so I might be onto something there. Despite being an overclocking tool, CTR suggests that I need to LOWER the clock by 25-100 MHz to run stable at default core voltage!

Must say I'm a bit disappointed if a 750 USD CPU is due for replacement after just 11 months.

I'm going to try this tool out some more before I decide what to do; I'm still learning how to use it!

Hello, @nischese - I'm wondering if you have figured anything else out?

I have encountered something similar, and turning off CPU Core Performance Boost seems to make it stable.


Can you post the BSOD errors you are getting?

Sometimes the error shows which hardware is involved.

When I had one bad RAM stick out of four, I was getting BSODs all day long, basically about 3 different BSOD errors. That indicated to me that it might be a RAM issue, since I was getting BSODs while doing an in-place upgrade repair of Windows or installing apps.


Thanks for your reply, @elstaci - I'll capture the errors when BSOD happens again.

The error was different each time and not very obvious. It usually happens when I am typing or selecting a word to copy, and sometimes in a Zoom meeting, but not exclusively.

Are there particular configuration settings that help collect error data?
I just did a fresh installation of the latest Windows 11 22H2. Right now, on system failure it writes a small memory dump, writes an event, and automatically restarts. I should probably uncheck the "Automatically restart" box...

I did install the latest AMD chipset driver (4.09.23.507), but I did not install any software from Gigabyte.

My old installation is still there (dual boot, until I finish transferring data) so I can try to see if the past errors are in the system log.

I do have 4 sticks of RAM and run XMP, just to get to 3200 MHz. The rest of the settings are default, except that I turned on SVM to run some Hyper-V VMs. How can I find out whether there is a bad stick?


Try a CMOS clear first, which puts your BIOS back to default settings, and see if it fixes your issue. If it does, and the only thing you changed was enabling XMP for your RAM, then you can troubleshoot that part of the BIOS.

Also, is your RAM listed in your motherboard's QVL for your processor, on the RAM manufacturer's site, or on AMD's RAM list?

If you believe it is a RAM issue try using just one RAM stick to see if the BSOD stops.

You can use MEMTEST86 from a USB Drive or use Windows Memory Diagnostics.

Download a free program which I use for BSODs called "Who Crashed":  https://www.resplendence.com/whocrashed  or BLUE SCREEN VIEWER: https://www.nirsoft.net/utils/blue_screen_view.html

That tells you the BSOD error by reading your Windows Crash logs.

Also you can look at Windows Event Viewer under "Error" to see if you find anything common that might be causing your BSODs.

Another very useful tool is DXDIAG.exe: run it and save its output file to your computer. When you open that file, go to the last category. It will show you all the files that are having issues.

Try running the following command in an elevated Command Prompt or PowerShell to check the integrity of your Windows installation: SFC /scannow. Since you did a fresh install of Windows 11 it is probably fine, but you can still run it in case it was corrupted by a hardware issue.

I agree: disable "Automatic Restart", otherwise it might be difficult to see what error you have or to enter the BIOS immediately afterwards.

I have mine disabled.


Okay I didn't realize this was an old AMD Thread. But the tips are still valid.


Many BSODs could be either software (Driver) or Hardware issues including overheating or power issues.

Download OCCT and run all 3 stress tests to see if you get a BSOD while running them. That might give you a general idea of where to troubleshoot.


Thank you very much for all the tips, @elstaci - I'll try them and report back.

Deemon69
Elite

Good day!

I think your processor is fine. Since the motherboard is a Gigabyte, the cause of this behavior is somewhat different. In short, you need to flash the BIOS in DOS mode (pure DOS in legacy mode). For flashing, use a black USB 2.0 port rather than a white one, and the flash drive itself must be no newer than USB 2.0. Reset the BIOS to default settings before the update and again after it. After updating the BIOS, test.


Thank you @Deemon69 - although the BIOS already has the latest Gigabyte BIOS firmware version, I tried the following before:

- Shut down the PC
- Pressed the CMOS Reset button
- Powered on the PC, went into the BIOS, and saw the "BIOS was reset" message
- Inserted a USB 2.0 stick carrying the latest firmware into a USB 2.0 (black) port
- Used BIOS Q-Flash UI to update (the same) firmware 36f

Since you mentioned DOS mode, I'll try to use Rufus to make a bootable DOS USB stick.
I see the firmware's zip file has an autoexec.bat with: Efiflash X570AOMA.36f

Is there a need to downgrade the firmware and then update to the latest? Just in case re-flashing the same version isn't expected to work?


Good day!

Yes, create a bootable USB flash drive with FreeDOS using Rufus. Before copying the BIOS files to the flash drive, rename the autoexec.bat that comes with the BIOS, for example to start.bat. After booting from the flash drive, just type start at the command line and don't touch anything else. The BIOS can be flashed to any version, but I recommend the latest.

After updating the BIOS, I recommend reinstalling the drivers. Download the drivers from the Gigabyte website. Later you can download from the AMD site, but the first installation must be from the Gigabyte site.

BT-22
Adept I

@elstaci - here are a few things I found:

BlueScreenView showed 2 different errors: IRQL_NOT_LESS_OR_EQUAL and SYSTEM_SERVICE_EXCEPTION
(There may have been other errors, but I have only 2 minidump files; the older ones were deleted.)

In the event viewer I can find:
Error,10/26/2022 11:01:04 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000000a...
Error,10/26/2022 1:18:37 AM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000003b...
Error,10/24/2022 7:48:31 AM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x1000007f...
Error,10/18/2022 3:18:08 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x00000050...
Error,10/6/2022 8:38:00 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x1000007f...
Error,10/3/2022 3:14:52 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000001a...
Error,9/30/2022 1:22:59 AM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000000a...
Error,9/27/2022 7:30:38 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000003b...
Error,9/27/2022 3:32:11 PM,Microsoft-Windows-WER-SystemErrorReporting,1001,None,"The computer has rebooted from a bugcheck. The bugcheck was: 0x0000000a...
However, I do not see more clues (no errors prior to reboot), besides "The previous system shutdown... was unexpected".
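With this many entries, a short script can tally which bugcheck codes recur. This is a minimal sketch in Python, assuming the events were exported as text lines in the format shown above (shortened to "..." in the middle here); `tally_bugchecks` is a hypothetical helper, not part of Event Viewer or any tool mentioned in the thread.

```python
import re
from collections import Counter

# Matches the hex value after "The bugcheck was:" in a WER-SystemErrorReporting 1001 message.
BUGCHECK_RE = re.compile(r"The bugcheck was: (0x[0-9A-Fa-f]+)")


def tally_bugchecks(lines):
    """Count how often each bugcheck code appears in exported Event Viewer lines."""
    counts = Counter()
    for line in lines:
        match = BUGCHECK_RE.search(line)
        if match:
            # int(..., 16) accepts the "0x" prefix, so 0x0000000a and 0xA count as the same code.
            counts[int(match.group(1), 16)] += 1
    return counts


# The nine events listed above, with the middle of each line shortened to "...".
events = [
    'Error,10/26/2022 11:01:04 PM,...,1001,None,"... The bugcheck was: 0x0000000a...',
    'Error,10/26/2022 1:18:37 AM,...,1001,None,"... The bugcheck was: 0x0000003b...',
    'Error,10/24/2022 7:48:31 AM,...,1001,None,"... The bugcheck was: 0x1000007f...',
    'Error,10/18/2022 3:18:08 PM,...,1001,None,"... The bugcheck was: 0x00000050...',
    'Error,10/6/2022 8:38:00 PM,...,1001,None,"... The bugcheck was: 0x1000007f...',
    'Error,10/3/2022 3:14:52 PM,...,1001,None,"... The bugcheck was: 0x0000001a...',
    'Error,9/30/2022 1:22:59 AM,...,1001,None,"... The bugcheck was: 0x0000000a...',
    'Error,9/27/2022 7:30:38 PM,...,1001,None,"... The bugcheck was: 0x0000003b...',
    'Error,9/27/2022 3:32:11 PM,...,1001,None,"... The bugcheck was: 0x0000000a...',
]

for code, n in tally_bugchecks(events).most_common():
    print(f"0x{code:08X}: {n} occurrence(s)")
```

For these nine events, 0x0000000A shows up most often (3 times), but no single code dominates, which matches the "different every time" pattern described earlier in the thread.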

I also ran DxDiag and the last section "Diagnostics" showed:
- MoAppCrash for Microsoft.PowerShell_7.2.7.0_x64__8wekyb3d8bbwe
- crashpad_log for MicrosoftEdgeUpdate.exe
- ScriptedDiagFailure for Microsoft Windows.NetworkDiagnostics.4.0

Surprisingly, sfc /scannow found an issue and repaired it:
2022-10-30 16:06:40, Info DEPLOY [Pnp] Corrupt file: C:\Windows\System32\drivers\bthmodem.sys
2022-10-30 16:06:40, Info DEPLOY [Pnp] Repaired file: C:\Windows\System32\drivers\bthmodem.sys
Not sure if that has to do with my manually installing the latest Intel Wireless Bluetooth driver.
Right now the WiFi driver shows 22.40.0.7, while the version I installed was 22.170.0.2.
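To check whether everything sfc flagged was actually repaired, the log lines can be paired up. A minimal sketch, assuming the lines were copied from the log sfc writes to (usually %windir%\Logs\CBS\CBS.log) and follow the "[Pnp] Corrupt file:" / "[Pnp] Repaired file:" format quoted above; `sfc_summary` is a hypothetical helper.

```python
import re

# Matches entries such as "[Pnp] Corrupt file: C:\...\bthmodem.sys" in the format shown above.
SFC_RE = re.compile(r"\[Pnp\] (Corrupt|Repaired) file: (.+)")


def sfc_summary(lines):
    """Return (corrupt, repaired, still_corrupt) file lists from sfc log lines."""
    corrupt, repaired = [], []
    for line in lines:
        match = SFC_RE.search(line.strip())
        if match:
            kind, path = match.groups()
            (corrupt if kind == "Corrupt" else repaired).append(path)
    # A file reported corrupt but never reported repaired needs further attention.
    still_corrupt = [path for path in corrupt if path not in repaired]
    return corrupt, repaired, still_corrupt


log_lines = [
    r"2022-10-30 16:06:40, Info DEPLOY [Pnp] Corrupt file: C:\Windows\System32\drivers\bthmodem.sys",
    r"2022-10-30 16:06:40, Info DEPLOY [Pnp] Repaired file: C:\Windows\System32\drivers\bthmodem.sys",
]
corrupt, repaired, still_corrupt = sfc_summary(log_lines)
print("corrupt:", corrupt)
print("repaired:", repaired)
print("still corrupt:", still_corrupt)
```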

With OCCT, I did 10 minutes of each: CPU / Linpack / Memory / 3D Standard / VRAM / Power
Nothing went wrong.

I'll continue to check out other tips, especially around RAM.

 

FYI only: I went to Microsoft's Bug Check site:

Bug Check:

0xA - The IRQL_NOT_LESS_OR_EQUAL bug check has a value of 0x0000000A. This indicates that Microsoft Windows or a kernel-mode driver accessed paged memory at an invalid address while at a raised interrupt request level (IRQL). This is typically the result of either a bad pointer or a pageability problem.

0x3B - The SYSTEM_SERVICE_EXCEPTION bug check has a value of 0x0000003B. This indicates that an exception happened while executing a routine that transitions from non-privileged code to privileged code.

0x7F - The UNEXPECTED_KERNEL_MODE_TRAP bug check has a value of 0x0000007F. This bug check indicates that the Intel CPU generated a trap and the kernel failed to catch this trap.

This trap could be a bound trap (a trap the kernel is not permitted to catch) or a double fault (a fault that occurred while processing an earlier fault, which always results in a system failure).

0x50 - The PAGE_FAULT_IN_NONPAGED_AREA bug check has a value of 0x00000050. This indicates that invalid system memory has been referenced. Typically the memory address is wrong or the memory address is pointing at freed memory.

0x1A - The MEMORY_MANAGEMENT bug check has a value of 0x0000001A. This indicates that a severe memory management error occurred.
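The event log reports 0x1000007F, while the reference lists 0x7F. A minimal sketch of mapping the logged values to the names quoted from Microsoft's reference, assuming that a set 0x10000000 bit marks the "_M" variant of the base code (Microsoft documents 0x1000007F as UNEXPECTED_KERNEL_MODE_TRAP_M; applying the rule to other codes is an assumption). The table holds only the five codes from this thread, and `bugcheck_name` is a hypothetical helper.

```python
# Names for the five codes quoted from Microsoft's Bug Check reference.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1A: "MEMORY_MANAGEMENT",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
}

# Assumption: a set 0x10000000 bit marks the "_M" variant of the base code
# (documented for 0x1000007F, UNEXPECTED_KERNEL_MODE_TRAP_M).
VARIANT_FLAG = 0x10000000


def bugcheck_name(code):
    """Map a logged bugcheck value (hex string or int) to a name from the table."""
    value = int(code, 16) if isinstance(code, str) else code
    suffix = ""
    if value & VARIANT_FLAG:
        value &= ~VARIANT_FLAG  # clear the variant bit to get the base code
        suffix = "_M"
    name = BUGCHECK_NAMES.get(value)
    return name + suffix if name else f"not in table (0x{value:X})"


for logged in ["0x0000000a", "0x0000003b", "0x1000007f", "0x00000050", "0x0000001a"]:
    print(logged, "->", bugcheck_name(logged))
```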

Try booting into a Clean Windows Desktop to see if the BSODs are caused by a 3rd party Startup program.

It is easy to do and undo. Here is how to do it: How to perform a clean boot in Windows 

BT-22
Adept I

No crashes for almost 5 days, but today while I was typing in Slack I got the first BSOD since reinstalling Windows.

Crash dump file: C:\Windows\Minidump\110422-20937-01.dmp (Minidump)
Bugcheck code: 0x3B(0xC0000005, 0xFFFFF333A18362E8, 0xFFFF85849E82FC30, 0x0)
Bugcheck name: SYSTEM_SERVICE_EXCEPTION
Driver or module in which error occurred: win32kbase.sys (win32kbase+0x362E8)
File path: win32kbase.sys
Description: Base Win32k Kernel Driver
Product: Microsoft® Windows® Operating System
Company: Microsoft Corporation
Bug check description: This indicates that an exception happened while executing a routine that transitions from non-privileged code to privileged code.
Analysis: This is a typical software problem. Most likely this is caused by a bug in a driver. Since there is no other responsible driver detected, it is suggested that you look for an updated driver for your graphics hardware. It's also possible that your graphics hardware was non-functional or overheated.
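The four values in parentheses after 0x3B can be decoded a little further. A minimal sketch, assuming the parameter layout given in Microsoft's reference for SYSTEM_SERVICE_EXCEPTION (parameter 1 is the exception code, parameter 2 the address of the faulting instruction, parameter 3 the address of the context record); `parse_bugcheck` and `NTSTATUS_NAMES` are hypothetical names, and the status table contains only the one value seen here.

```python
import re

# Matches "0x3B(0xC0000005, 0x..., 0x..., 0x0)" as printed in the WhoCrashed report above.
BUGCHECK_LINE_RE = re.compile(r"0x([0-9A-Fa-f]+)\(([^)]*)\)")

# Only the one NTSTATUS value seen in this thread; 0xC0000005 is STATUS_ACCESS_VIOLATION.
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def parse_bugcheck(text):
    """Return (code, [param1, param2, param3, param4]) from a 'code(p1, p2, p3, p4)' string."""
    match = BUGCHECK_LINE_RE.search(text)
    if match is None:
        raise ValueError(f"no bugcheck found in {text!r}")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck(
    "Bugcheck code: 0x3B(0xC0000005, 0xFFFFF333A18362E8, 0xFFFF85849E82FC30, 0x0)"
)
exception, fault_address, context_record, _ = params
print("exception:", NTSTATUS_NAMES.get(exception, hex(exception)))

# WhoCrashed reports the fault at win32kbase+0x362E8, so subtracting that offset
# from the faulting-instruction address gives where the module was loaded for this crash.
module_base = fault_address - 0x362E8
print(f"win32kbase loaded at 0x{module_base:X}")
```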

 

I do have the latest video card driver and there shouldn't be overheating issue.

Over this weekend I'll check out the various tips I've gotten here.

BT-22
Adept I

Just want to give an update.

I followed @Deemon69's suggestion to re-flash the BIOS, and it seems to have solved a BIOS-related issue: previously, a shutdown usually lost the BIOS settings. Now it doesn't.

However, the BSODs continue. Even while trying to install a fresh copy of Windows 11, I got 2 BSODs during the installation process. Once I had a successful installation with minimal software, I still got BSODs.

I followed many of @elstaci's suggestions, and in the end there were a few error codes, but nothing pointing to a particular driver or piece of software. That got me thinking that it must be the RAM.

Eventually I realized that although I have 4 RAM sticks, I bought them in 2 orders: one from Amazon, and the other from NewEgg a couple of months later, when I decided to double my memory. I later found that while they are the same brand, model and spec (I'd expect everything to be identical), they are not from the same manufacturer:

  1. 2 sticks from 1 order are made by SK Hynix
  2. 2 sticks from another order are made by Micron Technology

They also have different firmware versions printed on the surface of the stick.

So, when I had the first 2 sticks, I followed the instructions to put them in the 2nd and 4th slots.
However, when I got the other 2 sticks later, I simply put them in the 2 empty slots.

Then I finally read that it's best to put the sticks from the same manufacturer in the same channel.

So I rearranged them: the 2 sticks from one order in slots 1 and 2, and the 2 sticks from the other order in slots 3 and 4.
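The rearrangement can be expressed as a check that no memory channel mixes manufacturers. A minimal sketch, assuming a board where slots 1-2 form one channel and slots 3-4 the other (the layout the post implies; confirm it in the motherboard manual), and assuming for illustration that the SK Hynix pair was the first order. `SLOT_CHANNEL` and `mixed_channels` are hypothetical names.

```python
# Hypothetical slot-to-channel map (slots numbered 1-4 as in the post); check the manual.
SLOT_CHANNEL = {1: "A", 2: "A", 3: "B", 4: "B"}


def mixed_channels(layout):
    """Return the channels whose sticks come from more than one manufacturer.

    layout maps slot number -> manufacturer of the stick in that slot.
    """
    makers_by_channel = {}
    for slot, maker in layout.items():
        makers_by_channel.setdefault(SLOT_CHANNEL[slot], set()).add(maker)
    return sorted(ch for ch, makers in makers_by_channel.items() if len(makers) > 1)


# Original layout: first pair in slots 2 and 4, second pair added in slots 1 and 3.
before = {2: "SK Hynix", 4: "SK Hynix", 1: "Micron", 3: "Micron"}
# Rearranged layout: each pair shares a channel.
after = {1: "SK Hynix", 2: "SK Hynix", 3: "Micron", 4: "Micron"}

print("mixed before:", mixed_channels(before))
print("mixed after:", mixed_channels(after))
```

Under this assumed map, the original layout put one stick of each manufacturer in both channels, while the rearranged one keeps each pair together.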

I haven't had a BSOD in 15 days so far, and Windows has now been up for 9 days since the last boot.

I hope the issue has been addressed, but I'll only draw a conclusion if I still have no BSOD by the end of this month/year.

I disabled Windows Update to avoid a reboot this month/year. I'm also recording what software I install or update on which day, and waiting a few days before installing or updating something else, just to keep track in case that helps.

I want to thank you both @elstaci and @Deemon69 for your help! I'll give another update later this month.


If you are going to add RAM, it is best to purchase a single RAM kit with the number of sticks you need, to prevent incompatibility issues.

Especially if they are from different RAM Manufacturers. I bet if you just use 2 sticks from the same manufacturer you won't have any more issues.

But good troubleshooting in finding out that your RAM sticks are not the same.