zoso
Adept I

Threadripper 2990WX hardlocking - how to proceed

Issue: My new TR2990WX build (specs below) is hard locking after a few hours of load: the screen shows a "No signal" message, sound stops, and the power button becomes unresponsive, but the fans run at 100%. The only way to reboot is to switch off the PSU.

Reproduction: Today I found a way to reproduce it within a couple of minutes: the Prime95 torture test will do. Both the blend test and the small FFT test (which doesn't involve RAM, but runs from L2 exclusively) take 5-10 minutes before my machine hardlocks.

Things it's not:

  • A (simple) thermal problem: All temps are below 60°C at all times
  • The PSU. As one of the first steps I swapped out my Corsair AX860i for an AX1600i. It didn't change a thing.
  • The RAM. Multiple full passes of MemTest86 and the "small FFT" Prime95 test both indicate RAM is not involved.
  • AISuite 3. Even after a fresh "clean" install of Windows the problem persists.
  • A faulty Windows install. I did a clean Windows install (no drives except system, no internet, CMOS cleared, BIOS flashed back, as few USB devices as possible etc.) without success.
  • The BIOS version
  • The driver versions (all up to date) - that's not to say it isn't a driver issue, of course
  • OC: Everything has been on stock settings all along
  • The GPU. This one is not 100% certain, but the "Prime95 Small FFT" reproduction makes it unlikely.

Question: How do I proceed? When I got in touch with ASUS support (before I found a way to reproduce the issue so nicely) they narrowed it down to three things:

1. A faulty Windows install (which I eliminated as a possible cause, see above)

2. The Motherboard

3. The CPU

I think there might be other causes, but I agree the two remaining ones are the most likely ones.

How do I proceed? I do have a spare GPU, but no spare MB or CPU.

Does the "Prime95 Small FFT" failure point to a faulty CPU?

Hardware:

CPU: Threadripper 2990WX

GPU: ASUS ROG STRIX 2080 OC 8GB

MB: ASUS ROG Zenith Extreme Alpha

CPU Cooler: bequiet! Dark Rock Pro TR4

SSDs/HDs: Intel Optane 900P 480G (System), Samsung EVO 970 2T, 2 more SATA SSD, 2 WD 4T SATA HDs

RAM: 128GB Corsair CMK128GX4M8A2666C16 (on the QVL List)

Case: bequiet! DarkBase 700 with 4 SilentWings 3 high-speed fans

PSU: Corsair AX1600i (replaced an AX860i)

misterj
Big Boss

zoso, I have not read yet but will.  Please post a screenshot of Ryzen Master (RM) - simply drag-n-drop the image into your reply.  I have a 2990WX and it runs great!  Thanks, be back soon, and enjoy, John.

EDIT: I am back.  Please go to Event Viewer - Windows Logs - System, use Filter Current Log... to show Critical errors, and let us see what you see.  I really need to see the screenshot of RM.  Are you running RAID(s)?  Are you using any of ASUS's silly enhancer OCer or XMP replacement?  I ran Prime95 the other day until the system started throttling and it ran fine.  Please look in your BIOS - CBS - Zen Common Options - Power Supply Idle Control and set it to Typical Idle Control and retest.  Here are my specifications:

MSI X399 Creation, Threadripper 2990WX, 3xSamsung SSD 970 EVO RAID0, 4xSSD 960 EVO on
MSI AeroXpander RAID10, 1TB & 500 GB WD Black, G.SKILL Flare X F4-3200C14Q-32GFX,
Windows 10 x64 Pro, EnerMax-MaxTytan-EDT1250EWT, Enermax Liqtech TR4 280 CPU Cooler,
Radeon RX580, Aquantia 10 Gbps Ethernet NIC, UEFI E7B92AMS.120,
AGESA SummitPI-SP3r2-1.1.0.2.

I went to water cooling several generations ago and have never looked back.  Thanks and enjoy, John.

0 Likes

misterj, thanks! Screenshot below: pastedImage_1.png

Meanwhile, I've come to think of two more things to try:

1. Re-seating the CPU

2. Looking for bent socket pins in the process

If neither helps, I think either the CPU or the MB is faulty, but I'm still not sure which.

0 Likes

misterj, Screenshot of RyzenMaster was in my original reply. I'm not in a position to post additional screenshots, as unfortunately, after re-seating the CPU my system doesn't post any more (Q-Code "Code 00 - Detecting Memory"), so there seems to be something wrong with either the board or the CPU. 

Before that, there was nothing in the Event Log around the time of the hardlocks, just the "Critical" message for an unexpected shutdown (Event ID 41, I believe). There were, however, ACPI 15 entries (Event ID 56) from Application Popup in groups of three after each system start - I think that's indicative of the BIOS, drivers, and Windows not being in sync.
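
Since the Event Log only records the unexpected shutdown after the fact, one way to pin down when a hardlock actually happens is a small heartbeat logger that runs alongside Prime95. The following is only a minimal sketch, not something used in this thread: the file name `heartbeat.log`, the 5-second interval, and the function names are all assumptions chosen for illustration.

```python
import os
import time
from datetime import datetime, timezone
from typing import Optional


def write_heartbeat(path: str, note: str = "") -> str:
    """Append one UTC timestamp line to `path` and ask the OS to flush it."""
    line = f"{datetime.now(timezone.utc).isoformat()} {note}".rstrip() + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        # Request that the OS write the data out, so the last line has a
        # better chance of surviving a hard lock (the drive's own cache
        # may still delay it).
        os.fsync(f.fileno())
    return line


def run(path: str = "heartbeat.log",
        interval_s: float = 5.0,
        count: Optional[int] = None) -> None:
    """Write a heartbeat every `interval_s` seconds; forever if count is None."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        time.sleep(interval_s)
        written += 1


if __name__ == "__main__":
    # Hypothetical usage: start this, then start the torture test.
    run()
```

After the next forced power cycle, the last line in the file gives the approximate time of the lock, to within one interval.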

0 Likes

zoso, this forum has a serious clock problem and I did not see your reply about re-seating your processor before today.  I would have suggested that you NOT!  Now you have a real problem.  I would really like to have seen the ID41 errors.  Did it look like this?

Log Name: System
Source: Microsoft-Windows-Kernel-Power
Date: 5/2/2019 12:27:18 PM
Event ID: 41
Task Category: (63)
Level: Critical
Keywords: (70368744177664),(2)
User: SYSTEM
Computer: xxxxxxxxxxxxxxx
Description:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-Kernel-Power" Guid="{331c3b3a-2005-44c2-ac5e-77220c37d6b4}" />
<EventID>41</EventID>
<Version>6</Version>
<Level>1</Level>
<Task>63</Task>
<Opcode>0</Opcode>
<Keywords>0x8000400000000002</Keywords>
<TimeCreated SystemTime="2019-05-02T17:27:18.212709700Z" />
<EventRecordID>3904</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="8" />
<Channel>System</Channel>
<Computer>xxxxxxxxxxxx</Computer>
<Security UserID="xxxxxxxxx" />
</System>
<EventData>
<Data Name="BugcheckCode">26</Data>
<Data Name="BugcheckParameter1">0x61941</Data>
<Data Name="BugcheckParameter2">0x0</Data>
<Data Name="BugcheckParameter3">0x0</Data>
<Data Name="BugcheckParameter4">0x0</Data>
<Data Name="SleepInProgress">0</Data>
<Data Name="PowerButtonTimestamp">0</Data>
<Data Name="BootAppStatus">0</Data>
<Data Name="Checkpoint">41</Data>
<Data Name="ConnectedStandbyInProgress">false</Data>
<Data Name="SystemSleepTransitionsToOn">1</Data>
<Data Name="CsEntryScenarioInstanceId">0</Data>
<Data Name="BugcheckInfoFromEFI">true</Data>
<Data Name="CheckpointStatus">0</Data>
</EventData>
</Event>
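
For anyone comparing their own export against the one above, here is a minimal Python sketch that pulls the bugcheck fields out of the event XML. The function name and the trimmed sample string are illustrative assumptions; only the namespace and field names come from the export itself. Note that `BugcheckCode` is stored as a decimal number, so the 26 shown above corresponds to stop code 0x1A. A value of 0 is generally taken to mean no bugcheck was recorded, as with a hard hang or power loss.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <Event xmlns="..."> attribute in the export above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def summarize_kernel_power_event(event_xml: str) -> dict:
    """Extract the event ID and bugcheck fields from one exported event."""
    root = ET.fromstring(event_xml)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
    # Collect every <Data Name="...">value</Data> pair under <EventData>.
    data = {
        d.get("Name"): (d.text or "")
        for d in root.findall("e:EventData/e:Data", NS)
    }
    code = int(data.get("BugcheckCode", "0"))  # decimal in the XML
    return {
        "event_id": event_id,
        "bugcheck_code_dec": code,
        "bugcheck_code_hex": f"0x{code:08X}",
        "bugcheck_param1": data.get("BugcheckParameter1", ""),
        "bugcheck_recorded": code != 0,
    }


# Trimmed copy of the export above, kept short for illustration.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System><EventID>41</EventID></System>
<EventData>
<Data Name="BugcheckCode">26</Data>
<Data Name="BugcheckParameter1">0x61941</Data>
</EventData>
</Event>"""

if __name__ == "__main__":
    print(summarize_kernel_power_event(SAMPLE))
```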

I was getting this until I set 'Typical Idle Control'.  Did you at least find the 'Power Supply Idle Control' in your BIOS?  I am using a modified BIOS to see it and it has stopped all Critical errors.  You should get none.

Everybody has a different opinion about required power.  I do not think 860 Watts is nearly enough especially with 128 GB memory.  I have only 32 GB and use a 1250 Watt supply.  If nothing else it limits your future use.

I have no good ideas for what to do now.  I guess you could call ASUS and see if they will have some sympathy for you.  If you want to try to debug it, you should start by removing your processor and look for bent pins and TIM where it should not be.

Please let us know if you get it to POST.  I don't think you need to change your RAM speed/timings.  Unless ASUS has messed with your RAM parameters it should be fine at SPD.  The next time you are running 'full load' please post a screenshot of RM.  Thanks and enjoy, John.

0 Likes

Thanks for the advice, even if it reaches me too late. I can't get it to POST anymore. I looked for bent pins, but it's all as it should be. There was some excess TIM on top, but nothing inside. And yes, the critical messages looked exactly like this.

Got in touch with ASUS support, and they advised me to RMA the board first, but I'll send in the CPU as well so as not to waste any more time. If it turns out to be OK, it's going to cost me a small service fee. If it turns out not to be, it'll have saved me a lot of time.

0 Likes

Thanks, zoso.  Was the Stop Code in the Critical error 0x026?  Did your BIOS have  'Power Supply Idle Control' and 'Typical Idle Control'?  This may be the solution to your original problem.  It sure solved mine.  Thanks and enjoy, John.

0 Likes

I think I resolved my issue (the no-POST issue, and I also think the stability issue), and I wanted to give an update for anyone in a similar situation.

I was about to RMA my board when I noticed that it had been bent - and quite a bit - by resting on some rubber cable passages that are part of my case. While the case works for E-ATX, it's really built with ATX boards in mind, so there was no point in RMAing that board. I got a replacement anyhow and removed the rubber pieces that had gotten in the way - and now I've got a working system that just ran Prime95 Small FFT for 20 minutes straight before I stopped it. While that's no proof my problem is solved, it's at least twice as long as I ever managed before and the first time I stopped it manually, so I'd say it's a strong indication, especially together with the obvious physical damage to the original board.

Thanks again to everyone for their input!

0 Likes

Thanks so much, zoso.  I hope it holds up.  Please let us hear.  Enjoy, John.

0 Likes

I have an AX860i and it can easily handle the hardware you have.

First thing to do is check for a BIOS update. Next up is to check the RAM timings to be sure they're conservative. I suggest using a lower speed to be sure that the CPU can handle it easily.

0 Likes

hardcoregames:

I mainly swapped out the PSU because I suspected it to be faulty and wanted to verify, so currently the AX1600i is part of the build - but that might change later.

I'm not OCing anything, so my RAM sits @2133MHz - RAM temps are in the low 40s under full load, and it freezes in Prime95 "Small FFT" which does not involve RAM. So once it posts again - provided it does - I might try something even slower, but I don't really expect it to change much.

0 Likes

zoso wrote:

hardcoregames:

I mainly swapped out the PSU because I suspected it to be faulty and wanted to verify, so currently the AX1600i is part of the build - but that might change later.

I'm not OCing anything, so my RAM sits @2133MHz - RAM temps are in the low 40s under full load, and it freezes in Prime95 "Small FFT" which does not involve RAM. So once it posts again - provided it does - I might try something even slower, but I don't really expect it to change much.

If Prime95 is choking, try backing off on your CPU clock multiplier a tad and see if that stabilises it better.

0 Likes