
Processors

liamretrams
Adept II

Threadripper 2990WX Crashes

Hi,


I've got 8 x machines with this spec:

  • ThreadRipper 2990WX + DeepCool 240mm CPU cooler (4 x hi-speed fans on the radiator)
  • MSI X399 SLI Plus Motherboard  (Bios A.70, latest at time of writing)
  • G.Skill Aegis 4 Memory 64GB (4 x 16GB Modules) - F4-3000-C16-16GISB
  • nVidia GeForce 1030 GPU - Driver 25.21.41.1881 (Latest at time of writing)
  • Sandisk SSDA480G SSD (480GB 2.5" SATA) - Latest Firmware
  • FSP Hydro G 850W Power Supply
  • Win 10 Pro x64 Build 1809

These machines are used for rendering 3D visualization simulations. This is a CPU heavy task. The software we use is 3D Studio Max with a range of plugins and add-ons (v-ray, corona etc)

 

Unfortunately, as soon as the render kicks off, some of the 8 machines will crash within 10-15 minutes. If I wait long enough (48 hours), the rest of the machines will crash too. None of the machines are stable, unlike the Intel i7 4770Ks they replaced - those would run for weeks at 100% CPU happily.

When I say "crash", I mean the screen goes black, with no image on screen. The keyboard numlock doesn't work. Even the power and reset buttons don't work - I have to physically remove the power cord and plug it back in. There is no bluescreen or anything. It just stops working. The event log simply shows that the previous shutdown at <time> was unexpected.
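For anyone who wants to timestamp each lockup across several machines, the "previous shutdown was unexpected" entries can be pulled out of the System log with a script. This is only a minimal Python sketch. It assumes the relevant records are Event ID 6008 (EventLog) and Event ID 41 (Kernel-Power) and that `wevtutil` is on the PATH; the function names are made up for illustration.

```python
import subprocess
import sys
from typing import List, Sequence

# Assumption: these two IDs are the ones Windows logs after a hard lockup
# (6008 = "previous system shutdown ... was unexpected", 41 = Kernel-Power).
UNEXPECTED_SHUTDOWN_IDS = (41, 6008)


def build_query(event_ids: Sequence[int] = UNEXPECTED_SHUTDOWN_IDS) -> str:
    """Build an Event Log XPath filter that matches any of the given IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clause})]]"


def build_command(max_events: int = 20) -> List[str]:
    """Return the wevtutil call that lists the newest matching events as text."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{build_query()}",
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # read newest first
        "/f:text",            # human-readable output
    ]


def recent_unexpected_shutdowns(max_events: int = 20) -> str:
    """Run the query on a Windows machine and return the raw text output."""
    if sys.platform != "win32":
        raise RuntimeError("wevtutil is only available on Windows")
    result = subprocess.run(
        build_command(max_events), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `recent_unexpected_shutdowns()` on each machine gives a list of crash times that can be compared against when the render jobs started.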

Things I have tried so far:

- Latest drivers, bioses and firmware versions for the SSD

- Reinstalled windows & drivers

- Run sfc and dism to check that the system files aren't damaged

- Checked temperatures to ensure the CPU/GPU aren't overheating (they aren't).

- Run synthetic benchmarks like IntelBurnTest and Memtest86.

- Physically relocated the affected machines to another office to see if this is an environmental issue (didn't help).

During synthetic benchmarking the system will sometimes crash, other times remain stable for 48+ hours.

What is going on here? All 8 machines are doing this. 

0 Likes
136 Replies

I am currently having crashing problems with my 2990WX which I haven't figured out yet. Originally I had 2 sets of 4 x 8GB sticks in my computer, using all 8 DIMM slots for a total of 64GB of RAM.  I found that my RAM is on my motherboard's QVL, but the list says it isn't supported with 8 DIMMs.  I took one of the sets out and am now using 4 sticks of 8GB, which the QVL says is supported, and it's still crashing.  Your system seems to have problems related to how many DIMMs you use? Besides that, your RAM isn't on any QVL. It's a little weird that your RAM is only sold as single sticks. I don't have any proof, but I agree with the other people who have posted that you might want to try a set of RAM that is on the QVL, and I'm interested to see the results.

0 Likes

I have all 8 DIMM slots filled, with 64 GB of RAM. Did you run the Windows 10 memory test routine? There are a couple of links with information on different evaluation tools in the other thread I posted. I would suggest you get one of the monitoring programs and have it track all your system resources; then, when a crash occurs, you can look through the data and see if there's any drop or problem within the hardware itself. The processor is power-hungry.
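The idea of logging system data and reading it back after a crash could be sketched roughly as below. This is an assumption-laden Python outline, not a real monitoring tool: `read_sensors` is a hypothetical callback you would have to supply (for example, from whatever monitoring utility you already use), and the column names are placeholders. The one real point is that each row is flushed and synced to disk, so the last samples survive a hard lockup.

```python
import csv
import os
import time
from typing import Callable, Dict, List


def append_sample(path: str, sample: Dict[str, float]) -> None:
    """Append one timestamped row and force it onto disk before returning."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    columns = sorted(sample)
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(["unix_time", *columns])
        writer.writerow([f"{time.time():.1f}", *(sample[c] for c in columns)])
        handle.flush()
        os.fsync(handle.fileno())  # survive a sudden power-off style crash


def last_rows(path: str, count: int = 5) -> List[List[str]]:
    """Return the final `count` data rows, i.e. the readings just before a crash."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    return rows[1:][-count:]


def log_forever(path: str,
                read_sensors: Callable[[], Dict[str, float]],
                interval_s: float = 1.0) -> None:
    """Poll the (hypothetical) sensor callback until the machine goes down."""
    while True:
        append_sample(path, read_sensors())
        time.sleep(interval_s)
```

After a lockup and reboot, `last_rows("crashlog.csv")` shows what temperatures or voltages looked like in the final seconds.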


0 Likes
liamretrams
Adept II

It has now been several weeks, running with only 2 sticks of ram instead of 4 while I wait for the QVL 64G kit to be delivered. Systems are stable - we're happy.

I didn't realize this system was so picky with memory. Took forever to get to the bottom of it because I thought this was a hardware / software fault.

It was not the PSU

0 Likes

liamretrams,


 

It was not the PSU

What makes you say this?  I have seen no evidence.  Have you tried a 1000 Watt supply?  Running with half the memory or half the cores significantly reduces the load on the power supply.  If the evidence is buried up above somewhere, I apologize.  Enjoy, John.

0 Likes

I have got a similar setup. Crashes very often but not every time when starting to render with 32 cores enabled. But runs completely stable with downcore 24 cores activated.

MSI support said it is probably the PSU and memory would not matter if it's correctly recognized. So changed the PSU from Seasonic 850W to EVGA 1000W. Still crashing. So I can confirm that it is probably not the PSU.

Please let me know if your QVL memory kit is working. Thank you.

ThreadRipper 2990WX

MSI X399 SLI Plus, latest BIOS vA7

Geforce RTX2080

GSkill 2x F4-3600C19D-32GSXKB (4x 16GB modules)

Windows 10 Pro 1809
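The 32-core versus 24-core comparison described in this post can be roughly approximated in software before touching the BIOS downcore setting. The Python sketch below is only an illustration: it loads a chosen number of worker processes with simple integer arithmetic, which is not the same as a real render load and not the same as disabling cores in firmware. The function names are hypothetical.

```python
import multiprocessing as mp
import time
from typing import List


def burn(seconds: float) -> int:
    """Keep one core busy for roughly `seconds`; return the loop count."""
    deadline = time.monotonic() + seconds
    loops = 0
    value = 1
    while time.monotonic() < deadline:
        # Cheap integer work so the core never idles.
        value = (value * 1103515245 + 12345) % 2147483648
        loops += 1
    return loops


def run_load(workers: int, seconds: float) -> List[int]:
    """Run `burn` in `workers` processes at once and collect their loop counts."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

A possible comparison is `run_load(24, 1800)` against `run_load(32, 1800)`. On Windows the call has to sit under an `if __name__ == "__main__":` guard in a script, because worker processes are started by re-importing the module.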

0 Likes

revis3d, I strongly urge you to open another thread for your problem, which may be quite different.  I almost did not find your reply because it showed up far from the bottom of the thread.  It seems that all the forums I use have some strange feature that makes it really hard to use.  Please post a screenshot of Ryzen Master (R) under heavy load - simply drag-n-drop the image into your reply.  I also agreed with the MSI that your power supply was too low power - probably OK now (I will look at RM).  Please tell us more about your setup - CPU cooler, version of W10, video card, drives, AGESA version.  Is your memory on the QVL?  Here are my system specifications:

MSI X399 Creation, Threadripper 2990WX, 3xSamsung SSD 970 EVO RAID0, 4xSSD 960 EVO on MSI AeroXpander RAID10, 1TB & 500 GB WD Black, G.SKILL Flare X F4-3200C14Q-32GFX, Windows 10 x64 Pro, EnerMax-MaxTytan-EDT1250EWT, Enermax Liqtech TR4 280 CPU Cooler, Radeon RX580, Aquantia 10 Gbps Ethernet NIC, UEFI E7B92AMS.120, AGESA SummitPI-SP3r2-1.1.0.2.  Thanks and enjoy, John.

0 Likes

Misterj, the memory is not on the QVL. Support said it does not matter as long as it is recognized.

Here is a screenshot under load (CPU-Z stress test). It never crashes while stressing in CPU-Z.

pastedImage_1.png

Specs are posted above. I completed the list:

ThreadRipper 2990WX

MSI X399 SLI Plus, latest BIOS 7B09vA7 - Update AGESA Code 1.1.0.2

GAINWARD Geforce RTX2080 (Nvidia)

GSkill 2x F4-3600C19D-32GSXKB (4x 16GB modules)

Windows 10 Pro version 1809

CPU Cooler Enermax Liqtech TR4 240

PSU EVGA GQ 1000W Gold PS

3x Samsung SSD 970 EVO 500GB, M.2 (MZ-V7E500BW)

2x Seagate BarraCuda Compute 4TB, 3.5", 256MB, SATA 6Gb/s (ST4000DM004) 

0 Likes

Here is another screenshot in the situation when it often crashes:

pastedImage_2.png

0 Likes

Sorry, revis3d, I missed your response because it shows up in the middle of a long thread.  Right now the only problem I can see is with the memory.  It is not on the QVL and it is not a quad kit.  TR should have a quad memory kit, not 2x dual kits.  G.Skill has always been very fair with me.  You might see if you can swap them your 2 kits for 1 (Quad) of memory like mine, for example.  It is on the QVL, contains Samsung B-Dies and runs XMP-2 (3200 MHz) no sweat.  I am not stuck on my memory and there are many great ones, but I do think Ryzen and Samsung B-Dies get along really well.  Does your video card pull lots of power?  Is there any difference if you change Dynamic Local Mode?  I assume you are aware you could run faster if you increase PPT (VOIDS Warranty) and TDC (VOIDS Warranty).  Please let me know if you want to talk about it.  I do not work for AMD.  Thanks and enjoy, John.

0 Likes

You should probably make your own thread. It would be easier to debug.  I don't think it's a ram problem because it doesn't crash with less cores. Can you describe the crashing?  Is it only apps that hang? black screen? blue screen? any errors? 

0 Likes

cgorange, I am curious why you are responding to me?  Interesting!  I say it looks like a memory problem and you say:

I don't think it's a ram problem because it doesn't crash with less cores.

What makes you think running less cores causes crashes?  Please provide a reference.  Maybe we can get to the bottom of this idea.

Thanks and enjoy, John.

0 Likes

He said, "But runs completely stable with downcore 24 cores activated."

So I mean if it's running fine with fewer cores, it's probably not a RAM compatibility problem. It would probably still crash if it was. Possibly it's an OC speed problem.

0 Likes

cgorange, that could mean many things.  The OP said nothing about OCing.  This makes no sense to me:

So I mean if it's running fine with less cores it's probably not a ram compatibility problem.

Running less cores will reduce the load on the VRM (only one for all cores), reduce load on power supply.  Thanks and enjoy, John.

0 Likes

Only two G.SKILL memory kits were tested as compatible by MSI for the 2990WX: Support For X399 SLI PLUS | Motherboard - The world leader in motherboard design | MSI Global. The 2400 MHz kit is dual, whereas the 3200 MHz kit is quad.  You have a 50/50 chance that your RAM is compatible with the Ryzen and the motherboard. Your RAM may be missing from the QVL either because it wasn't compatible or because it wasn't tested for compatibility.

Try manually lowering the RAM Speed in BIOS and see if it continues to crash with all cores enabled.

Note: The highest RAM Speed seen in the QVL list is 3200 MHz. You posted your RAM Speed is 3600 MHz by your Part Number.
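Reading the rated speed and kit type out of the part number, as done above, can be automated. The Python sketch below assumes the usual G.SKILL DDR4 pattern `F4-<speed>C<CAS><S|D|Q>-<capacity>G...`, where S/D/Q is taken to mean a single, dual or quad kit and the capacity is for the whole kit. That reading of the naming scheme is my assumption, and the helper names are made up; the 3200 default is just the highest QVL speed mentioned above.

```python
import re
from typing import NamedTuple, Optional


class Kit(NamedTuple):
    speed_mts: int        # rated speed from the part number
    cas_latency: int
    modules: int          # 1, 2 or 4 sticks in the kit
    kit_capacity_gb: int  # total capacity of the kit


_MODULES = {"S": 1, "D": 2, "Q": 4}
_PATTERN = re.compile(r"^F4-(\d{4})C(\d{2})([SDQ])-(\d+)G")


def parse_gskill(part_number: str) -> Optional[Kit]:
    """Parse a part number in the assumed format; return None if it does not fit."""
    match = _PATTERN.match(part_number.strip().upper())
    if match is None:
        return None
    speed, cas, kit_letter, capacity = match.groups()
    return Kit(int(speed), int(cas), _MODULES[kit_letter], int(capacity))


def exceeds_qvl(kit: Kit, qvl_max_mts: int = 3200) -> bool:
    """True when the kit is rated faster than the fastest QVL entry."""
    return kit.speed_mts > qvl_max_mts
```

For example, `parse_gskill("F4-3600C19D-32GSXKB")` comes back as a dual kit rated at 3600, which `exceeds_qvl` flags, while `F4-3200C14Q-32GFX` parses as a quad kit at 3200. Part numbers that are written differently simply return `None`.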

EDIT: Even G-SKILL doesn't list your RAM for your MSI Motherboard using G-SKILL RAM CONFIGURATOR: G.SKILL - RAM Configurator . The highest RAM Speed listed is 2933 MHz. (Partial list from RAM CONFIGURATOR):

0 Likes

I'd do what elstaci says and maybe go down to 1 stick of RAM, but you may have to buy another RAM set that's on the QVL and swap it in to really test it. 3600 is high for non-QVL RAM. Definitely go down to the default, non-OC speeds.

cgorange
Adept II

I had a lot of crashing with my 2990wx build.. I thought it was my ram. thought it was my mobo. thought it was the psu.. turns out 99% chance it was the software CAM I used on my water cooler. After uninstalling that boom. everything is great. So..... never know.. I even reinstalled my 8 sticks of ram of 2 different sets and its working great.  1 set is on the QVL list but 2 isn't. I'm guessing because they didn't test running 2 different sets. 

If you get the new RAM, put it in, and still have problems, I would reinstall Windows and try with minimum software.

I am tempted to agree it could be the PSU, but my comp says it only uses 500W while rendering and I have a more power-hungry GPU than you (2080 Ti), so I feel like an 850W PSU should be enough.
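The headroom argument here is simple arithmetic, sketched below in Python. The 500 W figure is the reported draw from this post, and 850 W and 1000 W are the supply ratings mentioned in the thread. Note the assumption baked in: a reported average draw says nothing about short transient spikes, so this is a sanity check rather than proof.

```python
def load_fraction(draw_watts: float, rated_watts: float) -> float:
    """Fraction of the PSU's rated output used by the reported draw."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return draw_watts / rated_watts


# Reported draw while rendering, against the two PSU sizes discussed.
for rated in (850, 1000):
    print(f"{rated} W supply: {load_fraction(500, rated):.0%} loaded")
```

On those numbers the 850 W unit would be running at a bit under 60% of its rating, and a 1000 W unit at 50%.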

0 Likes

Ryzen used to be much more sensitive to memory.  When TR first came out supporting 128 GB, no one could get it to even boot.  AMD has worked on its memory controller code and it is much better now.  All users should be sure to run the latest BIOS containing the very latest AGESA.  Right now I am running AGESA SummitPI-SP3r2-1.1.0.2 which is the latest offered by MSI.  Enjoy, John.

EDIT: CPU-Z will reveal the BIOS and AGESA versions installed.

0 Likes

My gigabyte motherboard has only one version of their bios... which came out in Oct last year.. think that has it?  

0 Likes

I use MSI and my B350M Bazooka has had something like 15 ROMs since it came out

Their Live Update tool finds drivers and BIOS updates automatically

Surprised that Gigabyte does not do that

0 Likes

Yeah my motherboard the Gigabyte Aorus Pro x399 doesn't have much info anywhere and not many real reviews.. It's almost like it doesn't exist.. So I'm not surprised that it's not getting updates. But I do wonder. 

0 Likes

Your Motherboard is only 6 months old which probably explains the lack of information.  BIOS version "F1" is the original BIOS, as you mentioned, that came out on 10/01/2018 which means that is when the Motherboard was sold to the public. So most likely, the BIOS has all the updates fairly current, at least from 6 months ago.

I imagine Gigabyte will come out with newer BIOS versions as more CPUs and hardware come out or when AMD updates the Ryzen microcode or AGESA.

0 Likes

cgorange, please post your AGESA version.  Thanks and enjoy, John.

0 Likes

Where can you see that info?

0 Likes

cgorange, in CPU-Z under Mainboard.  Please see my post several above.  Thanks and enjoy, John.

0 Likes
liamretrams
Adept II

Ok. This is still ongoing.


In a nutshell, we've narrowed it down to the system working ok-ish (but still crashing, however usable for our workloads) with 32GB RAM but will not work stable with 64GB RAM.

I acquired the F4-2400C15Q-64GVR which is a 64GB kit that is listed on the Asus website as compatible for ThreadRipper with this board. This has had the effect of making the system even more unstable.

 

I am going to change the motherboard.

Yeah, I didn't think it was the ram since it was stable when using less cores. It's worth testing the mobo but the mobo says its compatible with the cpu and since you have 8 machines running. I doubt you got 8 bad mobos. All my instability and crashing was because of bad Kraken software for my water cooler. As crazy as that sounds. Once I uninstalled the software my problems were fixed.  

0 Likes

I wonder how well my cheap G.Skill NT series RAM would work

I bought this due to limited funds, turns out it was a good deal

0 Likes

liamretrams, what is your cooling solution?

Thank you for posting, I will not buy another set of RAM.

0 Likes

 DeepCool 240mm CPU cooler (4 x hi-speed fans on the radiator)

0 Likes

liamretrams wrote:

 DeepCool 240mm CPU cooler (4 x hi-speed fans on the radiator)

that should easily keep your processor cool as a cucumber

0 Likes
ad_ws_tx
Adept I

Hi.

I am curious to know if you ever managed to get this issue resolved.  One thing that I noticed is that you have Nvidia 1030 video cards in your build.  I am curious to know if you have them hooked up to monitors via displayport.

In troubleshooting a problem that just popped up on my build, I came across a few threads that indicated that some NVIDIA cards have issues correctly dealing with sleep and wake over displayport.  An NVIDIA app exists to check for this issue:

https://www.nvidia.com/object/nv-uefi-update-x64.html

Note that the NVIDIA fix actually requires a firmware flash to the card, not an OS driver update.

Anyway, the reason I bring this up is that my Threadripper system is also used on long (hours to days) processing loads.  It's worked like a charm until recently when it started exhibiting random crashes and hangs.  After troubleshooting a bunch of things I had to come back to "what changed recently"?  And what changed was that I moved my desk, and switched from an HDMI cable from an NVIDIA 1050 to a displayport cable.  Then the crashes started happening.  Sometimes after a few hours.  Sometimes after more than 24 hours.

Anyway, I switched back to HDMI and I'm waiting to see if it helps.  Since I was prowling forums looking for clues I thought I might pass this on.

Best,

-A

0 Likes

Reading the notes on the firmware page, it sounds like once the OS is booted there aren't any crashes. So it's more a problem with booting.  These other crashes happen when the computer is already booted, so in my opinion they sound like different problems.

But that is really interesting.  If you go a week or two without any problems, I would plug the DisplayPort cable back in, and if it happens again then you really know it's the DisplayPort.  I'm very interested in the results. 

0 Likes

Just an update on my issue: switching back to HDMI resolved the crashes.  While the release notes on the NVIDIA firmware update mention failure to boot, the problem that I encountered was actually during long runs.  My projects often run for many hours or days, so the monitor normally goes to sleep.  Periodically I jiggle the mouse to wake the display up and check on the progress.  I have updated the firmware on the NVIDIA card and will now retry the DisplayPort.

The relevance to this thread is that my motherboard's EZ Debug light had indicated "CPU Fail", so it originally appeared to be a CPU-related crash.  I am sure there are many affected NVIDIA cards out there, so others may experience this problem as well.

0 Likes

Did the firmware update of your NVIDIA card turn out well?

0 Likes

Hello revis, 

I don't know about the firmware. I just know that the drivers are up to date. Can you tell me what I have to do to see if the firmware is up to date? 

Best

Damien

0 Likes

It seems to have.  I have not yet done an extended test with the displayport connection after doing the upgrade, but here is where we stand right now:

Workload - high CPU utilization simulations that run for 24 hours, 16 cores:

[Prior to Nvidia firmware update]

HDMI connection - no issues

Displayport w/ Monitor that supports DP 1.3/1.4 - crashes after several hours, motherboard indicates CPU failure

HDMI connection - no issues

[Post firmware update]

HDMI connection - no issues

Displayport - testing now.

Just a note that my issue is probably different from that of liamretrams and brozio67.  I posted here because I was also looking around desperately for help when I stumbled on the NVIDIA / DisplayPort issue.  This is not a Threadripper-specific problem.  But because the motherboard (MSI X399 Carbon AC) diagnostics *said* it was a CPU problem, I initially went down that troubleshooting path.  It is an unexpected failure mechanism and therefore hopefully something that is useful for others to check.  My failure mode is different from the one described in the NVIDIA release notes but appears to have the same or a similar root cause.  Also, again, remember that this is actually a video card *firmware* update, not an operating system driver update.

Graphics Firmware Update for DisplayPort 1.3 and 1.4 Displays | NVIDIA 

0 Likes

Those firmware notes only talk about nothing being displayed on screen before boot.  I suspect that's because the Nvidia drivers take over after boot.  So it sounds like a different problem than one causing crashes while booted, but it's definitely interesting.  

0 Likes

Final update on my issue (crashes during long processing runs):

[Prior to Nvidia firmware update]

HDMI connection - no issues

Displayport w/ Monitor that supports DP 1.3/1.4 - crashes after several hours, motherboard indicates CPU failure

HDMI connection - no issues

 

[Post firmware update]

HDMI connection - no issues

Displayport - no issues

To me it appears to be a sleep/wake issue with the monitor over the newer displayport protocols.  During my runs the monitor falls asleep and I periodically jiggle the mouse every couple of hours to check on things.
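The test matrix above lends itself to a quick tabulation. The Python sketch below just encodes the five runs as reported in this post and lists which combinations crashed; the type and function names are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Run(NamedTuple):
    connection: str         # "HDMI" or "DisplayPort"
    firmware_updated: bool  # NVIDIA card firmware flashed?
    crashed: bool


# The runs as reported above, in order.
RUNS = [
    Run("HDMI", False, False),
    Run("DisplayPort", False, True),
    Run("HDMI", False, False),
    Run("HDMI", True, False),
    Run("DisplayPort", True, False),
]


def failing_configs(runs: List[Run]) -> List[Tuple[str, bool]]:
    """Return the distinct (connection, firmware_updated) pairs that crashed."""
    return sorted({(r.connection, r.firmware_updated) for r in runs if r.crashed})


print(failing_configs(RUNS))
```

Only the DisplayPort run on the old firmware shows up as failing, which matches the conclusion that the firmware update was the fix.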

Anyway, issue resolved and my machine is back to running simulations that span multiple days without issue.

Best,

0 Likes

EVGA has a firmware update posted but none of my cards needed it

0 Likes
brozio67
Journeyman III

Hi liamretrams

Did you find out what's going wrong with your crashes? I have exactly the same problem (I only have one computer). 

My computer crashes randomly when I render with 3ds Max 19 and Corona Renderer.

My specs are: 

_ MSI X399 SLI PLUS

_ Samsung 860 EVO 500GB

_ Antec High Current Gamer 750W Bronze

_ Corsair Hydro Series - H80i v2

_ G.Skill Aegis DDR4 8x16 GB 3000MHz CAS 16

_ RTX 2070 Twin X2

_ AMD Ryzen Threadripper 2990WX

_ Microsoft Windows 10 Home

I really need help!!!

Best, 

Damien

0 Likes