liamretrams
Adept II

Threadripper 2990WX Crashes

Hi,


I've got 8 machines with this spec:

  • ThreadRipper 2990WX + DeepCool 240mm CPU cooler (4 x hi-speed fans on the radiator)
  • MSI X399 SLI Plus Motherboard  (Bios A.70, latest at time of writing)
  • G.Skill Aegis 4 Memory 64GB (4 x 16GB Modules) - F4-3000-C16-16GISB
  • nVidia GeForce 1030 GPU - Driver 25.21.41.1881 (Latest at time of writing)
  • Sandisk SSDA480G SSD (480GB 2.5" SATA) - Latest Firmware
  • FSP Hydro G 850W Power Supply
  • Win 10 Pro x64 Build 1809

These machines are used for rendering 3D visualization simulations. This is a CPU-heavy task. The software we use is 3D Studio Max with a range of plugins and add-ons (V-Ray, Corona, etc.).

 

Unfortunately, as soon as the render kicks off, some of the 8 machines will crash within 10-15 minutes. If I wait long enough (48 hours), the rest of the machines will crash too. None of the machines are stable, unlike the Intel i7 4770Ks they replaced, which would run for weeks at 100% CPU happily.

When I say "Crash", I mean the screen goes black, with no image on screen. Keyboard numlock doesn't work. Even the power or reset buttons don't work - I have to physically remove the power cord and plug it back in. There is no bluescreen or anything. It just stops working. The event log simply shows that the previous shutdown at <time> was unexpected.
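Because a hard freeze like this leaves nothing in the event log beyond the "unexpected shutdown" entry, one way to narrow down when a machine died is to leave a small heartbeat logger running during the render. The following is only a minimal sketch: the file name and the `append_heartbeat` / `last_heartbeat` helpers are hypothetical names made up for illustration, not part of any tool mentioned in this thread.

```python
import os
from datetime import datetime, timezone


def append_heartbeat(path, note=""):
    """Append one UTC timestamp line and force it to disk.

    The fsync matters: after a hard freeze, anything still sitting in
    the OS write cache is lost, so each line is flushed immediately.
    """
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{stamp}\t{note}\n")
        f.flush()
        os.fsync(f.fileno())
    return stamp


def last_heartbeat(path):
    """Return (timestamp, note) from the last complete line, or None."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = [ln for ln in f.read().splitlines() if "\t" in ln]
    except FileNotFoundError:
        return None
    if not lines:
        return None
    stamp, note = lines[-1].split("\t", 1)
    return stamp, note
```

Calling `append_heartbeat("heartbeat.log", "rendering")` every few seconds in a loop, then reading `last_heartbeat("heartbeat.log")` after the forced power cycle, gives the last moment the machine was still able to write to disk.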

Things I have tried so far:

- Latest drivers, BIOSes and firmware versions for the SSD

- Reinstalled Windows & drivers

- Ran sfc and dism to check that the system files aren't damaged

- Checked temperatures to ensure the CPU/GPU aren't overheating (they aren't).

- Ran synthetic benchmarks like IntelBurnTest and Memtest86.

- Physically relocated the affected machines to another office to see if this is an environmental issue (didn't help).

During synthetic benchmarking the system will sometimes crash, other times remain stable for 48+ hours.

What is going on here? All 8 machines are doing this. 

0 Likes
136 Replies

No, not yet. I have tried everything. Also tried replacing the board with an X399 AORUS PRO-CF. Tried replacing the GPU, SSD, PSU and RAM (even tried RAM on the QVL).

Nothing has worked so far. We are considering just selling the board + CPU and replacing it with something else at this stage.

0 Likes

I think you should try swapping out your CPU cooler for a different model. I was having black screen crashes and it turned out it was the water cooler software that was doing it. Once I uninstalled the software it stopped crashing.

0 Likes

Hello cgorange, 

I don't have another CPU cooler at the moment, so I can't try this...

One thing is strange: I can run a lot of tests with CPU Burn (hardcore mode) for hours and nothing crashes. With 3dsmax, 15 minutes of rendering and it crashes... I have other computers where everything works fine with 3dsmax (same version, same plugins).

Best, 

Damien

0 Likes

You might want to stress test the CPU with different stress testing software. If it isn't crashing during those stress tests and it's only crashing with Max renders, it's not a problem with the CPU. I was getting crashes only while rendering in Maya and the cause was the Nvidia driver. Rolling back to an old driver fixed it, until Nvidia fixed it in one of the later drivers. So you could try installing the latest Nvidia driver to see if that fixes it.

0 Likes

Doesn't that mean the CPU is at fault? I have the same MB but different RAM and am experiencing the same problems...

hamsel
Journeyman III

Same Problem. Threadripper 2990WX (64GB 4x16GB) crashes after 10-15 minutes of rendering.

Rearranging the RAM did the trick.
I placed the RAM modules next to each other and didn't use the order shown in the mainboard manual (Slot 1-4 instead of Slot 1,3,6,8).

0 Likes

You probably aren't getting dual/quad-channel speed now that you aren't using the other slots. You might want to check them by putting RAM in each of the other slots on its own, to see if your mobo just has a bad slot on the other side.

0 Likes

hamsel wrote:

Same Problem. Threadripper 2990WX (64GB 4x16GB) crashes after 10-15 minutes of rendering.

 

Rearranging the RAM did the trick.
I placed the RAM modules next to each other and didn't use the order shown in the mainboard manual (Slot 1-4 instead of Slot 1,3,6,8).

I have seen the same discrepancy with my X470 board

We tried every possible combination of RAM, including RAM that is on the QVL for our boards. Tried it with 1 stick, tried it with 2, tried it with 4. Nothing works - all crashes. Changing CPU is the only thing that has addressed it.

Currently going through the process of getting all our machine CPUs swapped out.

0 Likes

I assume you have the latest BIOS installed

0 Likes
liamretrams
Adept II

Ok. I know I haven't updated this thread in a while, but I just wanted to say this is fixed. Unfortunately, the fix was "downgrading" all 8 machines to 2950WX (from 2990WX) with ALL other components remaining identical. The systems now run at 100% CPU for weeks on end. We still have the occasional crash/freeze, but it's one every few weeks/months rather than 10-15 minutes from boot.

The only conclusion here is that there is something seriously wrong with the 2990WX chip / some weird interop compatibility between the parts we selected. There is zero chance that all 8 of them were faulty. Thankfully, our supplier has been very sympathetic and helped us a lot in getting to the bottom of this.

Obviously, the amount of money and time wasted on this is astronomical and the whole thing has been extremely unpleasant. I will be approaching AMD systems with extreme caution in the future.

Huge thanks to everyone in this thread that helped.

0 Likes

Glad to see that swapping out the processors has resolved the stability and interoperability problems.

Your solution is rather eye opening. This suggests a further BIOS update is likely.

0 Likes
clsmithj
Adept II

I have had the same motherboard and CPU since September 2018. I did not do my research on X399 motherboards when I bought the SLI Plus. It was the most affordable, and it was MSI, which I like, so I just ran with it.

In hindsight I should have done a little more research, because then I would have realized the best motherboard to get was the pricier MSI X399 MEG Creation. The MEG Creation has 16+3 CPU+PCI VRM phases compared to the SLI Plus's 10+3. I have read many reports claiming the high VRM phase count is what makes the MEG Creation more suitable for the 2990WX and 2970WX CPUs.

  

But despite that, my 2990WX has been largely stable with the SLI Plus for over a year and a half using Precision Boost Overdrive in Ryzen Master.

Only this month have I been pushing my Threadripper a little more, to see how much overclocking performance I could get out of this CPU and budget SLI Plus motherboard.

Case: Thermaltake Core P5 TG Ti

The CPU is cooled with a Cooler Master MasterLiquid 240

64GB (16GB x 4) @ 3000MHz Patriot Viper 4 DDR4 (CL16x18x18x18x36)

2x Radeon VEGA 56 GPUs.

EVGA 1200 P2 Supernova Platinum (1200W). 

Primary storage is a Samsung 960 Evo NVMe, followed by a Sandisk X600 SSD / 3 HDDs totaling 12TB.

 

I mounted an 80mm 60cfm case fan behind the VRM stack, pointing airflow directly at it, which I found is crucial on these X399 boards for overclocking stability. With that, I have managed to get this system to a somewhat stable 4.0GHz overclock at 1.25 - 1.288V.

Now prior to overclocking I had no issues rendering. I could run a Cinebench or Blender Gooseberry render, or run Handbrake or SONY VEGAS encoding, with no issues.
But now that I'm at a 4.0GHz overclock, I see the limits of my Cooler Master 240mm AIO radiator. When I tried to run a Blender Gooseberry render, the temps climbed beyond the 68°C non-OC safety throttle and hit 74-75°C before locking up.

It also crashed in Windows when I tried to run Geekbench 4, but I need to do more testing with this.

So now I'm going to upgrade my 240 AIO to a 360 AIO to see if I can keep this CPU in the 60s for overclock stability.
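For anyone comparing coolers the same way, a logged temperature trace makes the "climbed past 68°C, locked up at 74-75°C" pattern easy to check. Below is a rough sketch under stated assumptions: the sample readings are invented for illustration (they are not measurements from this system), and the helper names are hypothetical. The readings themselves would come from whatever monitoring tool you already export logs from.

```python
def first_exceedance(samples, limit_c):
    """Return the first (seconds, temp_c) sample above limit_c, or None."""
    for seconds, temp_c in samples:
        if temp_c > limit_c:
            return seconds, temp_c
    return None


def heating_rate(samples):
    """Average rise in degrees C per minute from first to last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (c1 - c0) / ((t1 - t0) / 60.0)


# Hypothetical trace: (seconds since render start, CPU temperature in C).
trace = [(0, 45.0), (60, 58.0), (120, 66.0), (180, 70.5), (240, 74.0)]

print(first_exceedance(trace, 68.0))  # (180, 70.5)
print(heating_rate(trace))            # 7.25
```

If a larger radiator keeps `first_exceedance` returning `None` over a full render, the cooler is holding the CPU under the throttle point; a heating rate that never levels off suggests the radiator is saturated.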


Another thing I want to add from my findings from working with the 2990WX is the OS. I'm sure you've all heard about the "Performance Regression" that plagued the WX parts of 2nd gen Threadripper on Windows builds (which may have led AMD to completely alter this HEDT platform when it came to the TR 3000 series).

I dual boot Fedora 31 Linux on this rig, and my 2990WX runs way...WAY MORE STABLE in Linux. I don't think I've had a single crash while booted into Linux since overclocking, compared to the random crashes I get occasionally on the Windows side, which ultimately drove me to increase the voltage in my OC to 1.288V in the BIOS.

Fedora Linux on the other hand ran stable at my initial 1.25V voltage on my 4.0GHz overclock.

I can also run Geekbench 4 in Linux with no problem and get a huge score that's usually 30-50% greater than what I got in Windows. I have not run the Blender Gooseberry render in Linux yet since going to the 4.0GHz overclock. Prior to the OC I did run Blender Gooseberry and had no issue. I just don't want to put too much stress on my CPU until I get a 360mm radiator in place.

So if I had to outline the key issues one might face with the 2990WX:

1. Motherboard VRM stability and adequate cooling. Keeping the VRMs cool is definitely a plus here for stability and overclocks. The MEG Creation's 16 phases will make this even simpler if you have that MB.

2. High watt PSU (80Plus Gold+ preferred). I would not pair this CPU with anything less than 1000W. I use a 1200W 80Plus Platinum; I've seen people online who paired it with a 1600W (overkill). I monitor the watts drawn from the outlet using the PowerChute app that comes with my UPS, and this CPU eats up some watts when loaded, especially now that I am overclocking. The VEGA GPUs are power hungry too, so you don't want to skimp on the power supply.

3. Operating system and its scheduler. Linux runs circles around Windows when it comes to the 2990WX. The only working solution AMD left us with to tackle Windows 10's poor scheduling is Ryzen Master's Dynamic Local Mode. I use Ryzen Master version 1.5.3.0902, and I don't recommend that anyone with a WX 2nd gen Threadripper upgrade the Ryzen Master app any higher than that version, because the Dynamic Local Mode feature was removed in later versions with no explanation from AMD. I can only assume it was because AMD moved on to TRX40 and TR 3000, which does not need DLM since the new Threadripper has been completely reworked internally.
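The PSU sizing in point 2 can be sanity-checked with some rough arithmetic. The sketch below is illustrative only: the wattages are nominal figures treated as assumptions (250 W is the listed TDP of the 2990WX, roughly 210 W is the typical board power of a Vega 56, and the 100 W allowance for board, drives and fans is a guess), and the `overclock_factor` is a hypothetical multiplier, not something measured on this system. Rated TDP is not peak draw, so treat the result as a floor rather than a ceiling.

```python
def psu_load(psu_watts, component_watts, overclock_factor=1.0):
    """Return (estimated_draw_watts, fraction_of_psu_rating).

    component_watts maps a component name to its nominal draw in watts;
    only the "cpu" entry is scaled by overclock_factor.
    """
    cpu = component_watts["cpu"] * overclock_factor
    rest = sum(w for name, w in component_watts.items() if name != "cpu")
    total = cpu + rest
    return total, total / psu_watts


# Assumed nominal figures for the dual-Vega build described above.
build = {"cpu": 250, "gpu_1": 210, "gpu_2": 210, "board_drives_fans": 100}

total, fraction = psu_load(1200, build)
print(total, round(fraction, 2))   # 770.0 0.64

# Hypothetical 50% increase in CPU draw under an all-core overclock.
total_oc, fraction_oc = psu_load(1200, build, overclock_factor=1.5)
print(total_oc, round(fraction_oc, 2))   # 895.0 0.75
```

Measuring at the wall, as described with the UPS software, is the better check; the arithmetic only shows how quickly headroom shrinks once the CPU is pushed past its rated power.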

0 Likes

I use 80 plus platinum, HX1000i can power anything I want for gaming.

Windows scheduling is not very efficient

0 Likes

Hey, how did the better cooler turn out? I'm thinking of trying to overclock my 2990WX to 4GHz.

0 Likes
revis3d
Journeyman III

Exchanging the cooling solution from water cooling (Enermax Liqtech 240) to air cooling (Noctua NH-U14S) made my system stable. CPU temperature has dropped tremendously. I had system shutdowns at 100°C before.
Still need to test reactivating all cores again.

 

ThreadRipper 2990WX

MSI X399 SLI Plus, latest BIOS vA7

Geforce RTX2080

GSkill 2x F4-3600C19D-32GSXKB (4x 16GB modules)

Windows 10 Pro 1809

0 Likes
sujit
Journeyman III

Hi, I too have the exact same problem as angryphoton.

While rendering heavy scenes with 3dsmax and V-Ray, the computer screen goes black and no other devices work, like the keyboard or even the power button or the reset button. To reboot the computer I switch off the main power and switch it on again. Once it has rebooted, it goes off again even for a small scene, or even while opening a scene.

It's quite annoying that, after paying so much for the processor, I am not able to render large scenes and have to rely on render farms and pay them.

0 Likes