Hi,
I've got 8 x machines with this spec:
These machines are used for rendering 3D visualization simulations, which is a CPU-heavy task. The software we use is 3D Studio Max with a range of plugins and add-ons (V-Ray, Corona, etc.).
Unfortunately, as soon as the render kicks off, some of the 8 machines will crash within 10-15 minutes. If I wait long enough (48 hours), the rest of the machines will crash as well. None of the machines are stable, unlike the Intel i7 4770Ks they replaced, which would run for weeks at 100% CPU happily.
When I say "crash", I mean the screen goes black, with no image on screen. The keyboard Num Lock doesn't respond. Even the power and reset buttons don't work; I have to physically remove the power cord and plug it back in. There is no bluescreen or anything. It just stops working. The event log simply shows that the previous shutdown at <time> was unexpected.
Things I have tried so far:
- Latest drivers, BIOS and firmware versions for the SSD
- Reinstalled Windows & drivers
- Run sfc and dism to check that the system files aren't damaged
- Checked temperatures to ensure the CPU/GPU aren't overheating (they aren't).
- Run synthetic benchmarks like IntelBurnTest and Memtest86.
- Physically relocated the affected machines to another office to see if this is an environmental issue (didn't help).
During synthetic benchmarking the system will sometimes crash, other times remain stable for 48+ hours.
What is going on here? All 8 machines are doing this.
liamretrams, I need to think on this a little, but at first blush, the common factor is the power supply. I have a 2990WX and have a 1250 Watt supply. How much memory do you have? I would suggest to get at least a 1000 Watt PS for one machine and see if it stops. Please post a screenshot of Ryzen Master (RM). Here are my specifications:
MSI X399 Creation, Threadripper 2990WX, 3x Samsung SSD 970 EVO RAID0, 4x SSD 960 EVO on
MSI AeroXpander RAID10, 1TB & 500GB WD Black, G.SKILL Flare X F4-3200C14Q-32GFX,
Windows 10 x64 Pro, Enermax MaxTytan EDT1250EWT, Enermax Liqtech TR4 280 CPU Cooler,
Radeon RX580, Aquantia 10 Gbps Ethernet NIC, UEFI E7B92AMS.120,
AGESA SummitPI-SP3r2-1.1.0.2.
Thanks and enjoy, John.
EDIT: 850 Watts is not nearly enough power and 1000 Watts may also not cut it but should at least reduce it. After you get one system stable, then consider what to do about all the others. I just now noticed you are running 64 GB memory. You said 4x64GB but I suspect you mean 4x16GB. If it were 8x8GB, I would suggest you remove half and see if that helps the hangs and does not spoil performance too much. What speed are you really running the memory, probably not 3000 MHz, I suspect. Does the cooler cold block cover all four chips in the module completely? Be sure to post a screenshot of RM. How much power does the video card use?
Yes, I typoed. It is 4 x 16GB (model # is F4-3000-C16-16GISB), for a total of 64GB. The 4 modules are in the D2, B2, A2, B2 slots as recommended by the board. The current RAM clock speed is set to "Auto", showing 1067 in Ryzen Master, so DDR4-2133 I guess? Should I manually bump this up to DDR4-3000 in the BIOS?
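On the 1067 vs. 2133 question: DDR memory transfers data on both edges of the clock, so the effective rate is twice the clock that Ryzen Master reports. A minimal Python sketch of that arithmetic follows. The function name and the list of speed grades are illustrative choices, not anything taken from Ryzen Master or the BIOS.

```python
# Common DDR4 speed grades in MT/s (illustrative list, not exhaustive).
SPEED_GRADES = [2133, 2400, 2666, 2933, 3000, 3200]


def nearest_speed_grade(memory_clock_mhz: float) -> int:
    """Convert a reported memory clock (MHz) to the nearest DDR4 speed grade."""
    # Double data rate: two transfers per clock cycle.
    transfer_rate = memory_clock_mhz * 2
    return min(SPEED_GRADES, key=lambda grade: abs(grade - transfer_rate))


print(nearest_speed_grade(1067))  # 2133
print(nearest_speed_grade(1500))  # 3000
```

So a reading of 1067 does correspond to DDR4-2133, and a DDR4-3000 profile would show up as roughly 1500.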
Since all the processing work in this machine is done by the CPU (this is a compute box, not a gaming machine), the GPU is only a GT1030 - maximum of 50W. The motherboard has both CPU_PWR1 connectors plugged in. There is no cooling for the four chips (I presume you mean memory modules). I did try aiming a desk fan in there to see if it helps, but it didn't. The crashes don't appear to be load related. Sometimes, the crash will happen when the system is at idle, other times it will happen when the system is running a render job. But the most reliable way to make it crash seems to be to start a render job. Of the 8 machines, some will crash in as little as 10 minutes. Other times I've seen them go for up to 48 hours.
Finally, on closer inspection, I have noticed that the "PCIE_PWR1" connector on my motherboard is not connected. I see your motherboard also has one. Do you have it hooked up? Could this potentially be the issue?
The bottom line is none of my machines are stable unfortunately. I expect them to run for weeks at 100% load (as our Intel i9s do).
"Finally, on closer inspection, I have noticed that the "PCIE_PWR1" connector on my motherboard is not connected. I see your motherboard also has one. Do you have it hooked up? Could this potentially be the issue?"
Absolutely! You are talking about this right? BOTH need to be connected to a power supply with the appropriate supply outlets.
No, those are the CPU_PWR1 and CPU_PWR2. Both of those are connected on mine.
I am talking about the PCIE_PWR1, that is immediately to the left of the left most RAM slot and directly above the highest PCI-e socket.
Got it, looking at it now, you don't need it because you don't overclock. I use a MSI MEG Creation X399 so my board layout is different from yours.
Is your one hooked up? I see your board has this connector too, in addition to the CPU_PWR1 and CPU_PWR2 as well. Only, yours is on the bottom left.
No, I don't think you need to hook yours up.
Inspecting that picture I see a pair of EPS12V which my AX860i can cover. There is also a 6-pin PCI Express connector near the slot that probably needs more power.
The GT 1030 is not a demanding video card, so my AX860i can handle this hardware easily.
Install an RX Vega 64 and my AX860i has the extra cables for that card and is still not even breaking a sweat.
liamretrams, I tried to be subtle about it but I am almost sure you need a bigger power supply. Please get one PS of at least 1000 Watts. I have seen this power shortage more than once. Is your memory on the QVL list? Please leave it at SPD speed. When this is settled, you can try an XMP if you like. My PCIe_PWR1 is plugged in but it is really not needed. It should be used for multiple GPU setups. All of the CPU power connectors should be fully populated: 2x8 pin, not anything less. I was not talking about a cooler for the memory but for the CPU. When I selected my CPU cooler, I was careful to select one that specifically said it covered all four chips in the processor module. Older coolers (prior to TR) do not and should be avoided. Please run CPU-Z and post a screenshot:
You can drag-n-drop the image here. I want to see your AGESA.
I strongly recommend a stronger PS. Thanks and enjoy, John.
Hi, I posted my CPU-Z info already. I will see about a larger PSU and QVL RAM. I'll attach it again to the end of this post.
I distinctly remember when this happened, we went back to our hardware vendor. They gave me some QVL ram, but it did NOT address the crashing issue.
The problem is the fault is difficult to reproduce. Sometimes it will crash in 5 minutes just sitting on desktop, other times it will run for 48 hours without issue, and everything in between.
What is your CPU setup? Stock? Overclocked? Try disabling the dynamic local mode if you are using ryzen master.
Also does it crash using distributed rendering or back burner jobs. Most of the time crashes happen to me because of using so many plugins. Try rendering a scene with the least plugins, like vray only and then add little by little.
100% stock, we do NOT overclock. I found overclocking causes a lot of grief in the long term and stability is more important when you're working with deadlines.
Regarding when it crashes - sometimes it can render 2-3 days on and be fine. Other times it will crash 10 minutes into a render job. Seeing this behaviour across all machines.
The problem is general instability, not specifically any one thing that I can do that will make it crash.
Also, checking your memory modules' compatibility, I cannot find a QVL listing from either G.Skill or MSI for your mobo + RAM combo. Preferably at least one (or both) of the QVL lists should show approval.
MSI QVL - Support For X399 SLI PLUS (MSI Global)
GSkill - https://www.gskill.com/en/product/f4-3000c16s-16gisb
I am not saying that this is the be-all and end-all cause. Sometimes you can use non-QVL combos, but usually it is better to use what is tested to work together by at least ONE of the manufacturers (mobo or RAM manufacturer).
Trying to figure this out: do you use DR with most of the machines as render nodes, or are all of these workstations? I am just trying to figure out what causes your crashes, as sometimes it can be properties of the scene itself. If they crash in a variety of scenes, then it is harder to troubleshoot, but if some of the scenes have something in common, it can be that. As an example, there was a time our machines kept crashing when the RAM consumption went above 40GB, and we found out that one of the RAM modules was faulty.
I have a feeling it is your RAM.
It's not only 3D Studio Max/Corona/V-Ray that's the problem; sometimes it will crash just sitting on the desktop. Yesterday, I rebooted the machine, signed in (before starting any renders or anything), and then stepped away to make a coffee. When I came back, the screen was in standby mode. I couldn't get any display from the computer at all. I tried pressing the reset button: nothing. I tried holding down the power button to get it to power off: nothing. I had to physically remove the power cord and plug it back in.
And this happens randomly on ALL 8 machines with the 2990WX? I assume all 8 are configured exactly the same (cpu, RAM, mobo, PSU, gpu, etc)?
Correct. They're all exactly identical in every way. Spec in my original post
Honestly, my suggestion would be this. It will either be the RAM, mobo, SSD, or GPU. Since all machines are identical, I would do the following to one machine first. List below is in the order that I would test it.
1. Buy/borrow 32 GB (or 64) of RAM that is approved in the QVL list of either manufacturer for the RAM-mobo combo. Make sure RAM is placed in correct slots since you are not filling up all 8 dimm slots.
2. Buy/borrow a cheap but reputable 256GB SSD (Samsung EVO) and install a fresh copy of windows.
3. Borrow a different GPU from someone and test that machine out with it.
4. Maybe try a new PSU, but to be honest 850W is enough imho since you do not OC at all and your GPU is not power hungry. You don't even have any mechanical HDs. But this is worth a shot.
The key is to change ONLY 1 parameter at a time and see if anything changes. I would change the mobo as the last resort, as that is the most expensive and labor-intensive thing to change. I doubt all your CPUs are defective, as the CPU is probably the one component that gets tested most rigorously by the manufacturer for quality control.
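To make the "one parameter at a time" idea concrete, here is a rough Python sketch that builds the trial configurations from the list above. The baseline values are taken from this thread where they were stated (RAM model, GT 1030, 850 W PSU); the SSD entry and all replacement descriptions are placeholders, and the names are made up for illustration.

```python
# Baseline build; the SSD entry is a placeholder since no model was given.
BASELINE = {
    "ram": "4 x 16GB F4-3000-C16-16GISB",
    "ssd": "current SSD",
    "gpu": "GT 1030",
    "psu": "850 W",
}

# Swaps in the order suggested above; replacement names are placeholders.
SWAPS = [
    ("ram", "QVL-listed kit"),
    ("ssd", "fresh 256GB SSD with clean Windows install"),
    ("gpu", "borrowed GPU"),
    ("psu", "larger PSU"),
]


def one_change_at_a_time(baseline, swaps):
    """Yield (component, trial) pairs that differ from baseline in one part."""
    for component, replacement in swaps:
        trial = dict(baseline)  # copy so the baseline itself never changes
        trial[component] = replacement
        yield component, trial


for component, trial in one_change_at_a_time(BASELINE, SWAPS):
    print(f"Trial with new {component}: {trial}")
```

Each trial goes back to the baseline before the next swap, so if a crash disappears you know which single part was responsible.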
angryphoton, I am surprised you are not questioning the power supply as I am. Do you have a 2990WX? What power supply do you use, and what processor if not a 2990WX? I have an 850 Watt supply on my 1950X. I do tend to be a little strong on power and CPU cooling. I am not familiar with the specific memory but have run G.Skill for some time with great results. In particular, sticks with Samsung B-Dies (like mine) seem to agree with Ryzen and vice versa. The RAM is plugged in correctly. 32 GB will relieve the load on the PS and may make the system run. Thanks and enjoy, John.
I do have a 2990WX, running on a 1000 watt PSU, but I also have a 1080 Ti, have raised my power limit to 300 watts (vs. the default 250 W), and have 2 NVMe drives, 1 mechanical HD, 5 fans and an AIO.
When I was tweaking my OC, I monitored the wattage I was pulling from the wall socket, and it maxed out at 460 watts on average using a real-world V-Ray scene render that lasts 1 hour with my OC settings. I could use a higher OC setting, but my cooling wasn't up to snuff and the power draw was getting too high (550 watts at a 350 watt power ceiling setting). I just settled for a mild 300 watt OC, and I pull in around 5300 in Cinebench. To be honest though, the OC does not affect my real-world renders that much; it just shaved about 3 minutes off a ~60 minute render.
Anyway, the PSU can definitely be a cause, but liamretrams said that his machine would hang/crash doing nothing, just sitting on the desktop. That makes it less likely to be a PSU or cooling problem, because his machine wasn't even under load when it crashed. It seems more like a hardware incompatibility problem or a setup problem with the OS. Not all G.Skill RAM is agreeable to Threadripper, especially the 32-core one.
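For a rough sense of headroom, the wall measurement above can be turned into a load fraction for a given PSU rating. This is only a back-of-the-envelope Python sketch: the 90% efficiency figure is an assumption (the real value depends on the unit and the load), and the 460 W reading comes from my overclocked system with a 1080 Ti, not from the stock GT 1030 machines.

```python
def psu_load_fraction(wall_watts: float, psu_rated_watts: float,
                      efficiency: float = 0.90) -> float:
    """Estimate how much of a PSU's rated output is in use.

    Wall draw includes conversion losses, so the DC load seen by the
    PSU rating is wall draw times efficiency (assumed, not measured).
    """
    dc_watts = wall_watts * efficiency
    return dc_watts / psu_rated_watts


for rated in (850, 1000):
    load = psu_load_fraction(460, rated)
    print(f"{rated} W PSU at 460 W from the wall: {load:.0%} of rated output")
```

Under these assumptions even the 850 W unit would be running at around half its rating, which is why I lean away from the PSU as the primary cause.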
Here is my system:
AMD 2990WX
MSI Meg Creation X399
GSkill F4-2933C16Q2-128GTZRX (this is a solid 128 GB kit) - runs at 2933 MHz
EVGA 1080Ti Black Edition
Samsung EVO 970 500GB nvme
EVGA P2 1000 watt Platinum PSU
Here is my RAM:
As you can see, my motherboard (and processor) is in the QVL list of G.Skill. Strangely enough though, this RAM is not in the MSI MEG QVL list, but as I said, it should be in at least one of the QVL lists.
Getting 2 X 32 GB kits (where the 32 GB kit is QVL listed) does not necessarily mean it will work when you use 2 kits together to form 64GB total. From what I have heard, the 2990wx is very finicky with RAM.
I don't think it's the SSD. I've tried adata, Intel, Samsung. I've tried SATA port 1 and SATA port 8. Didn't help.
If you have the same RAM installed on all your computers, it might not be compatible according to the MSI support QVL for the Ryzen 2xxx WX series. It only shows two G.Skill kits: Support For X399 SLI PLUS (MSI Global)
Ryzen CPUs are pretty sensitive to the type of RAM installed, which is why you should follow the QVL list from your motherboard's support page. The RAM you have is sold as dual-channel modules: https://www.gskill.com/en/product/f4-3000c16d-16gisb
The F4-2400 RAM (1.20 vdc) is Dual Channel and the F4-3200(1.35 vdc) is Quad Channel.
According to the MSI support QVL for storage, there is only one SanDisk SATA drive listed: Support For X399 SLI PLUS (MSI Global)
All this is just for your information only in case you weren't aware of it.
Your problem is probably to do with the CPU using all 32 cores. Found this website that is having freezes while rendering using Chaos software: https://forums.chaosgroup.com/forum/v-ray-for-3ds-max-forums/v-ray-for-3ds-max-problems/1016182-amd-...
Try using fewer CPU cores when you are rendering and see if it continues to crash. In the Chaos renderer, for example, you can set how many cores you want the CPU to use.
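If a program has no core-count setting of its own, Windows can also restrict it at launch with the `start /affinity` option of cmd, which takes a hexadecimal bitmask of logical processors. Below is a small Python sketch that only computes that mask. It assumes all logical processors sit in a single Windows processor group (64 at most) and that taking the first N logical processors is acceptable; how those map onto physical cores and dies is not something this sketch knows. The executable name is a placeholder.

```python
def affinity_mask(logical_processors: int) -> str:
    """Hex bitmask selecting logical processors 0..n-1."""
    if not 1 <= logical_processors <= 64:
        raise ValueError("expected between 1 and 64 logical processors")
    # One bit per logical processor, lowest bit = processor 0.
    return format((1 << logical_processors) - 1, "X")


mask = affinity_mask(32)
# your_renderer.exe is a placeholder, not a real program name.
print(f'start "" /affinity {mask} your_renderer.exe')
```

Running a render restricted to half the logical processors and comparing stability against an unrestricted run is one way to test the core-count theory without touching the BIOS.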
From what I have read, if you set in Ryzen Master to Legacy mode it will automatically disable a certain amount of CPU cores to make it more compatible with older games or programs.
Reducing the number of cores also reduces the amount of power used, so if that stops the crashes it could indicate a PSU that is not strong enough to power all 32 cores under heavy load. This website shows the power consumption while overclocked and at stock: AMD 2nd Generation Ryzen Threadripper 2990X 32 Core / 64 Thread CPU Review Ft. ASRock X399 Professio...
This website explains about Dynamic Local Mode effects on the Ryzen 2990WX and others: https://magazine.renderosity.com/article/4806/previewing-dynamic-local-mode-for-the-amd-ryzen-thread...
If possible you can use a less powerful Ryzen with 24 cores and see if it crashes also.
I looked at this and it definitely seems interesting. However I have been using, for about 2 years, a dual Xeon system with 36 cores total (72 threads) and I did not experience these problems, neither did I experience them with the 2990WX (but this is a new system).
liamretrams and angryphoton, it appears that my power supply concerns are going unheeded. That is fine, I only give my opinion. Both of you please do me a favor and provide some screenshots. Please download Thaiphoon Burner free version (read only), run it and post a screen shot. I do not post a link because doing so will cause moderation delay of a few minutes to a few days. Also please provide a RM screenshot under heavy render loads. Are either of you doing any OCing? I guess angryphoton is running an XMP to get 2933 MHz - or manual OC? Have either ever done a Clear CMOS? Thanks and enjoy, John.
OK, liamretrams, new tack. I can find your memory nowhere except Newegg. I suspect it is not quad channel but may be dual channel. It is not on the MSI QVL list and not even listed by G.Skill. Please buy at least one quad-channel kit to test. I suggest F4-2933C14Q-64GFX (Flare-X). There is also a C16Q which will be much cheaper. I am trying to determine if either are Samsung B-Dies. Another good choice is the RAM that angryphoton is running (is it B-Dies?). Enjoy, John.
EDIT: I called G.Skill and asked about F4-2933C14Q-64GFX being Samsung B-Dies. I was told that they should be. Not sure what this means.
Like this?
Yes, angryphoton, thanks much. You have Samsung B-dies. This memory should work well in liamretrams' machines, and it has a good choice of XMPs. I am looking forward to seeing your RM screenshot(s). Thanks and enjoy, John.
Here's my screenshot of Thaiphoon Burner. Ryzen Master screenshot forthcoming.
RAM speeds in the BIOS are set to "Auto", it has settled on 1066 (2133).
And here's a screenshot of RM while rendering a scene. I'm using 3DS Max/Corona.
This has been going since yesterday, so nearly 18 hours now. As I mentioned, the crashing is very inconsistent. Sometimes it can't make it 18 minutes before the screen goes black. Other times, it goes for a day or two without issue.
This current thread may be useful, in that his computer was automatically restarting with an Nvidia RTX 2080 card installed. The OP said that when he disabled half the cores in the 2990WX CPU, the computer stopped crashing or restarting: Threadripper 2990WX restarts computer
This the comment he made about his "Fix":
I do not think it's a power supply problem. I have two power supplies one
1600 and the other one 800. I put the extra one in when I had a problem
with restarting.
I think I've diagnosed the issue the 32 core CPU is incompatible with the
CUDA Core on the RTX 2080 I knocked down to 16 core compatibility mode it
works fine. even if I only put one RTX 2080 with 32 cores Instantly
restarts the computer. it only happens with applications that run rendering
on the RTX 2080 Cuda even if I run Blender in Linux it's still restart
The thing is sometimes mine will crash when it's on the desktop with nothing happening. But it definitely crashes far more often under load.
I've disabled 50% of the cores in RM (see screenshot) using the "1/2 legacy compatibility mode", and the render has been running for 1 hour now, none of the 8 machines have crashed.
Trawling the CGSociety, V-Ray and Chaos Group forums, I am finding that there are a TON of people with similar issues, and the issue isn't isolated to the Ryzen 2990WX. It seems more tied to how NUMA nodes work. People with more than 1 socket keep coming up in my searches. While Ryzen doesn't have more than 1 socket, the effective setup is similar: multiple NUMA nodes and a significantly higher core count.
I will report back in 24 hours.
What applications are you running? Are they all CPU-intensive? I have a 2990WX with 64GB, and I have run Maxwell Render, Modo and 3D Studio Max with incredibly large data sets and have had no problems at all. I ran Modo at a peak of 98% CPU usage and 95% memory usage for four straight days with no problem. It is only when applications start using the GPU that everything goes south.
The only difference I see is that I have 8 x 8GB DDR4-3000 quad-channel memory, not 4 x 16GB.
liamretrams, this is sounding more like a software problem to me, especially after james767's post. I almost hate to go there, but have any of you looked into Coreprio? It deals with UMA/NUMA matters. I had been using it to boost benchmark performance, but the latest version hung my system, so I removed it. Please take a look at this thread. Enjoy, John.
Yes - we’ve tried Coreprio. No improvement.
I can’t tell either way if it’s a software or hardware fault. After reading the forums, it seems like a little of column a and a little of column b
liamretrams, thanks. I do hope you read my thread. I got about a 65% improvement with Indigo Supercar. There is a response to my hangs from the author who reports that Coreprio improvement only occurs about half the time and nobody knows why. Enjoy, John.
elstaci wrote:
This current thread may be useful, in that his computer was automatically restarting with an Nvidia RTX 2080 card installed. The OP said that when he disabled half the cores in the 2990WX CPU, the computer stopped crashing or restarting: Threadripper 2990WX restarts computer
This the comment he made about his "Fix":
I do not think it's a power supply problem. I have two power supplies one
1600 and the other one 800. I put the extra one in when I had a problem
with restarting.
I think I've diagnosed the issue the 32 core CPU is incompatible with the
CUDA Core on the RTX 2080 I knocked down to 16 core compatibility mode it
works fine. even if I only put one RTX 2080 with 32 cores Instantly
restarts the computer. it only happens with applications that run rendering
on the RTX 2080 Cuda even if I run Blender in Linux it's still restart
Check for a new BIOS for your motherboard.
This is a known fault with the Nvidia Turing cards on dual-CPU motherboards with lots of cores too.
liamretrams, thanks much for the screenshots. Your memory chips are the less attractive Hynix. That is why it is a little slower and much cheaper. The default speed is the SPD speed. Almost all memory defaults like this. You notice that PPT and TDC are nearing the maximum. If you lifted the limit (PBO required) then your system will run faster but get hotter and demand more power. I do not think you have the power to give and do not recommend it. Your maximum temperature is 68C, so some room there. It might be interesting to lower PPT and TDC which will reduce the temperature and power use and see if hangs are reduced. I have not given up on my power supply theory, but do think the main hang is caused by the memory now that I look. I hope you will get one quad kit of really good memory and see if that is a big help. Thanks and enjoy, John.