I've got 8 machines with this spec:
These machines are used for rendering 3D visualization simulations, which is a CPU-heavy task. The software we use is 3ds Max with a range of plugins and add-ons (V-Ray, Corona, etc.).
Unfortunately, as soon as the render kicks off, some of the 8 machines will crash within 10-15 minutes. If I wait long enough (48 hours), the rest of the machines will crash as well. None of the machines are stable, unlike the Intel i7 4770Ks they replaced - those would happily run for weeks at 100% CPU.
When I say "crash", I mean the screen goes black, no image to screen. The keyboard Num Lock doesn't respond. Even the power and reset buttons don't work - I have to physically remove the power cord and plug it back in. There is no bluescreen or anything; it just stops working. The event log simply shows that the previous shutdown at <time> was unexpected.
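For anyone following along: a black-screen hard hang like this usually leaves Event ID 41 (Kernel-Power) or 6008 entries behind. If you export the System log to text, a small filter makes the crash timeline easy to see across 8 machines. This is an illustrative sketch only - the comma-separated export format and column order are assumptions, so adjust the parsing to match your actual export.

```python
# Sketch: scan an exported Windows System event log for hard-crash markers.
# Event ID 41 (Kernel-Power) and 6008 are what Windows records after a
# power loss / unclean shutdown. The "timestamp,event_id,source,message"
# layout below is an assumed export format, not a fixed Windows format.

UNEXPECTED_SHUTDOWN_IDS = {"41", "6008"}

def find_unexpected_shutdowns(log_lines):
    """Return the lines whose event-ID column matches a hard-crash event."""
    hits = []
    for line in log_lines:
        fields = line.split(",")
        if len(fields) >= 2 and fields[1].strip() in UNEXPECTED_SHUTDOWN_IDS:
            hits.append(line)
    return hits

sample = [
    "2019-02-01 03:12:44,41,Kernel-Power,The system has rebooted without cleanly shutting down first",
    "2019-02-01 09:00:02,7001,Winlogon,User logon notification",
    "2019-02-02 01:55:10,6008,EventLog,The previous system shutdown was unexpected",
]
for hit in find_unexpected_shutdowns(sample):
    print(hit)
```

Collecting the timestamps of these events from all 8 boxes can also reveal whether the crashes cluster (pointing at something shared, like a power circuit) or are random per machine.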
Things I have tried so far:
- Latest drivers, BIOSes, and SSD firmware versions
- Reinstalled Windows and drivers
- Ran sfc and DISM to check that the system files aren't damaged
- Checked temperatures to ensure the CPU/GPU aren't overheating (they aren't)
- Ran synthetic benchmarks like IntelBurnTest and Memtest86
- Physically relocated the affected machines to another office to rule out an environmental issue (didn't help)
During synthetic benchmarking the system will sometimes crash, other times remain stable for 48+ hours.
What is going on here? All 8 machines are doing this.
Hey liamretrams, sorry to go a little off-topic, but how did you manage to get to 3.4 GHz while rendering (in the screenshot you posted from RM) without OC (you stated you don't overclock)? I'm a bit curious because out of the box my TR 2990WX only gets up to 3 GHz, and with EZ System Tuning set to Normal I can get to 3.2 GHz, but that's it.
- Corsair H115i RGB Platinum 280mm AIO - custom fan curves
- Case is a Phanteks Enthoo Pro TG (tempered glass, full tower).
- 2 x Noctua NF-A14 iPPC-2000 PWM 140mm for front intake (custom curve)
- 1 x Corsair AF140 LED Low Noise 1400 RPM fan for rear exhaust (at 40% power)
- 1 x Scythe Slip Stream Slim 120mm case fan for bottom intake (at 50% power)
- Removed: 2 x 140mm Phanteks fans that came with the case
The case is super silent at idle and gets a bit audible when fully rendering. I'm still tweaking the fan curves to get the best cooling while rendering, but I'm getting pretty close. V-Ray rendering tops out at 60°C depending on room temperature (it's cold in the room right now). This is with the mild overclock.
Ok. 24 hours in - only 1 of 8 machines has crashed. Far from an ideal result, but a massive improvement nonetheless. I'm trying to acquire the F4-2400C15D-32GFXR, which is on the QVL. The F4-2933C14Q-64GFX that misterj suggested is not available in my region, and neither is the F4-2933C16-16G that angryphoton has.
I can, however, get the ADATA AD4U2666316G19-B, which is also on the QVL. It is a single module, but it looks like MSI has tested it in a quad-module config. 2400 MHz supported speed with Nanya NT5AD1024M8A3-HR chips.
The ADATA is readily available. Shall I just pick up a few of those?
Not trying to make your life harder, but if possible, stick with Samsung B-die, as people have reported much better compatibility with those. Also try looking at the G.Skill QVL lists and see if any of their RAM (that is available to you) is compatible with your motherboard.
liamretrams, here is a link (moved) to a list of B-die sticks. It is in German, and here is all the German I know: ja = yes and nein = no.
What have you changed to get the 24-hour results? Whatever you do, I suggest you try a single quad kit for testing. See if you can get the F4-2400C15Q-64GFXR (only USD 350 on Newegg, USD 500 on Amazon). You really want a quad kit, not two dual kits. Enjoy, John.
EDIT: Moved link to next post to avoid moderation delay.
Thanks, liamretrams. I commented above and am waiting for my moderated link to clear for you. One thing for sure, half the cores did reduce your power supply load a lot. Did it double the time to complete the task? Please post a screenshot of RM. Thanks and enjoy, John.
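John's question above - whether disabling half the cores doubled the render time - can be sanity-checked with Amdahl's law. Rendering is close to embarrassingly parallel, so the slowdown should land just under 2x; the serial fraction below is an assumed illustrative value, not a measurement from these machines.

```python
# Back-of-the-envelope estimate: render time at N cores relative to a
# 32-core baseline, using Amdahl's law. serial_fraction is an assumed
# placeholder for the non-parallel part of the job (scene load, bucket
# scheduling, I/O); real renders would need to be measured.

def render_time(cores, serial_fraction=0.02, base_time_32c=1.0):
    """Estimated render time (in units of the 32-core run time)."""
    # serial part stays fixed; parallel part scales with 32/cores
    work = serial_fraction + (1 - serial_fraction) * (32 / cores)
    return base_time_32c * work

print(round(render_time(32), 3))  # 1.0  (baseline)
print(round(render_time(16), 3))  # 1.98 (just under double)
```

So if the 16-core runs take roughly twice as long, that is consistent with the workload being almost fully parallel, and the stability gain really is coming from running fewer cores, not from a lighter job.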
It has been four or five days now with half of the CPU cores disabled. Have the rest of the seven computers crashed since then?
I know you mentioned one crashed after about a day of rendering. Just wondering how many of the other seven have crashed since then.
If the other seven haven't crashed, maybe you can enable another 8 cores on the CPU and see if it crashes again. With 24 cores it should be faster in rendering time.
With half the CPU cores disabled, only 1 of the 8 machines crashed. With all cores enabled, the machines still crash within 1-12 hours.
I tried running memtest on 2 sticks at a time (I have 4 sticks total), then memtest on all 4 sticks. Ran intelburntest. No improvement. Still crashes unpredictably.
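When testing sticks two at a time, it's easy to lose track of which pairings have actually been covered. With 4 sticks there are 6 distinct pairs (more if you also rotate slots), which a quick enumeration lays out - the stick labels here are arbitrary placeholders.

```python
# Plan per-stick memtest passes: 4 sticks tested two at a time gives
# C(4, 2) = 6 pairings. Covering all of them lets you isolate a single
# bad stick: a faulty one would fail in every pair it appears in.
from itertools import combinations

sticks = ["A", "B", "C", "D"]
pairs = list(combinations(sticks, 2))
for p in pairs:
    print(p)
print(len(pairs), "pairings to test")
```

If every pairing passes memtest but the full 4-stick config still crashes, that points away from an individual bad module and toward a population/compatibility issue, which is where this thread ends up.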
I've now acquired an RX 580 and plugged it into the system (PCIe slot 1). The system is rendering; I will report back in 24 hours.
liamretrams, please, unless you are dead set on not buying a larger power supply: get at least a 1000 W supply, install it in the system that continues to fail on 16 cores, enable all 32 cores, and try a render. Whether it works or not, also try it on a system that runs fine on 16 cores but fails on 32.
Some more comments: you can simply drag and drop images into your replies, no external links needed. Since your PPT is at 100% and causing throttling, you could run a little faster (until your CPU temperature causes throttling) by enabling PBO and raising the PPT limit. A better CPU cooler would be needed to go even faster. I would suggest you not even try these experiments without a 1000 W supply. This is all up to you, of course. Thanks and enjoy, John.
My CineBench settings:
The OP does not need a bigger PSU; the one he has now is more than enough.
More likely, a new BIOS for his motherboard and better rendering software are what's called for.
As I said in my original post, my BIOS is up to date.
Changing rendering software isn’t an option.
Anyway - as I mentioned yesterday I’ve got a RX580 hooked up now and all cores enabled.
The render has been running for 16 hours. That's a pretty long time, but too soon to call it a success.
I'm going to restart the box at exactly 24 hours and let it run another 24 hours. If it runs successfully, I'll consider getting AMD GPUs for the rest of my farm. At least one other person has had issues with Nvidia cards in this build.
With regard to GPUs, so far I have tried:
- GT 710 - crashes
- GT 1030 - crashes
- RX 580 - too soon to tell, but looks promising
A huge thanks to everyone who is lending their opinion.
Update us on how this goes. I'm really curious about this GPU thing, as I've never seen a GPU cause crashes in rendering software, especially since you are not using GPU-based rendering. It could still be possible though, since you crash even sitting idle on the desktop, so it looks more like an overall system stability issue. Do check the RAM though, as I think that is still a big possibility.
It doesn't only crash under load; it will sometimes just crash sitting on the desktop. This is why I don't think it's a load (or power) related issue, but some sort of idiosyncratic compatibility problem that manifests under certain circumstances. That is why I'm chasing this down first. Sometimes troubleshooting is about chasing down the easy stuff first: GPU and RAM are very easy, PSU comes last. If this doesn't pan out, I will go down that route.
The PSU is on my list of things to test, and I have a plan to follow. Since the machines sometimes crash at idle while sitting on the desktop, I don't think it's a load or power related issue.
MSI motherboard support themselves told me to try a different GPU - so I'll try that, and if that doesn't pan out I'll run down the rest of my list (PSU, QVL RAM).
I know this is a Threadripper thread, but I'd like to share my similar experience building with a Ryzen 1700, 1700X, 1600X, and 2700. My most problematic build was the 1600X: it suffered similar crashes at idle and under load - just when you're about to call it a day, the mouse freezes, or it dies on save. The RAM was a QVL-listed G.Skill FlareX 2400 DDR4 2x8 GB kit for Ryzen, with a WD Black NVMe boot disk, Sapphire Nitro R7 370 4GB GDDR5, 500 VA Eaton UPS, and Thermaltake Toughpower Gold 550W on an ASRock AB350M Pro4. It was stable for the first 3 weeks, then started to exhibit faults similar to yours; the event viewer listed unexpected shutdowns and, much later, a power fault. I felt the VRM was hotter than expected, so I reported it to the supplier and returned the motherboard for testing. I had to buy a similar motherboard in the meantime, as I needed it for work. All went well for the next 4.5 months, and I ended up with an extra new motherboard once the previous one was accepted for RMA. The office decided to pay for the PC, so it was off my hands, and I got laid off when the production office was shut down and its duties moved to sales. The PC went back to its cycle of freezing at idle on the day I left.
With a spare motherboard on hand, I built another system with a Ryzen 2700, ADATA XPG 256 GB NVMe, ADATA XPG 2x8 GB (16 GB) DDR4 kit, Radeon Pro WX 3100, and Antec EAG 550W EarthWatts Gold Pro. It is rock stable, as is my other Ryzen 1700X build with an Antec True Power Classic Gold 550W, G.Skill Ripjaws V 3000 2x16 GB DDR4 kit, Intel 256 GB 600p NVMe, and Radeon Pro WX 4100 on an ASRock AB350 Pro4 motherboard. I run IRONCAD with model rendering in KeyShot, with KeyShot online videos running in Firefox for hours and Windows Live Mail also open.
My observation: Ryzen is finicky about PSU power delivery.
I work for multiple clients and service a range of boxes. I have a ton of Ryzen 1600, 1700, 2700, etc. builds out there, and none of them have any issues with crashing. This certainly isn't my first rodeo, but it is proving to be one of the more difficult ones.
I have researched reviews of your FSP Hydro G 850W and it is highly rated. TechPowerUp found something unusual, though.
The hardest part is that these are 8 identical, costly setups, yet you have built other setups and never had problems. I assume you used the same brand and class of PSU on the others that worked well. I look forward to a possible explanation and resolution for your problem being posted here.
Did you also use the same class of G.Skill Aegis memory (F4-3000-C16-16GISB, 64GB as 4 x 16GB modules), in lower capacity, on the other builds? I rely on OCCT PT to test stability: run it for 5 minutes, and if it survives the memory and PSU tests it is good to go. Apologies for asking.
If it were a PSU issue, it would continue to crash intermittently all the time. But since the OP disabled half the CPU's cores, he hasn't had any more crashes except on one computer out of eight. This indicates either a hardware or software incompatibility with a fully functioning 32-core CPU.
I, myself, am a convert to OCCT. It is the only diagnostic program that tests PSUs and catches errors in GPUs.
I was wondering myself how much each of his computers is worth. My guesstimate is between $3000 and $4000 each; the CPU alone costs around $1700. The OP seems to have installed high-quality hardware.
Except maybe the GPUs. The GT 1030 is a fairly low-powered GPU card (300 W minimum PSU).
I have Avid Media Composer, so I know the score for working with video. Avid MXF is the main feed, but it can work with other formats to some extent. When I am done with the editing and compositing, I can use HandBrake to make a video for the client to review.
Once it's approved, the MXF is then sent to the broadcaster.
GPU theory didn't pan out.
What has worked though is removing 50% of the RAM from the system. Systems originally had 4 x F4-3000-C16-16GISB for a total of 64GB. These are not on the QVL.
With only 2 modules installed, the system has completed 48 hours of rendering without crashing (with a reboot purposely triggered by me every 24 hours). Going for 72 now just to make 100% sure.
I'll try some different modules and see how it goes.
Try using this:
xxx.gskill.com/en/configurator?manu=55&chip=3423&model=3435 - replace xxx with www
Even though the MSI QVL might not show the RAM, G.Skill has it listed, so it is worth a shot. Note: do not buy two 32 GB kits to make 64 GB; buy a single 64 GB kit.
Is that with all 32 cores enabled?
Seems like you may have found the answer: having 4 memory modules installed looks like it causes incompatibility issues. If it continues to work for the next 3-4 days with 32 cores enabled and just 2 RAM modules installed, that's a good indication that the motherboard or CPU has problems with four slots populated at the same time with the RAM you have, or that this particular RAM is simply not compatible with the motherboard in a 4-module configuration.
Other users on this forum have seen similar results in the past: when they have all the memory slots populated, the computer BSODs, but with just one or two RAM modules installed it works fine and is stable.
I think he only used 4 slots in his original 64GB setup (please correct me if I am mistaken). X399 boards have 8 DIMM slots, so he actually only used 4 and has now dropped down to 2 slots (that's how I understood it). I actually have all my DIMM slots populated, as I carry 128GB of RAM, with no issues so far except that I need a better CPU cooler.
I really think it boils down to memory compatibility with your motherboard. Also, the 2990WX puts "special" conditions on compatibility, as most of the QVL lists I see have separate lists for X399 and X399 (2990WX) setups.
That is correct. 8 slots total, System has (had) 4 x 16 to make a total of 64. System now has 2 x 16 to make a total of 32.
It crashes with 64. Stable with 32.
Edit: Even then, on some occasions it will run fine for 24 hours, and then the very next render you kick off will cause it to crash. I should note that the scene I'm testing with only requires about 18 GB of memory, so only about 22 GB of memory is in use while rendering.
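The memory numbers quoted above are worth spelling out, since they rule out simple RAM exhaustion as the cause. The 4 GB overhead figure below is just the difference between the quoted 22 GB in use and the 18 GB scene, not a measured value.

```python
# Sanity check on the quoted memory figures: an 18 GB scene plus ~4 GB of
# OS/application overhead stays well inside even the 2-stick 32 GB config,
# so the crashes with 4 sticks are not caused by running out of RAM.
scene_gb = 18
overhead_gb = 4                    # inferred from the quoted 22 GB total
total_in_use = scene_gb + overhead_gb

installed_2_sticks = 2 * 16        # stable configuration
installed_4_sticks = 4 * 16        # crashing configuration

print(total_in_use, "GB in use while rendering")
print(installed_2_sticks - total_in_use, "GB headroom with 2 sticks")
print(installed_4_sticks - total_in_use, "GB headroom with 4 sticks")
```

In other words, the 4-stick config crashes despite having far more headroom than the 2-stick config, which points at DIMM population / compatibility rather than capacity.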
Yes, all cores enabled, DLM off, SMT on. 48 hours, render stable.
Just for kicks, I enabled XMP 2 (with only 2 sticks / 32 GB total memory) and also PBO. Trying to do everything I can to make it 'unstable' and crash. Yet it seems stable. It appears the DIMM count is the real problem here.
The plan was to run 64GB for now and have the option available to upgrade to 128 later by getting more modules... this throws a bit of a spanner into the works.
This is why I bit the bullet and bought 128GB kit at the start. Dropping $1500 on RAM was quite painful though. However, I just tested an old "medium" sized scene I had and memory went all the way up to 48GB, made me a bit happier as now I can be more at ease with future large and XL scenes.
Try this RAM:
I think it can be upgraded to 128 GB later, as they also list a 128GB kit right below the 64GB version and the models look the same.