I've got 8 machines with this spec:
- ThreadRipper 2990WX + DeepCool 240mm CPU cooler (4 x hi-speed fans on the radiator)
- MSI X399 SLI Plus Motherboard (BIOS A.70, latest at time of writing)
- G.Skill Aegis 4 Memory 64GB (4 x 16GB Modules) - F4-3000-C16-16GISB
- nVidia GeForce 1030 GPU - Driver 184.108.40.2061 (Latest at time of writing)
- Sandisk SSDA480G SSD (480GB 2.5" SATA) - Latest Firmware
- FSP Hydro G 850W Power Supply
- Win 10 Pro x64 Build 1809
These machines are used for rendering 3D visualization simulations. This is a CPU-heavy task. The software we use is 3D Studio Max with a range of plugins and add-ons (V-Ray, Corona, etc.).
Unfortunately, as soon as the render kicks off, some of the 8 machines will crash within 10-15 minutes. If I wait long enough (48 hours), the rest of the machines will crash as well. None of the machines are stable, unlike the Intel i7 4770Ks they replaced - those would run for weeks at 100% CPU happily.
When I say "crash", I mean the screen goes black, with no image at all. The keyboard Num Lock doesn't respond. Even the power and reset buttons don't work - I have to physically remove the power cord and plug it back in. There is no bluescreen or anything. It just stops working. The event log simply shows that the previous shutdown at <time> was unexpected.
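For anyone wanting to pull those event log entries from each machine, here is a rough sketch in Python that wraps the built-in `wevtutil` tool. It assumes the "previous system shutdown ... was unexpected" record is Event ID 6008 in the System log; the function names and the default count of 20 are made up for illustration.

```python
import subprocess

# Assumption: the "previous system shutdown at <time> was unexpected" entry
# is Event ID 6008 in the Windows System log.
UNEXPECTED_SHUTDOWN_EVENT_ID = 6008


def build_query_command(event_id=UNEXPECTED_SHUTDOWN_EVENT_ID, max_events=20):
    """Return the wevtutil argument list for the newest matching events."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on the event ID
        "/f:text",                   # human-readable text output
        f"/c:{max_events}",          # limit the number of events returned
        "/rd:true",                  # newest events first
    ]


def fetch_unexpected_shutdowns(max_events=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_query_command(max_events=max_events),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `print(fetch_unexpected_shutdowns())` on each machine after a forced power cycle should list the most recent unexpected-shutdown records, which makes it easier to line up crash times across all 8 boxes.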
Things I have tried so far:
- Latest drivers, BIOS and firmware versions for the SSD
- Reinstalled windows & drivers
- Run sfc and dism to check that the system files aren't damaged
- Checked temperatures to ensure the CPU/GPU aren't overheating (they aren't).
- Run synthetic benchmarks like IntelBurnTest and Memtest86.
- Physically relocated the affected machines to another office to see if this is an environmental issue (didn't help).
During synthetic benchmarking the system will sometimes crash, other times remain stable for 48+ hours.
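Since the machines freeze hard and leave nothing useful in the logs, one way to measure how long each one survives under load is a small heartbeat logger. This is only a sketch: the log path, the 5 second interval and all function names are my own assumptions, not anything the render software provides. It appends a timestamp to a file and forces it to disk, so the last line shows when the machine was last alive.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path):
    """Append the current timestamp and force it onto the disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard freeze


def last_heartbeat(path):
    """Return the last timestamp written, or None if the file is missing or empty."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(lines[-1]) if lines else None


def run_heartbeat(path, interval_seconds=5.0, max_beats=None):
    """Write a heartbeat every interval_seconds; max_beats=None runs until killed."""
    beats = 0
    while max_beats is None or beats < max_beats:
        write_heartbeat(path)
        beats += 1
        if max_beats is None or beats < max_beats:
            time.sleep(interval_seconds)
    return beats
```

Starting `run_heartbeat("heartbeat.log")` (hypothetical file name) just before a render or benchmark, and reading `last_heartbeat("heartbeat.log")` after the power cycle, gives a time-to-crash figure per machine to within the chosen interval.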
What is going on here? All 8 machines are doing this.