We run several Linux systems with NVIDIA CUDA for molecular modeling.
Threadripper CPUs in particular are very effective and are our "best choice" in combination with recent NVIDIA GPUs (1080 to 2080 Ti).
Only one system has been producing errors constantly for months:
CPU: Threadripper 2990WX, board: ASUS Prime X399-A
FIRST PROBLEM with SATA controller (late 2018)
We install upgrades weekly and reboot the systems afterwards. This system could not reboot while hot: the SATA SSDs were not shown (not recognized) in the BIOS.
After the system had cooled down for at least one hour, the SSDs were recognized again and booting was possible.
This problem has existed since the system was first put into operation in late 2018. We flashed every new BIOS release, without success. Still, we could live with it and hoped that a BIOS update would solve the problem sooner or later.
NEW INSTALL, WORSENED PROBLEM with SATA controller (spring 2020)
This spring, the system was reinstalled on two SATA SSDs with the latest Ubuntu 20.04 LTS.
Now the connection to the second SSD (which held the home directory) was lost during operation, after 2 to 5 days of uptime.
THIRD INSTALL, WORSENED PROBLEM with SATA controller even with a single SSD (summer 2020)
We reinstalled the system once more, this time on a single disk only. But the problem recurred.
We considered replacing the motherboard, but I found that the SATA controller is integrated into the Threadripper CPU. We also tried a different SATA port, without success.
Does anybody have an idea what we could do? Are there any AMD engineers here?
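To narrow this down, the kernel log could be scanned for SATA link events around the time a disk drops out. The following is only a minimal sketch: the function name find_ata_events and the keyword list are illustrative assumptions based on the usual wording of libata kernel messages, and the list is certainly not exhaustive.

```python
import re
import sys

# Matches a libata port prefix such as "ata3:" or "ata3.00:".
ATA_PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")

# Phrases that libata commonly prints when a SATA link misbehaves.
# Assumption: this list is illustrative, not complete.
KEYWORDS = (
    "link down",
    "hard resetting link",
    "failed command",
    "exception emask",
)


def find_ata_events(lines):
    """Return (port, line) pairs for lines that look like SATA link trouble."""
    events = []
    for line in lines:
        match = ATA_PORT.search(line)
        if match and any(key in line.lower() for key in KEYWORDS):
            events.append((match.group(1), line.rstrip("\n")))
    return events


if __name__ == "__main__" and not sys.stdin.isatty():
    # Reads kernel messages from a pipe and prints one event per line.
    for port, line in find_ata_events(sys.stdin):
        print(port, line, sep="\t")
```

Saved under a hypothetical name such as ata_events.py, it could be fed with `journalctl -k | python3 ata_events.py`. If the same port shows up every time, the timestamps of those lines could be compared with system load or temperature at that moment.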
Thanks for all your help and support.