PC Processors

jkberlin
Journeyman III

Threadripper 2990WX loses SATA connection

We are running several systems with Linux/Nvidia-CUDA for molecular modeling.

Threadripper CPUs in particular are very effective and are our "best choice" together with the latest NVIDIA GPUs (1080 ... 2080 Ti).

Only one system has been producing errors constantly for months:

CPU Threadripper 2990WX, Board Asus Prime X399-A

FIRST PROBLEM with SATA controller (late 2018)

We apply upgrades weekly and reboot the systems afterwards. This system could not reboot in "hot condition": the SATA SSDs were not shown / not recognized in the BIOS.

After cooling down the system for at least one hour, the SSDs were recognized and booting was possible.

This problem has existed since the system was first set up in late 2018. We flashed every new BIOS release, without success. We could live with that, though, and hoped that a BIOS update would solve the problem sooner or later.

NEW INSTALL, ENHANCED PROBLEM with SATA controller (spring 2020)

This spring, the system was reinstalled on two SATA SSDs with the latest Ubuntu 20.04 LTS.

Now the connection to the second SSD (which contained the home directory) was lost during operation, after 2 to 5 days of uptime.
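To pin down when a drive actually drops off, one option is to scan the kernel log for SATA link messages. The following is only a minimal sketch: the function name `find_ata_events`, the keyword list, and the sample log lines are illustrative assumptions, not output from the system described here.

```python
import re

# Matches libata-style prefixes such as "ata3:" or "ata3.00:".
ATA_PREFIX = re.compile(r"\bata(\d+)(?:\.\d+)?:\s*(.*)")

# Keywords chosen for illustration; adjust them to what your kernel prints.
KEYWORDS = ("link down", "hard resetting link", "exception", "failed command")


def find_ata_events(log_text):
    """Return (port, message) tuples for lines that look like SATA link trouble."""
    events = []
    for line in log_text.splitlines():
        match = ATA_PREFIX.search(line)
        if not match:
            continue
        port, message = int(match.group(1)), match.group(2).strip()
        if any(keyword in message.lower() for keyword in KEYWORDS):
            events.append((port, message))
    return events


# Illustrative sample only; real input would be saved kernel log text.
sample = """[ 1234.5] ata3: SATA link down (SStatus 0 SControl 300)
[ 1234.6] usb 1-1: new high-speed USB device
[ 1240.0] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
"""

for port, message in find_ata_events(sample):
    print(f"ata{port}: {message}")
```

In practice the text would come from saved `dmesg` or `journalctl -k` output rather than the sample string, and the timestamps of the matching lines could then be compared against load and temperature.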

THIRD INSTALL, ENHANCED PROBLEM with SATA controller even with a single SSD (summer 2020)

Again, we reinstalled the system, this time on a single disk only. But the problem recurred.

We were thinking about changing the motherboard, but I looked into it and found that the SATA controller is integrated on the Threadripper CPU. We also changed the SATA port, without success.

Does anybody have an idea what we could do? Are there any AMD engineers out here?

Thanks for all your help and support.

1 Reply
Thanny
Miniboss

If the problem occurs when the computer is in a "hot condition", the obvious first guess would be to make sure you have adequate CPU cooling and case ventilation. The SATA controller is not on the CPU. It's on the X399 chipset. If you're letting that chipset overheat, that could explain your problem.
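One way to check the overheating theory on Linux is to log the temperatures the kernel exposes under `/sys/class/hwmon` and compare them with the times the SSD disappears. This is a rough sketch under assumptions: the helper name `read_hwmon_temps` is made up for illustration, and which sensors show up depends on the loaded drivers, so a chipset sensor may not be exposed at all on this board.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every readable temp*_input file."""
    temps = {}
    for chip in sorted(Path(base).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # temp1_input may have a matching temp1_label; fall back to "temp1".
            stem = sensor.name[: -len("_input")]
            label_file = chip / f"{stem}_label"
            label = label_file.read_text().strip() if label_file.exists() else stem
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[f"{chip_name}/{label}"] = int(sensor.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
    return temps
```

Calling `read_hwmon_temps()` once a minute and appending the result to a file with a timestamp would show whether the dropouts line up with a temperature rise.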