We recently bought a machine equipped with:
CPU AMD Ryzen 9 3950X,
RAM 128GB DDR4 3000MHz,
SSD 1TB + 2x HDD 6TB,
GPU NVIDIA GeForce RTX 3090 24GB,
OS Ubuntu 20.04 LTS,
PSU 850W Certified
We use the machine remotely for AI-based research. We have repeatedly run into an annoying bug whenever the CPU is under load. Specifically, the machine freezes completely and the console prints:
Message from syslogd@machinename at Feb 13 09:37:16 ...
kernel:[ 348.578682] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [systemd-journal:660]
After this message has been printed a few times, the machine is no longer reachable over SSH. The only way to recover is to physically restart it.
We have run experiments on the GPU for weeks without any problem, but as soon as we load the CPU for some tasks, the machine freezes and reports the error above.
Has anyone experienced the same problem? How can we solve it?
Thank you for your time and support.
I have the same issue with a 3900X and an RTX 3090 GPU. I also run Ubuntu 20.04 LTS, and I use the Python numba library for a fast binary search function. A dynamic library compiled with Cython also triggers this issue for me. Have you found any solutions?
Look in the systemd logs for previous errors associated with "watchdog" -
sudo journalctl -p err |grep "watchdog: BUG"
Also, look for which programs triggered the watchdog errors.
In my logs there are tons of watchdog errors, most of them triggered by Chrome and Firefox. For example -
watchdog: BUG: soft lockup - CPU#11 stuck for 1587s! [Chrome_~dThread:1629]
Only Chrome actually caused a system freeze, so I do not use it anymore. I suspect that it is a glibc issue.
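If the journal has many of these entries, it can help to tally them per program rather than reading them one by one. Here is a rough sketch in Python that counts soft-lockup reports by the program name inside the square brackets. It assumes the messages have the same format as the two examples in this thread; the script name count_lockups.py and the function name count_lockups are just placeholders I made up.

```python
import re
import sys
from collections import Counter

# Matches kernel lines shaped like the ones quoted in this thread, e.g.
#   watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [systemd-journal:660]
# Assumption: other kernels may word this message differently.
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def count_lockups(lines):
    """Return a Counter mapping program name -> number of soft-lockup reports."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            # Group 3 is the program (task) name, group 4 its PID.
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    # Read journal text from stdin and print the most frequent offenders first.
    for program, n in count_lockups(sys.stdin).most_common():
        print(f"{n:6d}  {program}")
```

It reads from standard input, so it can be fed with the same journalctl command as above, for example: sudo journalctl -p err | python3 count_lockups.py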