I am using an Inspur server in a 2-CPU, 8-GPU configuration: two EPYC 7543 CPUs and eight RTX 3090 graphics cards, running Ubuntu. When I checked the CPU temperature with the "sensors" command, the Tctl/Tdie reading looked abnormal. At idle it is above 80 degrees Celsius, while the Tccd readings are generally only 30-40 degrees Celsius. When I run my machine learning code, Tctl/Tdie can reach 95 degrees Celsius. When I use all 8 graphics cards at the same time, the program fails with an unknown error, while the same code runs without problems on other machines with the same configuration. The "sensors" output is:
k10temp-pci-00cb
Adapter: PCI adapter
Tctl: +93.9°C
Tdie: +93.9°C
Tccd1: +41.8°C
Tccd2: +42.8°C
Tccd3: +36.2°C
Tccd4: +34.5°C
Tccd5: +36.8°C
Tccd6: +40.5°C
Tccd7: +40.5°C
Tccd8: +39.0°C

nvme-pci-c400
Adapter: PCI adapter
Composite: +30.9°C (low = -273.1°C, high = +79.8°C)
           (crit = +82.8°C)
Sensor 1: +30.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +41.9°C (low = -273.1°C, high = +65261.8°C)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +95.2°C
Tdie: +95.2°C
Tccd1: +42.8°C
Tccd2: +38.8°C
Tccd3: +38.2°C
Tccd4: +37.8°C
Tccd5: +44.2°C
Tccd6: +43.5°C
Tccd7: +41.8°C
Tccd8: +37.0°C
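
To make the gap easier to compare between this machine and the ones that work, here is a minimal sketch that parses the plain-text "sensors" output and reports, for each chip, how far Tctl sits above the hottest Tccd. The function names parse_sensors, tctl_gaps and read_sensors are my own and not part of lm-sensors, and the parser assumes the "label: +NN.N°C" layout shown above. It only measures the gap and does not say what causes it.

```python
import re
import subprocess

# Matches lines such as "Tctl: +93.9°C" or "Composite: +30.9°C (low = ...)".
TEMP_LINE = re.compile(r"^([^:]+):\s*\+?(-?\d+(?:\.\d+)?)\s*°C")


def parse_sensors(text):
    """Parse plain `sensors` output into {chip_name: {label: temp_celsius}}."""
    chips = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TEMP_LINE.match(line)
        if match:
            if current is not None:
                chips[current][match.group(1).strip()] = float(match.group(2))
        elif ":" not in line and not line.startswith("("):
            # A line without a colon is treated as a chip name, e.g. "k10temp-pci-00cb".
            current = line
            chips[current] = {}
        # Everything else ("Adapter: ...", "(crit = ...)") is ignored.
    return chips


def tctl_gaps(chips):
    """Return {chip_name: Tctl minus hottest Tccd} for chips that report both."""
    gaps = {}
    for name, temps in chips.items():
        ccd = [value for label, value in temps.items() if label.startswith("Tccd")]
        if "Tctl" in temps and ccd:
            gaps[name] = round(temps["Tctl"] - max(ccd), 1)
    return gaps


def read_sensors():
    """Run the `sensors` command (requires lm-sensors) and return its text output."""
    return subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    ).stdout
```

Calling tctl_gaps(parse_sensors(read_sensors())) on each machine gives one number per CPU socket, which can then be compared between this server and the ones where the same code runs correctly.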