Based on your description, it is difficult to determine whether this is a CPU issue or a vendor platform issue. I would recommend that you reach out to Supermicro for troubleshooting assistance.
Their contact information can be found here.
Maybe have a look at the thread "epyc 7551 spontaneously resets after 10mins rendering".
It seems like some strange things are going on under full load.
Some questions as well:
Are you using the built-in AST VGA?
What is your IPMI firmware version?
We are using the built-in AST onboard VGA.
We are on the latest BIOS and IPMI firmware from Supermicro: BIOS 1.0b and IPMI 1.27.
We tested a few sets of memory and a few sets of EPYC CPUs. Most likely it is a Supermicro platform issue.
I am not sure how CentOS issues the reboot command to the hardware. It seems like the reboot command doesn't trigger a complete BIOS reboot. A power cycle always works.
Supermicro hasn't experienced this issue yet. Does anyone know the chain of events that triggers a warm reboot from the OS?
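On the Linux side, the kernel chooses its reset method from the reboot= boot parameter (the documented x86 values include warm, cold, bios, acpi, kbd, triple, efi, pci, and force). As far as I know, x86 kernels generally try the ACPI reset register first by default and fall back to other methods, so it may be worth checking what this machine was booted with. A minimal Python sketch, assuming a hypothetical helper name reboot_modes, that reads /proc/cmdline and lists any reboot= options:

```python
from pathlib import Path
from typing import List


def reboot_modes(cmdline: str) -> List[str]:
    """Return the comma-separated values of every reboot= option, in order."""
    modes: List[str] = []
    for token in cmdline.split():
        if token.startswith("reboot="):
            # Split "warm,pci,force" into its parts and skip empty entries.
            modes.extend(m for m in token[len("reboot="):].split(",") if m)
    return modes


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    # Fall back to an empty command line on systems without /proc.
    cmdline = path.read_text() if path.exists() else ""
    modes = reboot_modes(cmdline)
    print("reboot= options:", modes or "none set (kernel default applies)")
```

If nothing is set, booting once with a different value such as reboot=pci or reboot=efi might show whether another reset path behaves differently on this board. That is only a suggestion for narrowing things down, not a confirmed fix.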
I had a lot of issues with IPMI 1.27 (and with the not-yet-released 1.28 too).
With 1.27 my fans go mad, and some things cannot be set at all from the web interface.
I also noticed that sometimes shutdown -h now, issued from the OS, didn't work.
I would ask Supermicro to give you the old IPMI firmware and test with that.
That is what I did; I now have IPMI 1.14 firmware installed.
After you downgrade, be sure to AC power cycle the box.
If that doesn't help we can try to debug the issue.
Going from IPMI 1.14 to 1.26, the fan speeds seem steadier; on the older IPMI firmware, the fan speeds ramp up and down more often. Going from BIOS 1.0a to BIOS 1.0b seems to have fixed the random MCE errors that were logged in /var/log/messages (CentOS 7.4). I will try the 1.14 firmware and see if the issue still occurs.
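To compare MCE activity before and after a BIOS change, it can help to count the matching log lines per day instead of eyeballing the file. A rough Python sketch, assuming the usual syslog line layout and a few search patterns that are my guess at what the kernel prints (the function name count_mce_lines and the patterns are hypothetical, so adjust them to your actual log lines):

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable

# Assumed patterns, not taken from this thread; adjust to your kernel's output.
MCE_PATTERN = re.compile(r"mce:|machine check|\[Hardware Error\]", re.IGNORECASE)


def count_mce_lines(lines: Iterable[str]) -> Counter:
    """Count matching lines per day, keyed by the first two syslog fields (e.g. 'Mar 5')."""
    counts: Counter = Counter()
    for line in lines:
        if MCE_PATTERN.search(line):
            fields = line.split()
            day = " ".join(fields[:2]) if len(fields) >= 2 else "unknown"
            counts[day] += 1
    return counts


if __name__ == "__main__":
    log = Path("/var/log/messages")
    try:
        with log.open(errors="replace") as fh:
            for day, n in count_mce_lines(fh).items():
                print(day, n)
    except OSError as exc:
        # Missing file or insufficient permissions (the log usually needs root).
        print("could not read", log, "-", exc)
```

Running it once on the log from the 1.0a period and once after the 1.0b update would give a simple per-day count to compare.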
Thanks for your quick replies!
Please make sure you update to the latest released BIOS from Supermicro and that you are also on the latest BMC firmware.
The issue is present with the latest 1.0b BIOS version and the latest 1.27 BMC firmware.
I tried BMC/IPMI firmware version 1.14, but the issue still persists.
So far, we have tried BIOS 1.0a and 1.0b; BMC/IPMI 1.14, 1.26, and 1.27; 3 H11-Dsi Supermicro motherboards; 3 sets of AMD EPYC processors; and 3 sets of 16 DIMMs (8GB, 16GB). The same issue persists.
I am guessing it might be a BIOS issue or an ACPI issue?