We have a few Supermicro H11-DSi servers with dual EPYC 7351 CPUs and 16 x 8GB DIMMs (Micron 2666). After running a full load (an LQCD benchmark) for longer than 5 to 10 minutes, issuing the reboot command in CentOS 7.4 causes the system to go to a blank screen, and it will not POST. A power cycle brings the system back to POST every time. Without running a full load, the reboot command works most of the time. Has anyone encountered this issue? We are on BIOS 1.0b. I am just not sure whether it's a memory problem or a BIOS problem.
Based on your description, it is difficult to determine whether this is a CPU issue or a vendor platform issue. I would recommend that you reach out to Supermicro for troubleshooting assistance.
Their contact information can be found here.
Maybe have a look at the thread "epyc 7551 spontaneously resets after 10mins rendering".
It seems like some strange things are going on under full load.
Some questions as well:
Are you using the built-in AST VGA?
What is your IPMI firmware version?
We are using the built-in AST onboard VGA.
We are on the latest BIOS and firmware versions from Supermicro: BIOS 1.0b and IPMI 1.27.
We tested a few sets of memory and a few sets of EPYC CPUs. Most likely it is a Supermicro platform issue.
I am not sure how CentOS issues the reboot command to the hardware. It seems like the reboot command doesn't trigger a complete BIOS reboot. A power cycle always works.
Supermicro hasn't experienced this issue yet. Does anyone know what chain of events triggers a warm reboot from the OS?
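On the question of how a warm reboot is triggered: on x86 Linux, the method the kernel uses for the final reset can be selected with the `reboot=` boot parameter (for example `reboot=acpi`, `reboot=kbd`, `reboot=pci`, `reboot=efi`, `reboot=bios`, `reboot=triple`). Trying a different method is one way to check whether a particular reset path is at fault. The following is a minimal Python sketch, not something posted in this thread, that reports which options are set on a kernel command line. The names `parse_reboot_param`, `describe`, and `REBOOT_OPTIONS` are illustrative, and the description table is an assumption based on the kernel's boot parameter documentation, so check it against the kernel you actually run.

```python
from typing import Dict, List

# First-letter meanings of x86 `reboot=` options, following the kernel's
# boot parameter documentation. Treat this table as an assumption and
# verify it for your kernel version.
REBOOT_OPTIONS: Dict[str, str] = {
    "w": "warm reboot mode",
    "c": "cold reboot mode",
    "b": "reboot via BIOS call",
    "a": "reboot via ACPI reset register",
    "k": "reboot via keyboard controller",
    "t": "reboot via triple fault",
    "e": "reboot via EFI runtime service",
    "p": "reboot via PCI reset register (port 0xCF9)",
    "f": "force: do not stop other CPUs before rebooting",
}


def parse_reboot_param(cmdline: str) -> List[str]:
    """Return the comma-separated values of the last `reboot=` argument."""
    values: List[str] = []
    for token in cmdline.split():
        if token.startswith("reboot="):
            values = [v for v in token[len("reboot="):].split(",") if v]
    return values


def describe(values: List[str]) -> List[str]:
    """Describe each option by its first letter only."""
    return [
        REBOOT_OPTIONS.get(v[0].lower(), "unrecognized option: " + v)
        for v in values
    ]
```

To inspect the running system, pass the contents of `/proc/cmdline` to `parse_reboot_param`. An empty result means no `reboot=` parameter is set and the kernel falls back to its default method.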
I have had a lot of issues with IPMI 1.27 (and with the not-yet-released 1.28 too).
With 1.27 my fans go mad, and some things cannot be set at all from the web interface.
I also noticed that shutdown -h now, issued from the OS, sometimes didn't work.
I would ask Supermicro to give you the old IPMI firmware and test with that.
That is what I did; I now have IPMI firmware 1.14 installed.
Be sure to AC power cycle the box after you downgrade.
If that doesn't help we can try to debug the issue.
Going from 1.14 to 1.26, the fan speeds seem more steady. On the older IPMI firmware, the fan speeds ramp up and down more often. Going from BIOS 1.0a to BIOS 1.0b seems to have fixed the random MCE errors that were logged in /var/log/messages (CentOS 7.4). I will try the 1.14 firmware and see if the issue still occurs.
Thanks for your quick replies!
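For anyone comparing BIOS versions the same way, machine check (MCE) entries in the system log can be pulled out with a short script. This is a minimal Python sketch under stated assumptions: `/var/log/messages` is the default CentOS syslog file, while the match patterns (`mce:` and `machine check`) and the names `find_mce_lines` and `scan_log` are illustrative guesses at what such entries look like, not verified against the logs from these servers.

```python
import re
from typing import Iterable, List

# Assumed patterns for machine check entries; adjust to the actual log text.
MCE_PATTERN = re.compile(r"\bmce:|machine check", re.IGNORECASE)


def find_mce_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that look like machine check (MCE) log entries."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


def scan_log(path: str = "/var/log/messages") -> List[str]:
    """Scan a log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as f:
        return find_mce_lines(f)
```

Running `scan_log()` before and after a BIOS update (reading the log usually requires root) gives a rough count of MCE entries to compare between versions.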
The issue is present with the latest 1.0b BIOS version and the latest 1.27 BMC firmware.
I tried BMC/IPMI firmware version 1.14, but the issue persists.
So far, we have tried BIOS 1.0a and 1.0b; BMC/IPMI 1.14, 1.26, and 1.27; 3 Supermicro H11-DSi motherboards; 3 sets of AMD EPYC processors; and 3 sets of 16 DIMMs (8GB, 16GB). The same issue persists in every case.
I am guessing it might be a BIOS issue or an ACPI issue?