Server Gurus Discussions

sho1sho1sho1
Adept I

EPYC 7351 + Supermicro H11-DSi soft reboot blank screen

We have a few Supermicro H11-DSi servers with dual EPYC 7351 CPUs and 16 x 8GB DIMMs (Micron 2666).  After running at full load (LQCD benchmark) for longer than 5 to 10 minutes, issuing the reboot command in CentOS 7.4 causes the system to go to a blank screen.  The system will not POST.  A power cycle brings the system back to POST every time.  Without running at full load, issuing the reboot command works most of the time.  Has anyone encountered this issue?  We are on BIOS 1.0b.  I am just not sure if it's a memory problem or a BIOS problem.

Please help!

0 Likes
8 Replies
Anonymous
Not applicable

Hi sho1sho1sho1,

Based on your description, it is difficult to determine if this is a CPU or vendor platform issue. I would recommend that you reach out to Super Micro for troubleshooting assistance.

Their contact information can be found here.

0 Likes
abucodonosor
Adept III

Hi,

Maybe have a look at the thread "epyc 7551 spontaneously resets after 10mins rendering".

It seems like some strange things are going on under full load.

Some questions as well:

Are you using the built-in AST VGA?

What is your IPMI firmware version?

Regards,

Gabriel C

0 Likes

Hi Gabriel,

We are using the built-in AST onboard VGA. 

We are on the latest BIOS and firmware versions from Supermicro: BIOS 1.0b and IPMI 1.27.

We tested a few sets of memory and a few sets of EPYC CPUs.  Most likely it is a Supermicro platform issue.

I am not sure how CentOS issues the reboot command to the hardware.  It seems like the reboot command doesn't trigger a complete BIOS reboot.  Power cycle always works...

Supermicro hasn't experienced this issue yet.  Does anyone know what chain of events triggers a warm reboot from the OS?
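As far as I understand it, the reboot command on CentOS 7 ends in the kernel's reboot system call, and on x86 the kernel then resets the machine with one of several methods (ACPI, keyboard controller, PCI reset register, EFI, BIOS, triple fault). The method can be overridden with the `reboot=` kernel parameter, so it may be worth checking what the kernel was booted with. Below is a minimal Python sketch, assuming the letter codes from the kernel parameter documentation; the function name `parse_reboot_param` is hypothetical and the parsing is a simplification of what the kernel does.

```python
import os

# Letter codes from the x86 `reboot=` format in the kernel parameter docs.
# This is a simplified reading of that format, not the kernel's own parser.
MODES = {"w": "warm", "c": "cold", "h": "hard", "g": "gpio"}
METHODS = {"b": "bios", "a": "acpi", "k": "kbd",
           "t": "triple", "e": "efi", "p": "pci"}


def parse_reboot_param(cmdline):
    """Return mode, method and force flag requested via `reboot=`.

    Fields left at None/False were not set on the command line,
    which means the kernel's built-in default applies.
    """
    result = {"mode": None, "method": None, "force": False}
    for word in cmdline.split():
        if not word.startswith("reboot="):
            continue
        for token in word[len("reboot="):].split(","):
            if not token:
                continue
            if token.startswith("smp"):
                continue  # CPU selection (smp####), not a reset method
            key = token[0]
            if key == "s":
                result["mode"] = "soft"
            elif key in MODES:
                result["mode"] = MODES[key]
            elif key in METHODS:
                result["method"] = METHODS[key]
            elif key == "f":
                result["force"] = True
    return result


if __name__ == "__main__":
    path = "/proc/cmdline"  # Linux only
    if os.path.exists(path):
        with open(path) as f:
            print(parse_reboot_param(f.read()))
```

If no method is set, one experiment would be to boot with, say, `reboot=pci` or `reboot=kbd` and repeat the load test, to see whether the hang depends on the reset method.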

Simon.

0 Likes

sho1sho1sho1

I had a lot of issues with IPMI 1.27 (and with the not-yet-released 1.28 too).

With 1.27 my fans go mad, and some things cannot be set at all from the web interface.

I also noticed that shutdown -h now, issued from the OS, sometimes didn't work.

I would ask Supermicro to give you the old IPMI firmware and test with that one.

That is what I did. I now have IPMI firmware 1.14 installed.

After you downgrade, be sure to AC power cycle the box.

If that doesn't help we can try to debug the issue.

Regards

Hi,

Going from 1.14 to 1.26, the fan speeds seem to be steadier.  On the older IPMI firmware, the fan speeds ramp up and down more often.  Going from BIOS 1.0a to BIOS 1.0b seems to have fixed the random MCE errors that get logged in /var/log/messages (CentOS 7.4).  I will try the 1.14 firmware and see if the issue still occurs.
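To compare BIOS versions on the MCE side, a small script can pull the machine-check lines out of the log. This is only a sketch: the search patterns ("mce:" and "machine check") are an assumption about how the kernel words these messages, and the helper name `find_mce_lines` is made up for illustration.

```python
import os
import re

# Assumed patterns: kernel machine-check reports usually contain "mce:"
# or "Machine check"; the exact wording varies by kernel version.
MCE_PATTERN = re.compile(r"mce:|machine check", re.IGNORECASE)


def find_mce_lines(lines):
    """Return the lines that look like machine-check reports."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


if __name__ == "__main__":
    path = "/var/log/messages"  # usually readable by root only
    if os.path.exists(path) and os.access(path, os.R_OK):
        with open(path, errors="replace") as f:
            for hit in find_mce_lines(f):
                print(hit)
```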

Thanks for your quick replies!

0 Likes

sho1sho1sho1

Yes, BIOS 1.0b fixes the MCE22 errors. However, something is wrong with IPMI firmware >= 1.27.

At least here.

0 Likes
bob_shaw
Staff

Please make sure you update to the latest released BIOS from Supermicro and that you are also on the latest firmware for the BMC.

0 Likes

Hi Bob,

The issue is present with the latest 1.0b BIOS version and the latest 1.27 BMC firmware.

I tried BMC/IPMI firmware version 1.14, but the issue still persists.

So far, we have tried BIOS 1.0a and 1.0b; BMC/IPMI 1.14, 1.26, and 1.27; 3 Supermicro H11-DSi motherboards; 3 sets of AMD EPYC processors; and 3 sets of 16 DIMMs (8GB, 16GB).  The same issue persists.

I am guessing it might be a BIOS issue or an ACPI issue?

0 Likes