cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

goodguy
Adept II

epyc 7551 spontaneously resets after 10mins rendering

I just finished building a system, dual 7551 epyc cpus using the supermicro H11DSi-NT motherboard.

The build went very well, and the system is running just fine, and the performance is extraordinary.

I am using Fedora 27 linux, but have access to about 20 different linux distros, as

I maintain cinelerra-5.1.  I need to build these distros to post deliverables periodically.

The build went very well, and the system is running just fine, but...

It actually takes a little effort to cook up a way to load it to capacity.

I can run a full linux build of  Linus Torvalds git repo in about 11 mins, no problems.

Using: make -j200   this saturates the machine for over 10 minutes.  Very nice.

However,

If you start 50 background render clients, and run a batch dvd render using the

render farm, I see that it nearly always spontaneously resets (no warning or log messages,

just as if the reset button was pushed) after about 10 minutes.  The motherboard is equipped

with IPMI which allows you to monitor "server health" (thermal sensors, voltages, fans).

There are no measured parameters which are even close to any rails.  Everything looks

just fine, but it is highly reproducible.

This job does not saturate the machine.  It runs at about 85% utilization, probably due

to io delays created by 50 clients accessing media files.  It is conspicuous because all

of the kernel panic code outputs all kinds of logging, and tries to resuscitate the machine

in a pretty vigorous way.  This does not happen.  It is as if the reset button was pushed.

Can a HT sync/reset packet do this?

If anyone in silicon validation would like to try this,

I will be glad to help set up a test case.

This is sort of tricky to setup.

I am a skilled linux developer, and I can set up a kdb session to trap the reset,

but I suspect it is vectoring to the bios reset, not the kernel, and so this may not

be of any help, but I am open to suggestions.

gg

PS: attached: bill_of_materials, dmidecode, lspci

0 Likes
1 Solution
Anonymous
Not applicable

AMD has identified an issue with the Linux cpuidle subsystem whereby a system using a newer kernel(4.13 or newer) with SMT enabled (BIOS default) and global C state control enabled (also BIOS default) may exhibit an unexpected reboot. The likelihood of this reboot is correlated with the frequency of idle events in the system. AMD has released updated system firmware to address this issue. Please contact your system provider for a status on this updated system firmware. Prior to the availability of this updated system firmware, you can work around the issue with the following option:

Boot the kernel with the added command line option idle=nomwait

Thank you goodguy and abucodonosor for providing us with the workload that allowed us to replicate the issue you were experiencing. Also, I would like to recognize koralle for understanding how to implement a workaround in the meantime, independent of our findings and recommendations. 

View solution in original post

56 Replies