EPYC 7551 spontaneously resets after 10 mins rendering

Question asked by goodguy on Dec 28, 2017
Latest reply on Apr 23, 2018 by koralle

I just finished building a dual EPYC 7551 system on the Supermicro H11DSi-NT motherboard.

The build went very well, and the system is running just fine, and the performance is extraordinary.

 

I am using Fedora 27 Linux, but have access to about 20 different Linux distros, since I maintain cinelerra-5.1 and need to build on these distros to post deliverables periodically.

The build went very well, and the system is running just fine, but...

 

It actually takes a little effort to cook up a way to load it to capacity.

I can run a full Linux kernel build from Linus Torvalds' git repo in about 11 mins, no problems.

Using make -j200, this saturates the machine for over 10 minutes. Very nice.

 

However, if I start 50 background render clients and run a batch DVD render using the render farm, the machine nearly always resets spontaneously after about 10 minutes (no warning or log messages, just as if the reset button had been pushed). The motherboard is equipped with IPMI, which lets you monitor "server health" (thermal sensors, voltages, fans). None of the measured parameters is anywhere near its limits. Everything looks just fine, but the reset is highly reproducible.
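One way to check whether a sensor moves in the last seconds before the reset is to log the readings to disk continuously and force each record out with fsync, so the final snapshot survives. This is only a minimal sketch: the function names, the log path, and the 5 second interval are hypothetical, and it assumes a sensor command such as `ipmitool sensor` is installed and can reach the BMC.

```python
import os
import subprocess
import time


def log_snapshot(log_path, command):
    """Run `command`, append its output under a timestamp, and sync it to disk."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} rc={result.returncode}\n")
        log.write(result.stdout)
        log.flush()
        # Without fsync the last records may still sit in the page cache
        # when the machine resets, and would be lost.
        os.fsync(log.fileno())
    return result.returncode


def monitor(log_path, command, interval_s=5.0, iterations=None):
    """Take snapshots forever, or `iterations` times if given."""
    count = 0
    while iterations is None or count < iterations:
        log_snapshot(log_path, command)
        count += 1
        time.sleep(interval_s)


# Hypothetical usage (assumes ipmitool is installed):
# monitor("/var/tmp/ipmi_sensors.log", ["ipmitool", "sensor"])
```

After the next reset, the tail of the log shows the last readings that reached the disk, which narrows down whether anything drifted shortly before the event.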

 

This job does not saturate the machine. It runs at about 85% utilization, probably due to I/O delays created by 50 clients accessing media files. The silent reset is conspicuous because the kernel panic code normally outputs all kinds of logging and tries to resuscitate the machine in a pretty vigorous way. None of that happens. It is as if the reset button was pushed.
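For anyone trying to reproduce the load level, a utilization figure like the 85% above can be derived from two samples of the aggregate `cpu` line in /proc/stat. The sketch below is an assumption about how to measure it, not the tool used for the number quoted here. The helper names are hypothetical, and iowait is counted as idle time, which is why I/O delays would show up as reduced utilization.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # Guest time is already included in user/nice, so only sum the first 8 fields.
    total = sum(fields[:8])
    return total - idle, total


def utilization(before, after):
    """Percent of non-idle time between two 'cpu' line samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy1 - busy0) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as f:
        return f.readline()
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `utilization()` gives the average load over that window.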

 

Can an HT sync/reset packet do this?

 

If anyone in silicon validation would like to try this, I will be glad to help set up a test case. It is sort of tricky to set up.

 

I am a skilled Linux developer, and I can set up a kdb session to trap the reset, but I suspect it is vectoring to the BIOS reset, not the kernel, so this may not be of any help. I am open to suggestions.

 

gg

PS: attached: bill_of_materials, dmidecode, lspci
