I just finished building a system with dual EPYC 7551 CPUs on the Supermicro H11DSi-NT motherboard.
The build went very well, and the system is running just fine, and the performance is extraordinary.
I am using Fedora 27 Linux, but have access to about 20 different Linux distros, as
I maintain cinelerra-5.1. I need to build for these distros to post deliverables periodically.
That said, it actually takes a little effort to cook up a way to load it to capacity.
I can run a full Linux kernel build from Linus Torvalds' git repo in about 11 minutes, no problems.
Running make -j200 saturates the machine for over 10 minutes. Very nice.
However, if I start 50 background render clients and run a batch DVD render using the
render farm, the machine nearly always resets spontaneously after about 10 minutes (no warning
or log messages, just as if the reset button had been pushed). The motherboard is equipped
with IPMI, which allows you to monitor "server health" (thermal sensors, voltages, fans).
None of the measured parameters is anywhere near its limits. Everything looks
just fine, but the reset is highly reproducible.
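
Since the reset leaves nothing in the logs, one thing that might help is a trace of the
IPMI readings taken right up to the moment it happens. Below is a minimal sketch, not a
tested tool. It assumes ipmitool is installed and that "ipmitool sensor" prints
pipe-separated rows (name | value | unit | status | thresholds...). The file name
sensor_trace.log and all function names are made up for illustration. Each snapshot is
fsync'ed so the last line written should survive an abrupt reset.

```python
import os
import subprocess
import time


def parse_sensor_table(text):
    """Parse pipe-separated sensor rows into {name: (value, unit, status)}.

    Assumes the first four columns are name, value, unit, status.
    Rows with fewer than four columns are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue
        readings[fields[0]] = (fields[1], fields[2], fields[3])
    return readings


def format_snapshot(timestamp, readings):
    """Render one timestamped line containing every sensor reading."""
    parts = [
        "%s=%s %s [%s]" % (name, value, unit, status)
        for name, (value, unit, status) in sorted(readings.items())
    ]
    return "%.0f " % timestamp + "; ".join(parts)


def append_durably(path, line):
    """Append a line and force it to disk before returning."""
    with open(path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def poll(path, interval_s=1.0):
    """Log a sensor snapshot every interval_s seconds until interrupted."""
    while True:
        out = subprocess.run(
            ["ipmitool", "sensor"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        ).stdout
        append_durably(path, format_snapshot(time.time(), parse_sensor_table(out)))
        time.sleep(interval_s)


# Hypothetical usage, started just before launching the render job:
# poll("sensor_trace.log")
```

The BMC may take longer than a second to answer each query, so the effective sample
rate could be lower than requested. If the last few snapshots before the reset still
look normal, that would at least rule out a slow thermal or voltage drift.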
This job does not saturate the machine. It runs at about 85% utilization, probably due
to I/O delays created by 50 clients accessing media files. The silence is conspicuous because
the kernel panic code normally outputs all kinds of logging, and tries to resuscitate the machine
in a pretty vigorous way. None of that happens. It is as if the reset button was pushed.
Can an HT sync/reset packet do this?
If anyone in silicon validation would like to try this,
I will be glad to help set up a test case.
This is sort of tricky to set up.
I am a skilled Linux developer, and I can set up a kdb session to trap the reset,
but I suspect it is vectoring to the BIOS reset, not the kernel, so this may not
be of any help. I am open to suggestions.
gg
PS: attached: bill_of_materials, dmidecode, lspci