I just finished building a system: dual EPYC 7551 CPUs on the Supermicro H11DSi-NT motherboard.
The build went very well, the system is running just fine, and the performance is extraordinary.
I am using Fedora 27 Linux, but have access to about 20 different Linux distros, as
I maintain cinelerra-5.1. I need to build for these distros to post deliverables periodically.
The build went very well, and the system is running just fine, but...
It actually takes a little effort to cook up a way to load it to capacity.
I can run a full kernel build of Linus Torvalds' git repo in about 11 minutes, no problems.
Using make -j200, this saturates the machine for over 10 minutes. Very nice.
However,
If you start 50 background render clients and run a batch DVD render using the
render farm, the system nearly always spontaneously resets (no warning or log messages,
just as if the reset button was pushed) after about 10 minutes. The motherboard is equipped
with IPMI, which lets you monitor "server health" (thermal sensors, voltages, fans).
No measured parameter is even close to any of its limits. Everything looks
just fine, but the reset is highly reproducible.
This job does not saturate the machine. It runs at about 85% utilization, probably due
to I/O delays created by 50 clients accessing media files. It is conspicuous because the
kernel panic code normally outputs all kinds of logging and tries to resuscitate the machine
in a pretty vigorous way. None of that happens here. It is as if the reset button was pushed.
Can an HT sync/reset packet do this?
If anyone in silicon validation would like to try this,
I will be glad to help set up a test case.
It is sort of tricky to set up.
I am a skilled Linux developer, and I can set up a kdb session to trap the reset,
but I suspect it is vectoring to the BIOS reset handler, not the kernel, so this may not
be of any help. I am open to suggestions.
gg
PS: attached: bill_of_materials, dmidecode, lspci
AMD has identified an issue with the Linux cpuidle subsystem whereby a system using a newer kernel (4.13 or later) with SMT enabled (the BIOS default) and global C-state control enabled (also the BIOS default) may exhibit an unexpected reboot. The likelihood of this reboot is correlated with the frequency of idle events in the system. AMD has released updated system firmware to address this issue. Please contact your system provider for a status on this updated system firmware. Prior to the availability of this updated system firmware, you can work around the issue with the following option:
Boot the kernel with the added command line option idle=nomwait
Thank you goodguy and abucodonosor for providing us with the workload that allowed us to replicate the issue you were experiencing. Also, I would like to recognize koralle for understanding how to implement a workaround in the meantime, independent of our findings and recommendations.
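To make the idle=nomwait workaround persist across reboots, it can be baked into the bootloader configuration. A minimal sketch for a GRUB2-based distro such as Fedora (the file path and variable name below are the usual conventions, but verify them on your system, and remember to regenerate grub.cfg afterwards):

```shell
# Append idle=nomwait to the default kernel command line.
# On Fedora the defaults file is /etc/default/grub and the variable is
# GRUB_CMDLINE_LINUX; after editing, regenerate the config with
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg   (BIOS layout)
add_idle_nomwait() {
    grub_file="$1"    # e.g. /etc/default/grub
    sed -i 's/^GRUB_CMDLINE_LINUX="\([^"]*\)"/GRUB_CMDLINE_LINUX="\1 idle=nomwait"/' "$grub_file"
}
```

Run the function as root against /etc/default/grub, regenerate grub.cfg, and reboot; then confirm with `cat /proc/cmdline`.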
Hello,
I have a similar build, just with 2 x EPYC 7281, and a similar issue.
The system seems to reset itself once load is >80% and I/O is >50%.
Looking at it closer, the only pieces of hardware that can do that are the
watchdogs (which are not working in Linux right now anyway); also,
maybe the BMC has some sort of watchdog of its own (I can't find any good documentation
for it). Supermicro's manual for the motherboard is strange, too.
It looks to me like it is a matter of the memory configuration one is using
(4, 8, 12, or 16 RAM modules, which is completely undocumented in the manual)
and the SATA/PCIe/NVMe ports in use.
I use 4 x 32GB right now.
I'm using the internal M.2 port with a 'Samsung SSD 960 EVO 250GB' for the *system*
and have a second one in the PCIe x8 slot. Also 8 x 2TB NAS HDDs, 4 on each CPU's SATA ports
(using the vendor's cables).
Original configuration looked like this:
4 x 32GB RAM modules in D1/F1 (as the motherboard manual suggests)
PCI-e x8 CPU1 slot the second NVME
M.2 CPU1 the system NVME
( NVME_0 , NVME_1 port unused )
CPU1-SATA 4x 2TB HDD
CPU2-SATA 4x 2TB HDD
No go with that; stressing the system a bit, it just reboots itself
after 5 to 10 minutes.
I also turned on EDAC and MCE in the kernel, and now I see an MCE on CPU 24,
but I don't think it's real, since it occurs like this:
the BMC reports an error on Disk18, SMART Assertion (huh? I don't have 18 disks...),
followed by a correctable MCE in the kernel, e.g.:
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:24 (17:1:2) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000005a00000d
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
[Hardware Error]: Power, Interrupts, etc. Error: Error on GMI link.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
*and* it only occurs under high load; under normal load I never see it.
Now here is what I did to work around it for now:
First, get at least kernel 4.15-rc6 (this has fixed EDAC for EPYC).
Be sure you have EDAC turned on in your kernel config.
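A quick way to check that the running kernel meets that floor is a version-sort comparison (a sketch; sort -V does not treat -rcN tags specially, so compare against plain numeric versions):

```shell
# Succeeds if the running (or explicitly given) kernel version is at least $1.
kernel_at_least() {
    want="$1"
    have="${2:-$(uname -r)}"
    # sort -V orders version strings numerically; if $want sorts first
    # (or the two are equal), then $have >= $want.
    [ "$(printf '%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}
```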
On the hardware side:
Power off the box.
Pull out any PCIe cards and any HDDs you don't need, except the
HDD/SSD you boot the system from.
Power on, and in the BIOS turn OFF:
Watchdog
IOMMU
ACS
SR-IOV
PCIe Spread Spectrum
Core Performance Boost
Global C-state Control
and any PCIe/NVMe OPROMs you don't need.
Change:
Determinism Slider to Performance
Memory Clock to 2666 MHz
(if you use UEFI, change the remaining OPROMs to EFI)
Save and perform a power cycle.
Once the box is up, open the IPMI web interface.
Change FAN mode to HeavyIO.
Turn on the extra event features.
With that, it works as a workaround; I have been stressing the box with a loop compiling libreoffice
and the kernel tree with -j$core_count for nearly a day now.
I still see the MCE from time to time, and something may be wrong, but right now I'm not sure what to blame.
(PS: you can find me on freenode; just PM 'crazy' if you wish)
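For reference, the stress loop described above can be sketched roughly like this (a sketch only: it assumes an already-configured kernel or libreoffice build tree, and one make job per hardware thread as in the post):

```shell
# Rebuild a source tree in an endless loop to keep all cores loaded.
# $1 must be an already-configured build tree (kernel, libreoffice, ...).
stress_build_loop() {
    tree="$1"
    core_count="$(nproc)"    # one make job per hardware thread
    while :; do
        make -C "$tree" clean >/dev/null 2>&1
        make -C "$tree" -j"$core_count" >/dev/null 2>&1
    done
}
```

Watch IPMI sensors and the MCE log in another terminal while it runs.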
Wow... this is definitely what I see.
Right down to the "Hardware Error".
Thank you for responding!
Dec 24 13:27:06 xray.local.net kernel: mce: [Hardware Error]: Machine check events logged
Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Corrected error, no action required.
Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: CPU:40 (17:1:2) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000005a000009
Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Power, Interrupts, etc. Error: Error on GMI link.
Dec 24 13:27:07 xray.local.net kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
My current conjecture is that there are actually 2 problems:
1) the machine reboots with no warning or logging
2) there are single-bit errors on at least one bus (cache bus?)
Near the end of each month, I rebuild a package for about 16 distros. This means that I need to
maintain a wide set of linux versions, and have several past kernels available. I noticed that
older kernels do not seem to exhibit the "reboot" problem. Since starting a job and waiting
for a fail is the test procedure, and the time to fail is not predictable, these measurements are
"best guess" results. I have not run long term tests. Each test was less than 1 hour of stress.
I started with older fedora kernels, and worked towards the present deliverable.
4.8.14 ok, 4.11.12 ok, 4.13.12 ok, 4.13.15 fails, 4.14.7 fails.
I cannot tell what the controlling variables are. Running a kernel build saturates
the CPU utilization, but does not fail. Running render jobs is about 80% CPU, and fails.
Perhaps I/O utilization is a controlling variable.
The MCE errors do not seem to be load related. The coolers on the CPUs are
Arctic Freezer 240s, and I have monitored the CPU temperatures with IPMI while
running the high-load tests. The top temperature was 50 deg C. The IPMI indicates
limits of 100 deg C, so not even close.
Once, while trying to configure the X server, not even under load, I left up a hung
startx while I chased down log errors. While it was just sitting there, I got an MCE
hardware error.
I would like to add that the advertisements I used to buy the CPUs
(AMD PS7551BDAFWOF Epyc Model 7551 32C 3.0G 64MB 180W 2666MHZ)
claim the CPUs will run at 2666, but the actual rates are about 2000. This is what
you see from the BIOS and /proc/cpuinfo, though it is hard to tell for sure what is
really being used, since it varies over time. This is a little disappointing, since
some steps are serialized, and long jobs perform badly at low CPU clock rates.
> Now here is what I did to workaround for now:
> First get at least an kernel-4.15-rc6 ( this has fixed edac for epyc )
> Be sure you have EDAC turned on on kernel config.
I am currently trying to cobble together a 4.15.x kernel, and I will definitely try this.
Thank you for the suggestions. More to follow (probably).
gg
Not sure what distro you are running; however, to build the kernel, simply use the distro config.
Download the kernel tarball you'd like, unpack it, cd in there, then:
cp the_location_of_distro_config .config   (see /boot, e.g. config-$(uname -r), and/or /proc/config.gz)
make oldconfig
make prepare
make -j128 V=1
sudo make modules_install
sudo cp System.map /boot/System.map-$kerneluname   (take the exact version from the DEPMOD line)
sudo cp arch/x86/boot/bzImage /boot/vmlinuz-$kerneluname
For the dracut initramfs generator, just run:
sudo dracut -f --kver $kerneluname
and finally re-create grub.cfg by doing:
sudo grub-mkconfig -o /boot/grub/grub.cfg
I have now tested offlining CPU 24 (the one with the bit errors). The errors seem to have stopped, but I see in
the IPMI web interface event log the same 'Disk18 SMART error', which just cannot be true, since there
is no way to even have that many disks here without using the 2 NVMe ports... so that is something very strange.
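For anyone reproducing the offlining step: Linux exposes per-CPU hotplug through sysfs. A sketch (the sysfs root is a parameter here only so the function can be exercised against a dummy tree; on a real system it is /sys and writing it requires root):

```shell
# Take a CPU offline via the sysfs CPU-hotplug interface.
# e.g. offline_cpu 24   writes 0 to /sys/devices/system/cpu/cpu24/online
offline_cpu() {
    cpu="$1"
    sysfs_root="${2:-/sys}"
    echo 0 > "$sysfs_root/devices/system/cpu/cpu$cpu/online"
}
```

Writing 1 back to the same file brings the CPU online again.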
Maybe we are very unlucky, or there is something else going on?
BTW, if you try offlining on 4.14.x or 4.15-rc6, there is currently a BUG in the BLK MQ code. Before building,
open block/blk-mq.c line 1209; there is a buggy WARN_ON() there, at least for EPYC CPUs. Just comment it out.
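If you prefer to script that one-line edit rather than open an editor, something like this works (GNU sed; the line number 1209 comes from the post above and only applies to those particular kernel trees):

```shell
# Comment out one line of a C source file by line number (GNU sed -i).
# Example target from the post: block/blk-mq.c line 1209 (the WARN_ON()).
comment_out_line() {
    file="$1"
    line="$2"
    sed -i "${line}s|^|// |" "$file"
}
```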
Also, my test setup still runs with the workarounds I posted: >20 full libreoffice builds,
including checks, dicts, langs, etc., and nearly 54 kernel builds (allmodconfig), both built as distro packages,
so I generate a lot more I/O by removing/installing in chroots, etc.
I don't think it is a cooling problem; my CPUs are +/- 37 deg C, system temp +/- 45 deg C.
The only temperature that is higher is the MB_10G temp, sometimes over 63 deg C; everything else is far from hitting >50 deg C.
BTW, do you have any errors in your event log (IPMI) or the SMBIOS log in the BIOS?
It seems I hit 0x90 = Unknown CPU ?! in the SMBIOS log...
I did upgrade the BIOS to the latest 1.0a, and yes, I am getting very similar errors.
IPMI log errors are:
EID Time Stamp Sensor Name Sensor Type Description
7 2017/12/24 18:47:05 OEM HDD Disk 17 SMART failure - Assertion
8 2017/12/24 20:24:04 OEM HDD Disk17 SMART failure - Assertion
9 2017/12/27 18:49:17 OEM HDD Disk17 SMART failure - Assertion
10 2017/12/28 16:41:05 System Firmware Progress System Firmware Error (POST Error) - Assertion
11 2017/12/28 16:42:50 System Firmware Progress System Firmware Error (POST Error) - Assertion
12 2017/12/29 21:27:00 OEM HDD Disk17 SMART failure - Assertion
13 2018/01/01 20:50:53 OEM HDD DIsk17 SMART failure - Assertion
This data was from the IPMI log, but if you look at the BIOS SMBIOS log,
these are apparently the same events (12/29 and 01/01) as reported from the SMBIOS event log:
12/29/17 21:22:02 SMBIOS 0x90 N/A Description: unspecified processor / unrecognized
01/01/18 20:50:54 (the same, but only 2 of these over days)
It looks like it may be misreporting MCE hardware errors as SMART disk errors.
There are other errors that seem to be incorrect, such as:
[ 0.789694] pnp: PnP ACPI: found 6 devices
[ 0.799435] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.799689] pci 0000:01:00.0: BAR 7: no space for [mem size 0x00100000 64bit]
[ 0.799843] pci 0000:01:00.0: BAR 7: failed to assign [mem size 0x00100000 64bit]
[ 0.800061] pci 0000:01:00.0: BAR 10: no space for [mem size 0x00100000 64bit]
[ 0.800264] pci 0000:01:00.0: BAR 10: failed to assign [mem size 0x00100000 64bit]
[ 0.800478] pci 0000:01:00.1: BAR 7: no space for [mem size 0x00100000 64bit]
[ 0.800676] pci 0000:01:00.1: BAR 7: failed to assign [mem size 0x00100000 64bit]
[ 0.800946] pci 0000:01:00.1: BAR 10: no space for [mem size 0x00100000 64bit]
[ 0.801228] pci 0000:01:00.1: BAR 10: failed to assign [mem size 0x00100000 64bit]
[ 0.801500] pci 0000:00:01.1: PCI bridge to [bus 01]
[ 0.801683] pci 0000:00:01.1: bridge window [mem 0xef400000-0xef9fffff]
[ 0.801885] pci 0000:00:07.1: PCI bridge to [bus 02]
[ 0.802078] pci 0000:00:07.1: bridge window [mem 0xefa00000-0xefcfffff]
[ 0.802277] pci 0000:00:08.1: PCI bridge to [bus 03]
and
[ 15.079092] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
gg
Nearly the same here... I don't have the POST errors, and Disk17 is Disk18 here... also the same 0x90 in the BIOS log.
Also the same kernel warnings...
If you look closer, the kernel re-assigns the BARs later on, but...
I'll play with the PCIe Link options in the BIOS later...
Do you have anything that could be 'Disk17'? :) I don't know what kind of error that is, since here
it is very clear... I don't have 18 disks... Something seems very confused in some firmware...
@AMD guys, ping? :) Any ideas what this may be?
Hello abucodonosor and goodguy,
Thank you for submitting your issue to the community and helping us understand what is going on. I am currently working on getting you a solution. Please stay tuned.
@jesse_amd
Thx for looking into this. If you need any info, please let me know.
I can test any kind of patches, firmware updates, and so on.
@goodguy
can you test the following:
From IPMI, power cycle the box.
Go into your BIOS and set 'PCIe Link Training Type' from 1 to 2, save, and reboot.
Run your 'kill_the_box_setup'... see whether the MCE still occurs... if yes, add the following to the kernel command line:
isolcpus=40
Also, can you tell me what your disk layout looks like (used and unused disks)?
@goodguy, the most common cause I have seen of rogue reboots is DDR4 DIMMs that are not approved by the PCB manufacturer. If the specific DIMM you are using (Micron 18ASF2G72PDZ-2G6H1R from your dmidecode output) has not been tested and approved by SuperMicro, that is the first change I would suggest.
Beyond that, prudent debug steps to narrow the problem would be the following:
1) Set the memory clock speed to 2400 MHz in the System Settings
2) Disable SMT in System Settings (note this will reduce the number of CPUs so you may have to fix any scripts assigning affinity or a thread count)
3) Boot your system with the added kernel command line parameter processor_idle.max_cstate=0
I am not suggesting these as work-arounds, only as steps to narrow down the problem scope.
Note the MCE22 error applies to the Infinity Fabric interconnect between the two sockets. It is not likely related to or the cause of a rogue reboot. Occasional corrected errors do occur on this interface, however the error itself is not a concern and is not typically a canary of other system problems. That said, if either of the above changes increases or decreases the frequency of the error, that would be interesting to know.
Regards,
Lewis
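As an aside, the MC22_STATUS values quoted earlier in the thread can be decoded by hand. A sketch in bash (64-bit arithmetic required) pulling out the flags that the kernel decoder prints; the bit positions below are taken from public AMD MCA documentation, so verify them against your own references:

```shell
# Decode a few fields of an AMD SMCA MCA_STATUS value (bash, 64-bit arithmetic).
# Bits: 63=Val, 62=Over(flow), 61=UC (clear => corrected error), 59=MiscV,
# 53=SyndV; bits 21:16 = extended error code.
decode_mca_status() {
    status=$(( $1 ))
    printf 'Val=%d Over=%d UC=%d MiscV=%d SyndV=%d ExtErr=%d\n' \
        $(( (status >> 63) & 1 )) \
        $(( (status >> 62) & 1 )) \
        $(( (status >> 61) & 1 )) \
        $(( (status >> 59) & 1 )) \
        $(( (status >> 53) & 1 )) \
        $(( (status >> 16) & 0x3f ))
}
```

Running it on the value from the logs, decode_mca_status 0xd82000000002080b, reproduces what the kernel printed: Over set, UC clear (a corrected error), MiscV and SyndV set, extended error code 2.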