Server Gurus Discussions

goodguy
Adept II

epyc 7551 spontaneously resets after 10mins rendering

I just finished building a system with dual EPYC 7551 CPUs on the Supermicro H11DSi-NT motherboard.

The build went very well, and the system is running just fine, and the performance is extraordinary.

I am using Fedora 27 Linux, but have access to about 20 different Linux distros, since I maintain cinelerra-5.1 and need to build on these distros to post deliverables periodically.

The build went very well, and the system is running just fine, but...

It actually takes a little effort to cook up a way to load it to capacity.

I can run a full Linux kernel build from Linus Torvalds' git repo in about 11 minutes, no problems.

Using make -j200 saturates the machine for over 10 minutes.  Very nice.

However,

If I start 50 background render clients and run a batch DVD render using the render farm, the machine nearly always spontaneously resets (no warning or log messages, just as if the reset button was pushed) after about 10 minutes.  The motherboard is equipped with IPMI, which allows you to monitor "server health" (thermal sensors, voltages, fans).

None of the measured parameters is anywhere near its limits.  Everything looks just fine, but the reset is highly reproducible.

This job does not saturate the machine.  It runs at about 85% utilization, probably due to I/O delays created by 50 clients accessing media files.  The reset is conspicuous because the kernel panic code normally outputs all kinds of logging and tries to resuscitate the machine in a pretty vigorous way.  This does not happen.  It is as if the reset button was pushed.

Can a HT sync/reset packet do this?

If anyone in silicon validation would like to try this,

I will be glad to help set up a test case.

This is sort of tricky to setup.

I am a skilled Linux developer, and I can set up a kdb session to trap the reset, but I suspect it is vectoring to the BIOS reset, not the kernel, so this may not be of any help.  I am open to suggestions.

gg

PS: attached: bill_of_materials, dmidecode, lspci


Accepted Solutions
jesse_amd
Staff

Re: epyc 7551 spontaneously resets after 10mins rendering

AMD has identified an issue with the Linux cpuidle subsystem whereby a system using a newer kernel (4.13 or newer) with SMT enabled (BIOS default) and global C state control enabled (also BIOS default) may exhibit an unexpected reboot. The likelihood of this reboot is correlated with the frequency of idle events in the system. AMD has released updated system firmware to address this issue. Please contact your system provider for a status on this updated system firmware. Prior to the availability of this updated system firmware, you can work around the issue with the following option:

Boot the kernel with the added command line option idle=nomwait
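
To confirm the workaround is actually in effect after a reboot, the running kernel's command line can be checked. A minimal sketch, assuming a POSIX shell; the helper name has_kernel_param is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# has_kernel_param PARAM CMDLINE
# Succeeds if PARAM appears as a whole, space-separated word in CMDLINE.
# (Hypothetical helper, for illustration only.)
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the running kernel, if /proc/cmdline is readable on this system.
if [ -r /proc/cmdline ]; then
    if has_kernel_param "idle=nomwait" "$(cat /proc/cmdline)"; then
        echo "idle=nomwait is active"
    else
        echo "idle=nomwait is NOT on the kernel command line"
    fi
fi
```

On Fedora the option can usually be made persistent with grubby --update-kernel=ALL --args="idle=nomwait"; on other distros the usual route is GRUB_CMDLINE_LINUX in /etc/default/grub followed by regenerating grub.cfg.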

Thank you goodguy and abucodonosor for providing us with the workload that allowed us to replicate the issue you were experiencing. Also, I would like to recognize koralle for understanding how to implement a workaround in the meantime, independent of our findings and recommendations. 

56 Replies
abucodonosor
Adept III

Re: epyc 7551 spontaneously resets after 10mins rendering

Hello,

I have a similar build, just with 2 x EPYC 7281, and a similar issue.

It seems the system resets itself once load is >80% and I/O >50%.

Looking at that closer, the only pieces of HW that can do that are the watchdogs ( which are not working in Linux anyway right now ); maybe the BMC also has some sort of watchdog of its own ( I can't find any good documentation for the motherboard ). Also, Supermicro's manual for the motherboard is strange.

It looks to me like it is a matter of the memory configuration one is using ( 4 , 8 , 12 or 16 RAM modules, which is completely undocumented in the manual ) and of the SATA/PCI-E/NVME ports in use.

I use 4 x 32GB right now.

I'm using the internal M.2 port with a 'Samsung SSD 960 EVO 250GB' for *system*

and have a second one in the PCI-e x8 slot. Also 8 x 2TB NAS HDDs , 4 on each CPU SATA port ( using the vendor's cables ).

Original configuration looked like this :

4 x 32GB RAM Modules D1/F1 ( like the motherboard manual suggests )

PCI-e x8 CPU1 slot the second NVME

M.2 CPU1 the system NVME

( NVME_0 , NVME_1 port unused )

CPU1-SATA 4x 2TB HDD

CPU2-SATA 4x 2TB HDD

No go with that: when stressing the system a bit, it just reboots itself after 5 to 10 minutes.

I also turned on EDAC and MCE in the kernel, and now I see an mce on CPU24, but I don't think it is real, since it occurs like this:

The BMC reports an error on Disk18 , SMART Assertion ( huh? I don't have 18 disks .. ), followed by a correctable MCE in the kernel, e.g.:

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:24 (17:1:2) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000005a00000d
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
[Hardware Error]: Power, Interrupts, etc. Error: Error on GMI link.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)

*and* it only occurs under high load; under normal load I never see it.
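
To see which CPUs report these corrected errors and how often, the MC22_STATUS lines can be tallied from the kernel log. A rough sketch, assuming the log format shown above; count_mc22_by_cpu is a hypothetical helper name:

```shell
#!/bin/sh
# count_mc22_by_cpu: reads kernel log text on stdin and prints one line
# per CPU as "CPU:<n> <count>" for every MC22_STATUS record it finds.
# (Hypothetical helper; it assumes the "CPU:<n>" token format seen in
# the log excerpt in this thread.)
count_mc22_by_cpu() {
    awk '/MC22_STATUS/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^CPU:[0-9]+$/) n[$i]++
         }
         END { for (c in n) print c, n[c] }' | sort
}

# Typical use on a systemd-based distro (kernel messages only):
#   journalctl -k | count_mc22_by_cpu
# or, without systemd:
#   dmesg | count_mc22_by_cpu
```

If the counts cluster on one or two CPUs across several runs, that is worth mentioning when reporting the problem.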

Now here is what I did as a workaround for now:

First, get at least kernel-4.15-rc6 ( this has fixed EDAC for EPYC ).

Be sure you have EDAC turned on in the kernel config.

On the HW side:

Power off the box.

Pull out any PCI-e cards and any HDDs you don't need, leaving only the HDD/SSD that boots the system.

Power On and in BIOS turn OFF:

Watchdog

IOMMU

ACS

SR-IOV

PCIe Spread Spectrum

Core Performance Boost

Global C-state Control

and any PCI-e/NVME's OPROM's you don't need.

Change:

Determinism Slider to Performance

Memory Clock to 2666MHz

( if you use UEFI, change the remaining OPROMs to EFI )

Save and perform a Power Cycle.

Once the box is up, open the IPMI web interface.

Change the FAN mode to HeavyIO.

Turn on the extra event features.

Here this works as a workaround: I have been stressing the box with a loop compiling libreoffice and the kernel tree with -j$core_count for nearly a day now.

I see the mce from time to time, so something may be wrong, but right now I'm not sure who to blame.

( PS: you can find me on freenode, just PM crazy if you wish )

goodguy
Adept II

Re: epyc 7551 spontaneously resets after 10mins rendering

Wow... this is definitely what I see.

Right down to the "Hardware Error".

Thank you for responding!

  Dec 24 13:27:06 xray.local.net kernel: mce: [Hardware Error]: Machine check events logged
  Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Corrected error, no action required.
  Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: CPU:40 (17:1:2) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
  Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000005a000009
  Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
  Dec 24 13:27:06 xray.local.net kernel: [Hardware Error]: Power, Interrupts, etc. Error: Error on GMI link.
  Dec 24 13:27:07 xray.local.net kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)

My current conjecture is that there are actually 2 problems.

1) The machine reboots with no warning or logging

2) there are single bit errors in at least one bus (cache bus?)

Near the end of each month, I rebuild a package for about 16 distros.  This means that I need to

maintain a wide set of linux versions, and have several past kernels available.  I noticed that

older kernels do not seem to exhibit the "reboot" problem.  Since starting a job and waiting

for a fail is the test procedure, and the time to fail is not predictable, these measurements are

"best guess" results.  I have not run long term tests.  Each test was less than 1 hour of stress.

I started with older fedora kernels, and worked towards the present deliverable.

4.8.14 ok, 4.11.12 ok, 4.13.12 ok, 4.13.15 fails, 4.14.7 fails.
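
The accepted solution puts the affected range at kernel 4.13 and newer, which roughly matches the failures listed above (4.13.12 passing may simply reflect the short test time). A small sketch for checking whether a given kernel version falls in that range, assuming GNU sort with -V (version sort) is available; version_ge is a hypothetical helper name:

```shell
#!/bin/sh
# version_ge A B: succeeds if version string A >= version string B.
# Relies on "sort -V" (version sort), which is an assumption about the
# installed sort; GNU coreutils provides it.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Classify the kernels tested in this post against the 4.13 threshold.
for v in 4.8.14 4.11.12 4.13.12 4.13.15 4.14.7; do
    if version_ge "$v" 4.13; then
        echo "$v: 4.13 or newer"
    else
        echo "$v: older than 4.13"
    fi
done

# For the running kernel:
#   version_ge "$(uname -r)" 4.13 && echo "in the reported range"
```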

I cannot tell what the controlling variables are.  Running a kernel build saturates the CPUs, but does not fail.  Running render jobs uses about 80% CPU, and fails.  Perhaps I/O utilization is a controlling variable.

The mce errors do not seem to be load related.  The coolers on the CPUs are Arctic Freezer 240s, and I have monitored the CPU temperatures with IPMI while running the high load tests.  The top temp was 50 deg C.  The IPMI indicates limits of 100 deg C, so not even close.

Once, while trying to configure the X server, not even under a load, I left up a hung

startx while I chased down log errors.  While it was just sitting there, I got a mce

hw error.

I would like to add that the advertisement I used when buying the CPUs,

AMD PS7551BDAFWOF Epyc Model 7551 32C 3.0G 64MB 180W 2666MHZ

suggested that the CPUs would run at 2666, while the actual rates are about 2000.  This is what you see from the BIOS and /proc/cpuinfo, but it is hard to tell for sure what is really being used, since it varies over time.  This is a little disappointing, since some steps are serialized, and long jobs perform badly at low CPU clock rates.
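
For what it is worth, the EPYC 7551 is specified with a 2.0 GHz base clock and up to 3.0 GHz boost; the "2666MHZ" in the listing most likely refers to the supported DDR4 memory speed rather than a core clock. To get a quick picture of what the cores are doing at a given moment, the per-CPU "cpu MHz" lines of /proc/cpuinfo can be summarized. A minimal sketch, assuming an x86 /proc/cpuinfo layout; mhz_summary is a hypothetical helper name:

```shell
#!/bin/sh
# mhz_summary: reads /proc/cpuinfo-style text on stdin and prints the
# number of CPUs seen plus the min, max and average of "cpu MHz".
# (Hypothetical helper; prints nothing if no "cpu MHz" lines exist.)
mhz_summary() {
    awk -F: '/^cpu MHz/ {
                 v = $2 + 0
                 if (n == 0 || v < min) min = v
                 if (v > max) max = v
                 sum += v
                 n++
             }
             END {
                 if (n) printf "cpus=%d min=%.0f max=%.0f avg=%.0f\n", n, min, max, sum / n
             }'
}

# Snapshot of the current clocks; run it repeatedly under load to see
# how much the values move.
if [ -r /proc/cpuinfo ]; then
    mhz_summary < /proc/cpuinfo
fi
```

A single snapshot only shows one instant, so sampling it every second or so during a render gives a better idea of whether boost is being reached.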

> Now here is what I did to workaround for now:

> First get at least an kernel-4.15-rc6 ( this has fixed edac for epyc )

> Be sure you have EDAC turned on on kernel config.

I am currently trying to cobble together a 4.15.x kernel, and I will definitely try this.

Thank you for the suggestions.  More to follow (probably).

gg

abucodonosor
Adept III

Re: epyc 7551 spontaneously resets after 10mins rendering

Not sure what distro you are running; however, to build the kernel, simply use the distro config.

Download the kernel tarball you like, unpack it, cd into it, then:

cp the_location_of_distro_config ( see /boot and or /proc ) .config

make oldconfig

make prepare

make -j128 V=1

sudo make modules_install

sudo cp System.map /boot/System.map-$kerneluname ( take from DEPMOD the exact version )
sudo cp arch/x86/boot/bzImage /boot/vmlinuz-$kerneluname

for dracut initramfs generator just run :

sudo dracut -f --kver $kerneluname

and finally re-create grub.cfg by doing :

sudo grub-mkconfig -o /boot/grub/grub.cfg

I have now tested offlining CPU24 ( the one with the bit errors ). The errors seem to have stopped, but in the IPMI web interface event.log I still see the same 'Disk18 SMART error', which just cannot be true, since there is no way to even have that many disks here without using the 2 NVME ports .. so that is something very strange.

Maybe we are very unlucky or there is something else going on ?
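
For anyone wanting to repeat the offlining test, Linux exposes CPU hotplug through sysfs. A minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/online control file and root privileges; set_cpu_online is a made-up helper name, and the optional third argument exists only so the function can be pointed at a test directory:

```shell
#!/bin/sh
# set_cpu_online CPU STATE [SYSFS_ROOT]
# Writes STATE (0 = offline, 1 = online) to the CPU's hotplug control
# file. SYSFS_ROOT defaults to /sys. (Hypothetical helper.)
set_cpu_online() {
    cpu=$1
    state=$2
    root=${3:-/sys}
    node="$root/devices/system/cpu/cpu$cpu/online"
    if [ ! -w "$node" ]; then
        echo "cannot write $node (not root, or this CPU has no online file)" >&2
        return 1
    fi
    echo "$state" > "$node"
}

# Example, as root: take CPU 24 offline, and bring it back later.
#   set_cpu_online 24 0
#   set_cpu_online 24 1
```

The change does not survive a reboot, which makes it convenient for a quick experiment.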

BTW, if you try offlining on 4.14.x and 4.15-rc6, there is currently a bug in the blk-mq code. Before building, open block/blk-mq.c at line 1209: there is a buggy WARN_ON(), at least for EPYC CPUs. Just comment it out.

Also, my test setup is still running with the workarounds I've posted: >20 full libreoffice builds ( including checks , dicts , langs etc. ) and nearly 54 kernel builds ( allmodconfig ). Both are built as distro packages, so I generate a lot more I/O by removing / installing in chroots etc.

abucodonosor
Adept III

Re: epyc 7551 spontaneously resets after 10mins rendering

I don't think it is a cooling problem: my CPUs are at +/- 37 deg C and the system temp is +/- 45 deg C.

The only temp that is higher is the MB_10G Temp, which is sometimes over 63 deg C; everything else is far from hitting >50 deg C.

Btw, do you have any errors in your event.log ( IPMI ) and in the SMBIOS log in the BIOS ?

It seems I hit 0x90 = Unknown CPU ?! in the SMBIOS log...

goodguy
Adept II

Re: epyc 7551 spontaneously resets after 10mins rendering

I did upgrade the BIOS to the latest 1.0a and Yes, I am getting very similar errors.

IPMI log errors are:

EID  Time Stamp           Sensor Name  Sensor Type               Description
7    2017/12/24 18:47:05  OEM          HDD                       Disk17 SMART failure - Assertion
8    2017/12/24 20:24:04  OEM          HDD                       Disk17 SMART failure - Assertion
9    2017/12/27 18:49:17  OEM          HDD                       Disk17 SMART failure - Assertion
10   2017/12/28 16:41:05               System Firmware Progress  System Firmware Error (POST Error) - Assertion
11   2017/12/28 16:42:50               System Firmware Progress  System Firmware Error (POST Error) - Assertion
12   2017/12/29 21:27:00  OEM          HDD                       Disk17 SMART failure - Assertion
13   2018/01/01 20:50:53  OEM          HDD                       Disk17 SMART failure - Assertion

This data was from the IPMI log, but if you look at the bios SMBIOS log,

this is apparently at the same time (12/29 and 01/01) but as reported from the SMBIOS event log:

12/29/17 21:22:02 SMBIOS 0x90 N/A   Description: unspecified processor / unrecognized

01/01/18 20:50:54 (the same, but only 2 of these over days)

It looks like it may be misreporting MCE hardware errors as SMART Disk errors.

There are other errors that seem to be incorrect as:

[ 0.789694] pnp: PnP ACPI: found 6 devices
[0.799435] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[0.799689] pci 0000:01:00.0: BAR 7: no space for [mem size 0x00100000 64bit]
[0.799843] pci 0000:01:00.0: BAR 7: failed to assign [mem size 0x00100000 64bit]
[0.800061] pci 0000:01:00.0: BAR 10: no space for [mem size 0x00100000 64bit]
[0.800264] pci 0000:01:00.0: BAR 10: failed to assign [mem size 0x00100000 64bit]
[0.800478] pci 0000:01:00.1: BAR 7: no space for [mem size 0x00100000 64bit]
[0.800676] pci 0000:01:00.1: BAR 7: failed to assign [mem size 0x00100000 64bit]
[0.800946] pci 0000:01:00.1: BAR 10: no space for [mem size 0x00100000 64bit]
[0.801228] pci 0000:01:00.1: BAR 10: failed to assign [mem size 0x00100000 64bit]
[0.801500] pci 0000:00:01.1: PCI bridge to [bus 01]
[0.801683] pci 0000:00:01.1:   bridge window [mem 0xef400000-0xef9fffff]
[0.801885] pci 0000:00:07.1: PCI bridge to [bus 02]
[0.802078] pci 0000:00:07.1:   bridge window [mem 0xefa00000-0xefcfffff]
[0.802277] pci 0000:00:08.1: PCI bridge to [bus 03]

and

[   15.079092] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.

gg

abucodonosor
Adept III

Re: epyc 7551 spontaneously resets after 10mins rendering

Nearly the same here .. I don't have the POST errors, and Disk17 is Disk18 here .. also the same 0x90 in the BIOS log.

Also the same kernel warnings ..

If you look closer, the kernel re-assigns the BARs later on, but..

I'll play with the PCIe Link Options in the BIOS later..

Do you have anything that could be 'Disk17' ?:) I don't know what kind of error that is, since here it is very clear .. I don't have 18 disks.. Something seems very confused in some firmware..

@AMD guys ping ?:) Any ideas what this may be ?

jesse_amd
Staff

Re: epyc 7551 spontaneously resets after 10mins rendering

Hello abucodonosor and goodguy,

Thank you for submitting your issue to the community and helping us understand what is going on. I am currently working on getting you a solution. Please stay tuned.

abucodonosor
Adept III

Re: epyc 7551 spontaneously resets after 10mins rendering

@jesse_amd

Thx for looking into this. If you need any info, please let me know ..

I can test any kind of patches , firmware updates and so on..

@goodguy

Can you test the following:

From IPMI, Power Cycle the box.

Go into your BIOS and set 'PCIe Link Training Type' from 1 to 2 , save and reboot.

Run your 'kill_the_box_setup' and see whether the mce still occurs. If yes, add the following to the kernel command line:

isolcpus=40

Also, can you tell me what your disk layout looks like ( used and unused disks ) ?

linux_monkey
Staff
Staff

Re: epyc 7551 spontaneously resets after 10mins rendering

@goodguy, the most common cause I have seen of rogue reboots is DDR4 DIMMs that are not approved by the PCB manufacturer. If the specific DIMM you are using (Micron 18ASF2G72PDZ-2G6H1R from your dmidecode output) has not been tested and approved by SuperMicro, that is the first change I would suggest.

Beyond that, prudent debug steps to narrow the problem would be the following:

1) Set the memory clock speed to 2400 MHz in the System Settings

2) Disable SMT in System Settings (note this will reduce the number of CPUs so you may have to fix any scripts assigning affinity or a thread count)

3) Boot your system with the added kernel command line parameter processor_idle.max_cstate=0

I am not suggesting these as work-arounds, only as steps to narrow down the problem scope.
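
To check what a C-state restriction actually did, the idle states the kernel exposes can be listed from sysfs before and after the change. A small sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle; list_idle_states is a made-up helper name, and the optional argument exists only so the function can be pointed at a test directory:

```shell
#!/bin/sh
# list_idle_states [SYSFS_ROOT]
# Prints one line per cpuidle state of cpu0: "<dir> <name> disable=<flag>".
# SYSFS_ROOT defaults to /sys. (Hypothetical helper; prints nothing if
# the kernel exposes no cpuidle states.)
list_idle_states() {
    root=${1:-/sys}
    for d in "$root"/devices/system/cpu/cpu0/cpuidle/state*; do
        [ -r "$d/name" ] || continue
        printf '%s %s disable=%s\n' \
            "${d##*/}" "$(cat "$d/name")" "$(cat "$d/disable" 2>/dev/null)"
    done
}

# Show the idle states of the running system.
list_idle_states
```

Note that the kernel's parameter documentation spells the ACPI option processor.max_cstate, so it may be worth confirming in /proc/cmdline and in this listing that the setting took effect.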

Note the MCE22 error applies to the Infinity Fabric interconnect between the two sockets. It is not likely related to or the cause of a rogue reboot. Occasional corrected errors do occur on this interface, however the error itself is not a concern and is not typically a canary of other system problems. That said, if any of the above changes increases or decreases the frequency of the error, that would be interesting to know.

Regards,


Lewis