goodguy
Adept II

epyc 7551 spontaneously resets after 10mins rendering

I just finished building a system with dual EPYC 7551 CPUs on the Supermicro H11DSi-NT motherboard.

The build went very well, the system is running just fine, and the performance is extraordinary.

I am using Fedora 27 Linux, but I have access to about 20 different Linux distros because I maintain cinelerra-5.1 and need to build on these distros to post deliverables periodically.

The build went very well, and the system is running just fine, but...

It actually takes a little effort to cook up a way to load it to capacity.

I can run a full Linux kernel build from Linus Torvalds' git repo in about 11 minutes, no problems. Using make -j200 saturates the machine for over 10 minutes. Very nice.

However, if you start 50 background render clients and run a batch DVD render using the render farm, the machine nearly always spontaneously resets after about 10 minutes (no warning or log messages, just as if the reset button was pushed). The motherboard is equipped with IPMI, which allows you to monitor "server health" (thermal sensors, voltages, fans). None of the measured parameters is even close to its limits. Everything looks just fine, but the reset is highly reproducible.
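One way to put a number on "nothing is close to its limits" is to compare each IPMI reading against its critical thresholds. The following is a minimal sketch, assuming the pipe-separated, ten-column layout that `ipmitool sensor` typically prints (name, reading, unit, status, then six thresholds). The sample text and its values are invented for illustration and are not taken from this system.

```python
# Hypothetical sample in the layout `ipmitool sensor` typically prints.
# Columns: name | reading | unit | status | LNR | LCR | LNC | UNC | UCR | UNR
SAMPLE = """\
CPU1 Temp        | 45.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 95.000    | 100.000   | 100.000
12V              | 12.060     | Volts      | ok    | 10.170    | 10.290    | 10.770    | 12.930    | 13.290    | 13.410
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
"""


def _to_float(text):
    """Convert a numeric field; 'na' and discrete values like '0x0' become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sensor_line(line):
    """Parse one sensor line into a dict, or return None if the layout differs."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 10:
        return None
    return {
        "name": fields[0],
        "reading": _to_float(fields[1]),
        "unit": fields[2],
        "status": fields[3],
        "lower_critical": _to_float(fields[5]),
        "upper_critical": _to_float(fields[8]),
    }


def margins(sensor):
    """Return (distance above lower critical, distance below upper critical)."""
    reading = sensor["reading"]
    low, high = sensor["lower_critical"], sensor["upper_critical"]
    above_low = None if reading is None or low is None else reading - low
    below_high = None if reading is None or high is None else high - reading
    return above_low, below_high


def report(text):
    """Print the margins for every analog sensor found in the text."""
    for line in text.splitlines():
        sensor = parse_sensor_line(line)
        if sensor and sensor["reading"] is not None:
            print(sensor["name"], sensor["unit"], margins(sensor))
```

Calling `report(SAMPLE)` prints the margins for the two analog sensors and skips the discrete one. Sampled readings like these can miss fast transients, so healthy margins do not rule out a brief power event.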

This job does not saturate the machine. It runs at about 85% utilization, probably due to I/O delays created by 50 clients accessing media files. The reset is conspicuous because the kernel panic code normally outputs all kinds of logging and tries to resuscitate the machine in a pretty vigorous way. This does not happen. It is as if the reset button was pushed.

Can a HT sync/reset packet do this?

If anyone in silicon validation would like to try this, I will be glad to help set up a test case. This is sort of tricky to set up.

I am a skilled Linux developer, and I can set up a kdb session to trap the reset, but I suspect it is vectoring to the BIOS reset, not the kernel, so this may not be of any help. I am open to suggestions.

gg

PS: attached: bill_of_materials, dmidecode, lspci

1 Solution
Anonymous
Not applicable

AMD has identified an issue with the Linux cpuidle subsystem whereby a system using a newer kernel (4.13 or newer) with SMT enabled (BIOS default) and global C-state control enabled (also BIOS default) may exhibit an unexpected reboot. The likelihood of this reboot is correlated with the frequency of idle events in the system. AMD has released updated system firmware to address this issue. Please contact your system provider for a status on this updated system firmware. Prior to the availability of this updated system firmware, you can work around the issue with the following option:

Boot the kernel with the added command line option idle=nomwait

Thank you goodguy and abucodonosor for providing us with the workload that allowed us to replicate the issue you were experiencing. Also, I would like to recognize koralle for understanding how to implement a workaround in the meantime, independent of our findings and recommendations. 
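To confirm that a machine falls in the affected range and that the workaround is actually active after a reboot, the kernel release and boot command line can be inspected. The following is a minimal sketch, assuming a Linux system where `/proc/cmdline` is readable; the function names are made up for this example and are not part of any AMD or kernel tooling.

```python
import platform
import re


def has_boot_option(cmdline, option="idle=nomwait"):
    """Return True if `option` appears as a whole token in a kernel command line."""
    return option in cmdline.split()


def kernel_at_least(release, major=4, minor=13):
    """Return True if a release string such as '4.15.9-1-default' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def check_running_system(cmdline_path="/proc/cmdline"):
    """Summarize the running system; only meaningful on Linux."""
    with open(cmdline_path) as handle:
        cmdline = handle.read()
    release = platform.release()
    return {
        "release": release,
        "kernel_4_13_or_newer": kernel_at_least(release),
        "idle_nomwait_set": has_boot_option(cmdline),
    }
```

On the affected machine, `print(check_running_system())` after rebooting shows whether the option took effect. How the option gets onto the command line depends on the distribution's boot loader configuration, which this sketch does not touch.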

56 Replies
Anonymous
Not applicable

Hi abucodonosor,

I wanted to let you know that we are continuing to attempt to replicate your issue. We have tried multiple kernel compiles and FIO concurrently, different BIOS versions, OS distros, and kernels, and we have been unable to reproduce the reported problem.

@jesse_amd

I don't really think the distro matters. I think it is somehow related to the motherboard/kernel/BIOS combo (and maybe CPU models?) and timing of some sort.

Maybe Supermicro is somehow doing something wrong or unusual with the BIOS, since for both me and goodguy the issue is fixed with BIOS 1.0b *and* 'Global C-state Control' disabled, or with SMT turned OFF in the BIOS.

Since I can reproduce it reliably, let me know if I can do something to help debug the issue.

Attached is the set of steps that triggers the reset about 10 minutes after the procedure is started.

It is not as bad to set up as it looks, and many steps only have to be done once.

The procedure was just tested by a rookie on Fedora 27. (attached)

Anonymous
Not applicable

Thank you for providing this goodguy. I will have our team attempt to replicate the issue to determine what is going on.

Is there any progress on this yet? I am seeing similar log messages (MC22_Status) on my Supermicro H11DSi-NT + 2x EPYC 7281.

Hello,

I also have these "system resets" on a Supermicro H11DSU-iN board with 2x AMD EPYC 7601 CPUs and 16x 32 GB LRDIMM 2666 MHz Micron 36ASF4G72LZ-2G6D1. The setup has the latest BIOS v1.1 and IPMI v1.28 with the default BIOS settings (Global C-State=AUTO, Memory Interleaving=AUTO, SMT=AUTO, Core Perf. Boost=AUTO).

The Supermicro distributor can't reproduce the problem, or rather isn't willing to adjust their test setup. They have seen serious incompatibility problems with NVMe SSDs of certain brands, but in my case I have hard disks in use.

I have reproduced the problem on 2 distributions (Ubuntu 16.04 LTS with the normal and HWE kernels, and openSUSE 42.3 with kernel 4.4.114) and on 2 such systems, using a combination of stress-ng and phoronix-test-suite tests that produces a high CPU load and a distinct I/O load, similar to the test case mentioned above.

For now I have changed to kernel 4.15.9 (openSUSE Leap 42.3), but there I have reproduced 2 total I/O stalls: all processes are blocked on hard disk access.

I have now tried goodguy's recommendation to disable Global C-State. So far, after 12 hours, the system is stable and has better I/O performance, but with the disadvantage of higher CPU temperatures (+10-15 °C) and higher energy consumption when idle. Before, the average I/O wait was 15%; now it is under 5%.

@jesse_amd or other AMD people: Do you have more information on why AMD EPYC systems are so unstable under certain (but not atypical) workloads?

Is it only related to Supermicro motherboards, or is there already experience with Dell (PowerEdge R7425) or HPE (ProLiant DL385 Gen10) AMD EPYC systems?

Thank you for your answers.

Best, RK

I've found that my tests run stably (24h+) if I change only one setting of the acpi_idle driver, using the kernel parameter "idle=nomwait", which prohibits MWAIT from triggering the C1 state. The BIOS keeps its default settings for the other relevant parameters: Global C-State=AUTO, SMT=AUTO. Now only the HLT instruction is used to enter the C1 state.

The advantage of this intermediate workaround (which includes the mentioned disabling of the C1 state) is that the boost frequency range can still be reached, because the C2 state required on a few cores is not disabled.

The usage of MWAIT/MWAITX for AMD was introduced into the vanilla kernel in June 2017 (x86/ACPI/cstate: Allow ACPI C1 FFH MWAIT use on AMD systems · torvalds/linux@5209654 · GitHub).

Maybe distribution kernels integrated this patch later and had no problems before activating it. Maybe a BIOS/AGESA update also enabled this feature and allowed Linux to use it.

You can check the acpi_idle settings with "cpupower idle-info".

As abucodonosor has suggested, it would be good to check this acpi_idle driver and also the C-state handling of the AMD EPYC.

Best, RK
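Much of what "cpupower idle-info" reports can also be read directly from the cpuidle sysfs tree, which is convenient for logging before and after changing the boot option. The following is a minimal sketch, assuming the standard layout under `/sys/devices/system/cpu`; the function name and the `sysfs_root` parameter are made up for this example, and the exact strings in each file depend on the kernel and firmware.

```python
from pathlib import Path


def read_cpuidle(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Collect the cpuidle driver name and per-state details for one CPU."""
    root = Path(sysfs_root)

    def read(path):
        # Missing or unreadable files are reported as None instead of failing.
        try:
            return path.read_text().strip()
        except OSError:
            return None

    state_dirs = sorted(
        (root / f"cpu{cpu}" / "cpuidle").glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    return {
        "driver": read(root / "cpuidle" / "current_driver"),
        "states": [
            {
                "name": read(d / "name"),        # e.g. POLL, C1, C2
                "desc": read(d / "desc"),        # typically describes the entry method
                "usage": read(d / "usage"),      # number of times the state was entered
                "time_us": read(d / "time"),     # total residency in microseconds
            }
            for d in state_dirs
        ],
    }
```

Comparing the `desc` field of the C1 entry with and without idle=nomwait should show whether C1 is entered through MWAIT or HLT, assuming the driver reports the entry method there.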

koralle

Oh, that makes sense. I'm going to revert this patch locally and do some testing.

@AMD Team: can someone ask Yazen Ghannam <yazen.ghannam@amd.com> to have a look, since he's the patch author?

I've been running my tests with this setting for weeks without problems. I have created a bug report here: http://bugzilla.suse.com/show_bug.cgi?id=1087490 because Borislav Petkov was also involved in this case.

But I've seen that on transitions from a completely idle state to heavy load, the C6 core state (acpi_idle sums all deeper states under C2) is responsible for crashes. C6 can be disabled via an MSR, but the crash happens only rarely.

This C6 state is important for reaching the boost frequency range and for reducing power consumption.

koralle

You may want to point Borislav Petkov to this thread as well.

I am now running a stress test with that patch reverted from 4.16.1.

Looks good so far, but I'll let it run for at least 24h.

abucodonosor: Great that you are testing again!

What is your test/load profile?

Since this week we have been running our "normal" scientific experiments with nearly 100% user time.

My testing load profile averaged 60% user time, 10% system time, and 10% iowait, so C1 is used often.
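For comparing load profiles between testers, the user/system/iowait split can be computed from two samples of the aggregate "cpu" line in /proc/stat. The following is a minimal sketch, assuming the usual field order documented for /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for this example.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def load_profile(before, after):
    """Return each field's share of elapsed CPU time, in percent."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("no elapsed CPU time between the two samples")
    return {key: 100.0 * value / total for key, value in delta.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

Taking one sample with `parse_cpu_line(read_cpu_line())`, waiting a minute, taking a second, and passing both to `load_profile` gives averages over that interval. A high iowait share alongside moderate user time suggests cores enter idle states often, which is the condition described above.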

koralle

I use a 'self-made test': a combination of compiling, dd, compressing, deleting, and moving files around.

The system has a constant 55%-58% user time, 9%-11% sys, and 8.5%-12% iowait.

With the patch reverted (which is the same as using the command-line option) and the default Global C-State setting in the BIOS, I ran the test for 35h without issues.

Anonymous
Not applicable

AMD has identified an issue with the Linux cpuidle subsystem whereby a system using a newer kernel (4.13 or newer) with SMT enabled (BIOS default) and global C-state control enabled (also BIOS default) may exhibit an unexpected reboot. The likelihood of this reboot is correlated with the frequency of idle events in the system. AMD has released updated system firmware to address this issue. Please contact your system provider for a status on this updated system firmware. Prior to the availability of this updated system firmware, you can work around the issue with the following option:

Boot the kernel with the added command line option idle=nomwait

Thank you goodguy and abucodonosor for providing us with the workload that allowed us to replicate the issue you were experiencing. Also, I would like to recognize koralle for understanding how to implement a workaround in the meantime, independent of our findings and recommendations. 

Good news! But I must say that disabling C6 is also necessary to have a "completely" stable EPYC system.

Is this issue also addressed by the upcoming microcode update?

If you want to reproduce this problem, try extreme load transitions, e.g. from a totally idle system to heavy load.

In my case I have seen that you need a long-running job of several days; the crash occurs after the system goes idle and then switches to heavy load again.

I have tried to reproduce it with shorter periods (3 hours load, 15 min break, 3 hours load ...), but the crash didn't occur that way.

Best,

RK

Can you provide more details on your assertion that C6 must also be disabled for stability? What platform are you using, what version of platform firmware, which EPYC SKU?

There are some scenarios where disabling CC6 (ACPI OS C2 idle) may be beneficial; however, we are not aware of any stability issues with EPYC and the CC6 idle state.

You have pointed out what is typically a worst-case scenario for the platform with any processor. Transitioning from most/all cores in CC6 (core power down) to full speed (P0/C0) all at the same time will generate a fairly large dI/dt that the system power supply must react to without violating the DC specifications for the EPYC processor. This is also one specific area where the platform vendors invest a lot of focus (because it is a worst case). If a commercially available platform appears to be violating the EPYC electrical specs, we definitely want to know about it.

I've tested on a Supermicro SuperServer AS-2023US-TR4 (H11DSU-iN motherboard, 2x EPYC 7601 CPUs) with the most recent available BIOS version, 1.1.

I found both the problem and the stabilization with C6 disabled only experimentally. The interesting thing is that I can only replicate the crash during extreme load changes after a long-running job.

Thank you to everyone who worked on this issue and for the updated system firmware to fix it. Today I updated the firmware to MBD-H11DSI-NT-B, as provided by Supermicro for my board with the 2 EPYC 7551 chips. Then I ran the test as I had outlined previously, which without any mods would crash, usually within 9 seconds to 23 minutes. I successfully ran this same test with the new firmware for about 5 1/2 hours with no failure and feel safe in saying it is corrected.

(In reply to jesse_amd's post of Friday, April 13, 2018, 9:24:25 AM MDT, which is quoted in full above.)