alibek

[SOLVED] Uhhuh. NMI received for unknown reason

Discussion created by alibek on Nov 14, 2018

Error:

[Tue Nov 13 14:35:35 2018] Uhhuh. NMI received for unknown reason 21 on CPU 84.

[Tue Nov 13 14:35:35 2018] Do you have a strange power saving mode enabled?

[Tue Nov 13 14:35:35 2018] Dazed and confused, but trying to continue

 

Hardware:

CPU: 2x AMD EPYC 7601 on Supermicro H11DST-B (Version: 1.01) 2123BT-HNC0R
with BIOS Version: 1.1a (Release Date: 10/04/2018)

RAM: 2 TiB, 2666 MHz

 

Software:

OS: Linux 4.15.18-7-pve #1 SMP PVE 4.15.18-27 (Wed, 10 Oct 2018 10:50:11 +0200) x86_64 GNU/Linux (Debian GNU/Linux 9.5 (stretch))

 

How to reproduce:

# apt install linux-tools-4.15

# dpkg -S $(which perf)

linux-base: /usr/bin/perf

# dmesg -T | tail -f

Then, in another console, run:

# perf top
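To see which CPUs report the message and how often, the kernel log can be summarized per CPU. This is only a rough sketch: the function name `summarize_unknown_nmi` is made up for illustration, and it assumes the message format shown in the error above.

```shell
# Summarize "unknown NMI" kernel messages per CPU from log text on stdin.
# Prints lines of the form "<count> CPU <n>", highest count first.
# (summarize_unknown_nmi is a hypothetical helper, not an existing tool.)
summarize_unknown_nmi() {
    grep 'NMI received for unknown reason' \
        | sed -E 's/.* on CPU ([0-9]+)\..*/\1/' \
        | sort -n | uniq -c \
        | awk '{print $1, "CPU", $2}' \
        | sort -rn
}

# Typical use on the affected host (may need root):
#   dmesg -T | summarize_unknown_nmi
```

If the counts are spread over many CPUs rather than concentrated on one, that would point more toward a platform-wide cause than a single faulty core.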

 

I tried disabling the NMI watchdog:

# cat /etc/modprobe.d/nmi-watchdog-blacklist.conf

blacklist iTCO_wdt

blacklist iTCO_vendor_support

 

# grep 'Command line' /var/log/kern.log

Nov 14 19:13:19 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off

Nov 14 19:30:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off nmi_watchdog=0

Nov 14 19:56:51 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off

Nov 14 20:38:15 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait

Nov 14 21:10:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait
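A quick way to confirm that a boot parameter actually took effect is to look at the running kernel's command line instead of the log. The sketch below assumes a hypothetical helper name, `has_kernel_param`; it reads `/proc/cmdline` by default.

```shell
# Return success if kernel parameter $1 appears as a whole word in the
# command-line file $2 (default: /proc/cmdline).
# (has_kernel_param is a hypothetical helper name used for illustration.)
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Typical use on the affected host:
#   has_kernel_param nmi_watchdog=0 && echo "nmi_watchdog=0 is set"
#   cat /proc/sys/kernel/nmi_watchdog    # 0 means the NMI watchdog is off
```

Since perf also uses NMIs for sampling, turning off the watchdog alone would not necessarily stop NMIs from being generated while `perf top` runs.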

 

I tried changing the CPU frequency governor (ondemand to performance):

# for c in {0..127}; do cpufreq-set -g performance -c $c; done
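To check that the governor change reached every CPU, the cpufreq entries in sysfs can be counted per governor. This is a minimal sketch; `governor_summary` is a hypothetical helper name, and it assumes the cpufreq driver exposes `scaling_governor` under the usual sysfs path.

```shell
# Count CPUs per active cpufreq governor, printed as "<governor>: <count>".
# $1: sysfs CPU directory (default: /sys/devices/system/cpu).
# (governor_summary is a hypothetical helper name used for illustration.)
governor_summary() {
    base=${1:-/sys/devices/system/cpu}
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null \
        | sort | uniq -c | awk '{print $2 ": " $1}'
}

# Typical use on the affected host:
#   governor_summary
```

A single line naming `performance` with the full CPU count would indicate the loop above applied everywhere.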

 

But the error is still present (on all 4 nodes of the server platform).

 

Solution:

I disabled C-states in the BIOS and the error is gone.
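The idle states the kernel still sees after the BIOS change can be listed from the cpuidle sysfs directory. This is a sketch only: `list_cstates` is a hypothetical helper name, and the exact set of states reported on this platform was not recorded here.

```shell
# List idle states for one CPU as "<state dir>: <name> (disable=<0|1>)".
# $1: cpuidle directory (default: /sys/devices/system/cpu/cpu0/cpuidle).
# (list_cstates is a hypothetical helper name used for illustration.)
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state[0-9]*; do
        [ -r "$s/name" ] || continue
        printf '%s: %s (disable=%s)\n' \
            "$(basename "$s")" "$(cat "$s/name")" "$(cat "$s/disable" 2>/dev/null)"
    done
}

# Typical use on the affected host:
#   list_cstates
```

Each state directory also has a writable `disable` file, so deeper states could in principle be turned off at runtime as root for testing, but that approach was not tried here and is only an assumption.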

 

Note:

I think the Linux kernel needs to add support for the new NMI source on the EPYC SoC.
