0 Replies Latest reply on Nov 16, 2018 9:21 AM by alibek

    [SOLVED] Uhhuh. NMI received for unknown reason

    alibek

      Error:

      [Tue Nov 13 14:35:35 2018] Uhhuh. NMI received for unknown reason 21 on CPU 84.

      [Tue Nov 13 14:35:35 2018] Do you have a strange power saving mode enabled?

      [Tue Nov 13 14:35:35 2018] Dazed and confused, but trying to continue

       

      Hardware:

      CPU 2x AMD EPYC 7601 on Supermicro H11DST-B (Version: 1.01) 2123BT-HNC0R
      with BIOS Version: 1.1a (Release Date: 10/04/2018)

      RAM 2TiB 2666 MHz

       

      Software:

      OS  Linux 4.15.18-7-pve #1 SMP PVE 4.15.18-27 (Wed, 10 Oct 2018 10:50:11 +0200) x86_64 GNU/Linux (Debian GNU/Linux 9.5 (stretch))

       

      How to reproduce:

      # apt install linux-tools-4.15

      # dpkg -S $(which perf)

      linux-base: /usr/bin/perf

      # dmesg -Tw

      Run in another console:

      # perf top
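
To see which CPUs report the message while perf top is running, the kernel log can be tallied per CPU. This is only a minimal sketch: the function name nmi_by_cpu is made up for illustration, and it assumes the message format shown in the Error section above.

```shell
# Hypothetical helper: read kernel log lines on stdin and count
# "NMI received for unknown reason" messages per CPU number.
nmi_by_cpu() {
    grep 'NMI received for unknown reason' \
        | sed -n 's/.* on CPU \([0-9]*\)\..*/\1/p' \
        | sort -n \
        | uniq -c \
        | awk '{print "CPU " $2 ": " $1}'
}
```

Usage would be something like `dmesg -T | nmi_by_cpu`.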

       

      I tried to disable the NMI watchdog:

      # cat /etc/modprobe.d/nmi-watchdog-blacklist.conf

      blacklist iTCO_wdt

      blacklist iTCO_vendor_support

       

      # grep 'Command line' /var/log/kern.log

      Nov 14 19:13:19 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off

      Nov 14 19:30:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet pcie_aspm=off nmi_watchdog=0

      Nov 14 19:56:51 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off

      Nov 14 20:38:15 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait

      Nov 14 21:10:49 host1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-7-pve root=UUID=dfa0b70c-f4e7-4fc3-85ef-e6ddbb288091 ro quiet nmi_watchdog=0 pcie_aspm=off idle=nomwait
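
Whether nmi_watchdog=0 actually took effect can be checked at runtime through /proc/sys/kernel/nmi_watchdog (0 means disabled, 1 means enabled). A small sketch; the function name nmi_watchdog_state is hypothetical, and the optional path argument exists only so the logic can be tried against a test file.

```shell
# Hypothetical helper: print the NMI watchdog state read from the
# sysctl file (default: /proc/sys/kernel/nmi_watchdog).
nmi_watchdog_state() {
    file="${1:-/proc/sys/kernel/nmi_watchdog}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}
```

Calling `nmi_watchdog_state` with no argument reads the live value on the running host.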

       

      I tried changing the governor (ondemand to performance):

      # for c in {0..127}; do cpufreq-set -g performance -c $c; done
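
To confirm the governor change was applied on every CPU, the cpufreq sysfs files can be summarized. A minimal sketch, assuming the standard layout under /sys/devices/system/cpu; the function name governor_summary is made up here, and the base directory is a parameter only so it can be pointed at a test tree.

```shell
# Hypothetical helper: print each scaling governor in use and how
# many CPUs are set to it, one "governor count" pair per line.
governor_summary() {
    base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}
```

If the loop above worked on all 128 logical CPUs, `governor_summary` should print a single line for performance.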

       

      But the error is still present (on all 4 nodes in the server platform).

       

      Solution:

      I disabled C-states in the BIOS and the error is gone.
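
The effect of the BIOS setting can be cross-checked from Linux by listing the idle states the kernel exposes through the cpuidle sysfs interface. This is a sketch only: the function name list_idle_states is hypothetical, and which states appear depends on the idle driver and firmware, so the exact output will differ between systems.

```shell
# Hypothetical helper: list the cpuidle states of one CPU with their
# names and per-state disable flags (default: cpu0).
list_idle_states() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for s in "$dir"/state[0-9]*; do
        # Skip the unexpanded glob if no states are exposed.
        [ -r "$s/name" ] || continue
        printf '%s %s disable=%s\n' \
            "${s##*/}" "$(cat "$s/name")" "$(cat "$s/disable")"
    done
}
```

Comparing the output of `list_idle_states` before and after the BIOS change should show whether the deeper states are still offered to the kernel.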

       

      Note:

      I think the Linux kernel needs to add support for the new NMI source of the EPYC SoC.