7 Replies Latest reply on Jan 2, 2018 10:00 PM by fermulator

    amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed

    fermulator

      Posted this in IRC as well (freenode, #amdgpu)

       

      intro MANDATORY information:

      ----

      • AMD Graphics Card
        • Hawaii XT AMD Radeon R9 290X
      • Desktop or Laptop System
        • desktop
      • Operating System
        • Ubuntu 16.04.3 LTS 64-bit - Linux fermmy 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
      • Driver version installed
        • see attached (17.30)
      • Display Devices
        • dual monitor, 1680x1050 each
      • Motherboard + Bios Revision
        • SABERTOOTH 990FX R2.0
      • CPU/APU
        • AMD FX(tm)-8350 Eight-Core Processor

       

      Description of problem:

      ---

      Yesterday I upgraded the amdgpu-pro Linux driver from v17.10 to v17.30 (followed How-To Install/Uninstall AMD Radeon™ Software AMDGPU-PRO Driver for Linux® on an Ubuntu System).

       

      I use two local user accounts. My PRIMARY one worked fine (I have been using it), but this morning I tried to log in to my other account and Xorg init fails with "radeon_setup_kernel_mem failed". Snippet of info: https://pastebin.com/f0SinVDR (see session ID 17873)

       

      (a "symptomatic" description of the problem - this procedure worked fine yesterday before driver upgrade)

      1. from active PRIMARY user account running gnome3, go "switch user"

      2. select the other SECONDARY user account, enter password, GO

      3. GDM flashes for a moment, then bails back to the login screen listing all the users

        (rinse and repeat) - analyzed logs and found the below

       

      attached also [amdgpu-pro_xorg_user_log.txt]

       

      SNIPPET:

      {{{

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Loading sub module "ramdac"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) LoadModule: "ramdac"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Module "ramdac" already built-in

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) UnloadModule: "modesetting"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Unloading modesetting

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) UnloadModule: "fbdev"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Unloading fbdev

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) UnloadSubModule: "fbdevhw"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Unloading fbdevhw

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) UnloadModule: "vesa"

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) Unloading vesa

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (--) Depth 24 pixmap format is 32 bpp

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) RADEON(0): [DRI2] Setup complete

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) RADEON(0): [DRI2]   DRI driver: radeonsi

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (II) RADEON(0): [DRI2]   VDPAU driver: radeonsi

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE) RADEON(0): failed to initialise surface manager

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE) RADEON(0): radeon_setup_kernel_mem failed

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE)

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: Fatal server error:

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE) AddScreen/ScreenInit failed for driver 0

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE)

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE)

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: Please consult the The X.Org Foundation support

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: #011 at http://wiki.x.org

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]:  for help.

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE) Please also check the log file at "/home/<SECONDARY_USER>/.local/share/xorg/Xorg.2.log" for additional information.

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE)

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: (EE) Server terminated with error (1). Closing log file.

      Sep 25 08:38:04 fermmy /usr/lib/gdm3/gdm-x-session[17873]: Unable to run X server

      }}}
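To cut a log like the one above down to just the failures, a small filter along these lines may help. This is only a sketch and not part of the original report: `xorg_errors` is a made-up helper name, and it assumes the X server's usual `(EE)` marker for error lines plus the "Fatal server error" banner seen in the snippet.

```shell
#!/bin/sh
# xorg_errors: hypothetical helper (not from the original post).
# Prints only lines marked (EE) or carrying the "Fatal server error" banner.
# Reads the file(s) given as arguments, or stdin when none are given.
xorg_errors() {
    grep -E '\(EE\)|Fatal server error' "$@"
}
```

For example, `xorg_errors /var/log/syslog` (the syslog path is an assumption for a stock Ubuntu setup), or pipe any log text into `xorg_errors`.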

       

      I then proceeded to initiate a call with AMD technical support (CANADA) and acquired a ticket number.

       

      Proceeding with the next steps of debug (which I wanted to do anyway just hadn't done yet)...

       

      ---

      Logged out of my PRIMARY account, and tried to log back in.

      --> FAIL (same symptoms/logs as above)

       

      So this is not related to the user accounts at all (which makes much more sense now).

       

      ---

      Rebooted

      --> FAIL - system stuck on a black screen. Dug into the syslog from that time and, indeed, same thing:

      {{{

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (II) RADEON(0): [DRI2] Setup complete

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (II) RADEON(0): [DRI2]   DRI driver: radeonsi

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (II) RADEON(0): [DRI2]   VDPAU driver: radeonsi

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (EE) RADEON(0): failed to initialise surface manager

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (EE) RADEON(0): radeon_setup_kernel_mem failed

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (EE)

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: Fatal server error:

      Sep 25 09:25:01 fermmy /usr/lib/gdm3/gdm-x-session[12260]: (EE) AddScreen/ScreenInit failed for driver 0

      }}}

       

      ---

      Then dropped into recovery mode, enabled networking + rw on the OS partition, and ran amdgpu-pro-uninstall, which removed all the bits. Rebooted; the system is now on the open source driver (but functional):

      {{{

      $ sudo lshw -C video | grep driver

             configuration: driver=radeon latency=0

      }}}
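Since this check gets repeated after every install/uninstall cycle, it can be scripted. A sketch only: `lshw_driver` is a hypothetical helper name, and it assumes the `configuration: driver=...` line format shown in the lshw output above.

```shell
# lshw_driver: hypothetical helper (not from the original post).
# Reads `lshw -C video` output on stdin and prints the kernel driver
# named in each "driver=" field, one per line.
lshw_driver() {
    sed -n 's/.*driver=\([^ ]*\).*/\1/p'
}
```

Usage would be `sudo lshw -C video | lshw_driver`; going by the output above, that should print `radeon` while the open source driver is active.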

      ---

       

      As per instructions from support, I will attempt to re-install the driver next and update.

       

      Message was edited by: Fermulator (updated with more debug/troubleshooting information)

        • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
          fermulator

          Next step is re-installation of the driver;

           

          again, following How-To Install/Uninstall AMD Radeon™ Software AMDGPU-PRO Driver for Linux® on an Ubuntu System

           

          confirming system is still fully updated (apt update; upgrade; dist-upgrade) - no diff from yesterday

           

          $ dpkg -l amdgpu-pro

          dpkg-query: no packages found matching amdgpu-pro

           

          also confirmed that the two user accounts are members of the "video" group by checking

          $ grep video /etc/group

          <PASS>
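Eyeballing the grep output works, but a per-user check is easy to script. A minimal sketch, assuming the standard colon-separated group(5) file format; `in_group` is a made-up helper name and not something the AMD instructions provide.

```shell
# in_group: hypothetical helper (not from the original post).
# Usage: in_group USER GROUP [GROUPFILE]
# Succeeds if USER is listed as a member of GROUP in a group(5)-format
# file (default /etc/group). It only sees members listed in the file,
# not a user whose primary group is GROUP.
in_group() {
    awk -F: -v u="$1" -v g="$2" '
        $1 == g {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++) if (members[i] == u) found = 1
        }
        END { exit !found }
    ' "${3:-/etc/group}"
}
```

For example, `in_group fermulator video && echo PASS`. Running `id -nG <user>` gives the same information, including the primary group.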

           

          Re-executing the installation script

          {{{

          fermulator@fermmy:/usr/local/src/amdgpu-pro/amdgpu-pro-17.30-465504$ sudo ./amdgpu-pro-install -y

           

          (snip)

           

          The following additional packages will be installed:

            amdgpu-pro-core clinfo-amdgpu-pro ids-amdgpu-pro libdrm-amdgpu-pro-amdgpu1 libdrm-amdgpu-pro-amdgpu1:i386 libdrm2-amdgpu-pro:i386 libdrm2-amdgpu-pro libegl1-amdgpu-pro

            libegl1-amdgpu-pro:i386 libgbm1-amdgpu-pro libgbm1-amdgpu-pro:i386 libgbm1-amdgpu-pro-base libgl1-amdgpu-pro-appprofiles libgl1-amdgpu-pro-dri libgl1-amdgpu-pro-dri:i386

            libgl1-amdgpu-pro-ext libgl1-amdgpu-pro-glx libgl1-amdgpu-pro-glx:i386 libgles2-amdgpu-pro libgles2-amdgpu-pro:i386 libopencl1-amdgpu-pro libopencl1-amdgpu-pro:i386

            libvdpau-amdgpu-pro libvdpau-amdgpu-pro:i386 libvdpau1:i386 mesa-vdpau-drivers:i386 opencl-amdgpu-pro-icd opencl-amdgpu-pro-icd:i386 vdpau-driver-all:i386 vulkan-amdgpu-pro

            vulkan-amdgpu-pro:i386 xserver-xorg-video-amdgpu-pro xserver-xorg-video-glamoregl-amdgpu-pro

          }}}

           

          it goes ahead and builds dkms etc. etc. - no visible errors/problems

          {{{

          Setting up amdgpu-pro-core (17.30-465504) ...

          Setting up amdgpu-pro-dkms (17.30-465504) ...

          Loading new amdgpu-pro-17.30-465504 DKMS files...

          First Installation: checking all kernels...

          Building only for 4.4.0-96-generic

          Building for architecture x86_64

          Building initial module for 4.4.0-96-generic

          Done.

          Forcing installation of amdgpu-pro

           

          amdgpu:

          Running module version sanity check.

          - Original module

          - Installation

             - Installing to /lib/modules/4.4.0-96-generic/updates/dkms/

           

          amdttm.ko:

          Running module version sanity check.

          - Original module

          - Installation

             - Installing to /lib/modules/4.4.0-96-generic/updates/dkms/

           

          amdkcl.ko:

          Running module version sanity check.

          - Original module

          - Installation

             - Installing to /lib/modules/4.4.0-96-generic/updates/dkms/

           

          amdkfd.ko:

          Running module version sanity check.

          - Original module

          - Installation

             - Installing to /lib/modules/4.4.0-96-generic/updates/dkms/

           

          depmod....

           

          Backing up initrd.img-4.4.0-96-generic to /boot/initrd.img-4.4.0-96-generic.old-dkms

          Making new initrd.img-4.4.0-96-generic

          (If next boot fails, revert to initrd.img-4.4.0-96-generic.old-dkms image)

          update-initramfs....

           

          DKMS: install completed.

          Setting up libdrm2-amdgpu-pro:amd64 (1:2.4.70-465504) ...

          Setting up libdrm2-amdgpu-pro:i386 (1:2.4.70-465504) ...

          Setting up ids-amdgpu-pro (1.0.0-465504) ...

          Setting up libdrm-amdgpu-pro-amdgpu1:amd64 (1:2.4.70-465504) ...

          Setting up libdrm-amdgpu-pro-amdgpu1:i386 (1:2.4.70-465504) ...

          Setting up libgbm1-amdgpu-pro-base (17.30-465504) ...

          Setting up libgbm1-amdgpu-pro:amd64 (17.30-465504) ...

          Setting up libgbm1-amdgpu-pro:i386 (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-appprofiles (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-glx:amd64 (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-glx:i386 (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-ext:amd64 (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-dri:amd64 (17.30-465504) ...

          Setting up libgl1-amdgpu-pro-dri:i386 (17.30-465504) ...

          Setting up libegl1-amdgpu-pro:amd64 (17.30-465504) ...

          Setting up libegl1-amdgpu-pro:i386 (17.30-465504) ...

          Setting up libgles2-amdgpu-pro:amd64 (17.30-465504) ...

          Setting up libgles2-amdgpu-pro:i386 (17.30-465504) ...

          Setting up libopencl1-amdgpu-pro:amd64 (17.30-465504) ...

          Setting up libopencl1-amdgpu-pro:i386 (17.30-465504) ...

          Setting up clinfo-amdgpu-pro (17.30-465504) ...

          Setting up opencl-amdgpu-pro-icd:amd64 (17.30-465504) ...

          Setting up opencl-amdgpu-pro-icd:i386 (17.30-465504) ...

          Setting up vulkan-amdgpu-pro:amd64 (17.30-465504) ...

          Setting up vulkan-amdgpu-pro:i386 (17.30-465504) ...

          Setting up libvdpau-amdgpu-pro:amd64 (1:17.0.1-465504) ...

          Setting up libvdpau1:i386 (1.1.1-3ubuntu1) ...

          Setting up libvdpau-amdgpu-pro:i386 (1:17.0.1-465504) ...

          Setting up xserver-xorg-video-glamoregl-amdgpu-pro:amd64 (1.19.0-465504) ...

          Setting up xserver-xorg-video-amdgpu-pro (1:1.3.99-465504) ...

          Setting up amdgpu-pro (17.30-465504) ...

          Setting up amdgpu-pro-lib32 (17.30-465504) ...

          Setting up mesa-vdpau-drivers:i386 (17.0.7-0ubuntu0.16.04.1) ...

          Setting up vdpau-driver-all:i386 (1.1.1-3ubuntu1) ...

          Processing triggers for initramfs-tools (0.122ubuntu8.8) ...

          update-initramfs: Generating /boot/initrd.img-4.4.0-96-generic

          Processing triggers for libc-bin (2.23-0ubuntu9) ...

          }}}

           

          rebooting...

           

          ---

           

          <UPDATE>

          SAME VERDICT: even after re-installing the latest v17.30 driver, the system fails in the same way. Had to remove it again and drop back to the open source radeon driver in order to post this update.

          • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
            fermulator

            I have passed this off to AMD technical support - hopefully there will be progress soon

            (for now continuing to use the opensource radeon driver so that I can at least work... obviously not a long-term solution though as the opensource driver is less-capable w.r.t. performance/gaming)

            • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
              fermulator

              Further research/questions...

               

              * AMDGPU-PRO Driver Compatibility Advisory with Ubuntu 16.04.2 and 16.04.3 - I presume this does not apply to me since I'm still on kernel 4.4

              * wondering if there was some important note/warning in 17.20 (does that version exist?) that I missed, since I jumped from 17.10 to 17.30. I only reviewed the 17.30 release notes and could not easily find archived release notes.

               

              --

              UGH:

              what is the difference between these two?

               

              Radeon Pro Software 17.Q3 Enterprise for Linux Release Notes

              -> links to  amdgpu-pro-17.30-458935.tar.xz

              &

              AMDGPU-PRO Driver for Linux Release Notes

              -> links to amdgpu-pro-17.30-465504.tar.xz

               

              I installed amdgpu-pro-17.30-465504 (which, going by the build number, is the "higher version") - but no idea why there are two versions.

              • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
                fermulator

                Further notes ... noticing the following RE Vulkan requirements:

                 

                To use Vulkan driver in this stack, Vulkan SDK Version 1.0.51.0 needs to be installed. The SDK can be downloaded from : https://vulkan.lunarg.com/sdk/home

                 

                 

                $ dpkg --list | grep vulkan
                ii  libvulkan-dev:amd64                                        1.0.42.0+dfsg1-1ubuntu1~16.04.1              amd64        Vulkan loader library -- development files
                ii  libvulkan1:amd64                                            1.0.42.0+dfsg1-1ubuntu1~16.04.1              amd64        Vulkan loader library
                ii  vulkan-utils                                                1.0.42.0+dfsg1-1ubuntu1~16.04.1              amd64        Miscellaneous Vulkan utilities

                 

                hmmm, this could be part of the problem,

                (of course, this is a snip/snap of the packages WITHOUT the amdgpu-pro driver installed, but it could be a source of conflict perhaps...)
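The mismatch above (1.0.42.0 installed vs. 1.0.51.0 required) can be checked mechanically. This is a rough sketch: `version_ge` is a hypothetical helper built on GNU sort's `-V` version ordering, and comparing the distro's loader package against the LunarG SDK version is itself an assumption, since they are not the same package.

```shell
# version_ge: hypothetical helper (not from the original post).
# Succeeds if version $1 >= version $2, using GNU sort -V ordering:
# after sorting the pair, the smaller (or equal) version comes first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed=1.0.42.0   # upstream part of the libvulkan1 version listed above
required=1.0.51.0    # SDK version quoted from the release notes above

version_ge "$installed" "$required" ||
    echo "Vulkan loader $installed is older than required $required"
```

On Debian/Ubuntu, `dpkg --compare-versions "$installed" ge "$required"` does the same comparison with full package-version semantics.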

                 

                NOTE: for a previous amdgpu-pro driver installation I had followed these instructions to install Vulkan support: How to Install LunarG Vulkan™ SDK for Ubuntu

                • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
                  fermulator

                  AMD replied a while ago with this:

                  {{{

                  We analyzed your issue and notice that we are missing your BIOS Version. If your BIOS is not up to date, it could cause different issues with newer drivers.

                  the latest BIOS for your SABERTOOTH 990FX R2.0 Motherboard is Version 2901 which you can find here:
                  https://www.asus.com/us/Motherboards/SABERTOOTH_990FX_R20/HelpDesk_BIOS/

                  }}}

                   

                  My BIOS _was_ at the previous version, but the 2901 version only lists "system stability" -- I have my doubts that this update will address the problem; nonetheless, retrying now.

                   

                  BIOS information after the update, to confirm the new version:

                  {{{

                  BIOS Information

                      Vendor: American Megatrends Inc.

                      Version: 2901

                      Release Date: 05/04/2016

                      Address: 0xF0000

                      Runtime Size: 64 kB

                      ROM Size: 8192 kB

                      Characteristics:

                          PCI is supported

                          BIOS is upgradeable

                          BIOS shadowing is allowed

                          Boot from CD is supported

                          Selectable boot is supported

                          BIOS ROM is socketed

                          EDD is supported

                          5.25"/1.2 MB floppy services are supported (int 13h)

                          3.5"/720 kB floppy services are supported (int 13h)

                          3.5"/2.88 MB floppy services are supported (int 13h)

                          Print screen service is supported (int 5h)

                          8042 keyboard services are supported (int 9h)

                          Serial services are supported (int 14h)

                          Printer services are supported (int 17h)

                          ACPI is supported

                          USB legacy is supported

                          BIOS boot specification is supported

                          Targeted content distribution is supported

                          UEFI is supported

                      BIOS Revision: 4.6

                  }}}

                  • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
                    fermulator

                    17.40 was released since this report, so I'm working with that going forward.

                     

                    Things I retested;

                    1. post BIOS update, installed amdgpu-pro-17.40-492261, rebooted
                      1. same issue (reboot into recovery mode, enable RW on /, drop to root shell)
                      2. uninstall driver (amdgpu-pro-uninstall)
                    2. did a full system update (only difference from above is now on kernel 4.4.0-98), still Ubuntu 16.04.3 LTS desktop, rebooted
                      1. broken, same issue, rinse and repeat
                      2. uninstall driver (...)

                     

                    Now back on the open source. E-mailing AMD to update the ticket.

                    • Re: amdgpu-pro 17.30 linux "radeon_setup_kernel_mem failed
                      fermulator

                      Well, my R9 290X died. Like a sucker, I bought a new RX570. (feel slightly sheepish buying another Radeon card after the last one failed ... but it was a Sapphire ... so I chalked it up to the mfg vendor of the board, not the chipset...) [EDIT] - to be clear though, I stuck with team red and purchased AMD again due to all the awesome Linux and FOSS contributions (e.g. Mantle)

                       

                      Anyway, after that, the system wouldn't boot at all (not even on the open source drivers) -- tried 17.50, failed to startx...

                       

                      Forced to upgrade to Ubuntu 17.04 to get the system working. Installed the 17.50 drivers successfully after that.

                      {{{

                      $ sudo lshw -C video

                        *-display               

                             description: VGA compatible controller

                             product: Ellesmere [Radeon RX 470/480] <-- heh; oops, doesn't report correctly yet

                             vendor: Advanced Micro Devices, Inc. [AMD/ATI]

                             physical id: 0

                             bus info: pci@0000:07:00.0

                             version: ef

                             width: 64 bits

                             clock: 33MHz

                             capabilities: pm pciexpress msi vga_controller bus_master cap_list rom

                             configuration: driver=amdgpu latency=0

                             resources: irq:63 memory:c0000000-cfffffff memory:d0000000-d01fffff ioport:c000(size=256) memory:fe600000-fe63ffff memory:c0000-dffff

                       

                      $ dpkg -s amdgpu-pro

                      Package: amdgpu-pro

                      Status: install ok installed

                      Priority: optional

                      Section: metapackages

                      Installed-Size: 7

                      Maintainer: Advanced Micro Devices (AMD) <slava.grigorev@amd.com>

                      Architecture: amd64

                      Source: amdgpu

                      Version: 17.50-511655

                      Depends: amdgpu (= 17.50-511655), amdgpu-pro-core (= 17.50-511655), libgl1-amdgpu-pro-glx (= 17.50-511655), libegl1-amdgpu-pro (= 17.50-511655), libgles2-amdgpu-pro (= 17.50-511655), libgl1-amdgpu-pro-ext (= 17.50-511655), libgl1-amdgpu-pro-dri (= 17.50-511655), libgl1-amdgpu-pro-appprofiles (= 17.50-511655), libgbm1-amdgpu-pro (= 17.50-511655), libgbm1-amdgpu-pro-base (= 17.50-511655)

                      Description: Meta package to install amdgpu Pro components.

                      }}}

                       

                      Therefore, since I no longer have an operational card with a system on an operating system version which exhibited the issue, my thread/bug report here is OBSOLETE.

                       

                      Nonetheless, I strongly suspect a real bug somewhere in the drivers with that kernel version on Ubuntu 16.04.3... I can no longer assist with analysis.