8 Replies Latest reply on Mar 16, 2018 10:09 AM by wdormann

    Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3

    wdormann

      Hi folks,

      I've been having a difficult time getting the drivers working (so that I can use OpenCL) on my Ubuntu 16.04.3 system with an RX Vega 64.  I'm using the 17.40 drivers.  For starters:

      The drivers do not work with a clean Ubuntu 16.04.3 install.  The reason:  Ubuntu 16.04.3 does not install with the Hardware Enablement Stack enabled by default.  If you want any hope of getting the drivers (even to just run X) working, you must run:

      sudo apt install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04

      If you don't do this, you'll get warnings about ABI mismatches.  And if you disable ABI version checking in your xorg.conf file, you'll just end up crashing X.  This should be very clearly documented in the install guide.  But better yet, the installer for amdgpu-pro should enforce that the system has HWE.  Otherwise, users are going to be pulling their hair out!
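A quick way to tell whether a machine is already on an HWE kernel is to look at the running kernel version. The following is a minimal sketch, assuming that the Ubuntu 16.04 GA kernel is the 4.4 series and that anything newer was installed through the HWE stack; the function name is hypothetical and the check says nothing about whether xserver-xorg-hwe-16.04 is installed.

```shell
# Sketch: return 0 if a kernel release string is newer than the 4.4 series
# (assumed here to be the Ubuntu 16.04 GA kernel), 1 if not, 2 if unparsable.
is_newer_than_ga_kernel() {
    release=$1                     # e.g. the output of `uname -r`
    major=${release%%.*}           # text before the first dot
    rest=${release#*.}             # text after the first dot
    minor=${rest%%[!0-9]*}         # leading digits of the remainder
    [ -n "$major" ] && [ -n "$minor" ] || return 2
    case "$major$minor" in
        *[!0-9]*) return 2 ;;      # non-numeric version component
    esac
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -gt 4 ]; }; then
        return 0
    fi
    return 1
}

# Report on the currently running kernel.
if is_newer_than_ga_kernel "$(uname -r)"; then
    echo "kernel $(uname -r): newer than the 4.4 series"
else
    echo "kernel $(uname -r): 4.4 series or older, or not recognized"
fi
```

If this reports a 4.4 kernel on 16.04.3, the apt command above is the step that is missing.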

       

      Now, after I did the above I was able to get X working, but clinfo simply crashes:

       

      $ env LLVM_BIN=/opt/amdgpu-pro/bin /opt/amdgpu-pro/bin/clinfo
      terminate called after throwing an instance of 'cl::Error'
        what():  clGetPlatformIDs
      Aborted (core dumped)

       

      I've rebooted and subsequently installed ROCm as the install guide recommends:

      sudo apt install -y rocm-amdgpu-pro

       

      I'm not sure that it's any help, but here's gdb info for the crash:

       

      $ gdb $LLVM_BIN/clinfo
      GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
      Copyright (C) 2016 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
      and "show warranty" for details.
      This GDB was configured as "x86_64-linux-gnu".
      Type "show configuration" for configuration details.
      For bug reporting instructions, please see:
      <http://www.gnu.org/software/gdb/bugs/>.
      Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.
      For help, type "help".
      Type "apropos word" to search for commands related to "word"...
      Reading symbols from /opt/amdgpu-pro/bin/clinfo...(no debugging symbols found)...done.
      (gdb) r
      Starting program: /opt/amdgpu-pro/bin/clinfo
      [Thread debugging using libthread_db enabled]
      Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
      terminate called after throwing an instance of 'cl::Error'
        what():  clGetPlatformIDs

      Program received signal SIGABRT, Aborted.
      0x00007ffff6cf3428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
      54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
      (gdb) bt
      #0  0x00007ffff6cf3428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
      #1  0x00007ffff6cf502a in __GI_abort () at abort.c:89
      #2  0x000000000045b405 in ?? ()
      #3  0x000000000045a1f6 in ?? ()
      #4  0x000000000045a223 in ?? ()
      #5  0x000000000045a32e in ?? ()
      #6  0x0000000000407b5d in ?? ()
      #7  0x000000000040f699 in ?? ()
      #8  0x0000000000407c12 in ?? ()
      #9  0x00007ffff6cde830 in __libc_start_main (main=0x407b60, argc=1, argv=0x7fffffffe598, init=<optimized out>,
          fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe588) at ../csu/libc-start.c:291
      #10 0x000000000040e741 in ?? ()

       

      If I run the Ubuntu-provided clinfo application, I get:

       

      $ clinfo

      Number of platforms                               0
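A platform count of 0 means the OpenCL ICD loader found no usable vendor driver. On Linux the loader looks for *.icd files under /etc/OpenCL/vendors, each naming the vendor library to load. The sketch below lists those files and the library each one points at; the function name is hypothetical, and an existing .icd file does not by itself prove that the named library loads successfully.

```shell
# Sketch: print "<file>.icd -> <library>" for every ICD file in a directory.
# Returns non-zero when the directory contains no .icd files at all.
list_icds() {
    dir=${1:-/etc/OpenCL/vendors}   # default location read by the ICD loader
    found=0
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue   # the glob matched nothing
        found=1
        lib=""
        # The first line of an .icd file names the vendor library.
        IFS= read -r lib < "$icd" || true
        printf '%s -> %s\n' "${icd##*/}" "$lib"
    done
    [ "$found" -eq 1 ]
}

list_icds || echo "no .icd files found in /etc/OpenCL/vendors"
```

If a file is listed but clinfo still reports 0 platforms, the next thing to check is whether the named library exists and can be loaded.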

       

      Device info:

       

      04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 687f (rev c1) (prog-if 00 [VGA controller])
              Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 6b76
              Flags: bus master, fast devsel, latency 0, IRQ 28
              Memory at c0000000 (64-bit, prefetchable) [size=256M]
              Memory at d0000000 (64-bit, prefetchable) [size=2M]
              I/O ports at dc00 [size=256]
              Memory at fcb80000 (32-bit, non-prefetchable) [size=512K]
              Expansion ROM at 000c0000 [disabled] [size=128K]
              Capabilities: [48] Vendor Specific Information: Len=08 <?>
              Capabilities: [50] Power Management version 3
              Capabilities: [64] Express Legacy Endpoint, MSI 00
              Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
              Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
              Capabilities: [150] Advanced Error Reporting
              Capabilities: [200] #15
              Capabilities: [270] #19
              Capabilities: [2a0] Access Control Services
              Capabilities: [2b0] Address Translation Service (ATS)
              Capabilities: [2c0] #13
              Capabilities: [2d0] #1b
              Capabilities: [320] Latency Tolerance Reporting
              Kernel driver in use: amdgpu
              Kernel modules: amdgpu

       

      strace of clinfo execution is attached.

      Where do I go from here?

        • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
          wdormann

          Looking at the strace log, I see that libamdoclsc64.so is missing.  What's not clear is which package provides that file, or whether its absence is what causes the crash.
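For anyone repeating this, the missing libraries can be pulled out of an strace log mechanically. This is a minimal sketch, assuming the usual strace line format in which a failed open ends in "= -1 ENOENT"; the function name and the log file name in the usage comment are hypothetical. Note that many ENOENT entries are harmless, because the dynamic linker probes several directories before it finds a library.

```shell
# Sketch: read strace output on stdin and print the unique paths of shared
# objects whose open/openat call failed with ENOENT.
missing_libs() {
    grep 'ENOENT' |
        grep -E '\.so[.0-9]*"' |                     # keep only *.so / *.so.N paths
        sed -E 's/^[^"]*"([^"]+)".*/\1/' |           # extract the first quoted path
        sort -u
}

# Usage (hypothetical log file name):
#   strace -f -o clinfo.strace /opt/amdgpu-pro/bin/clinfo
#   missing_libs < clinfo.strace
```

For a file that some installed package is supposed to own, `dpkg -S <filename>` may help identify the package.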

            • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
              wdormann

              One more follow-up: I noticed this in dmesg:

              [    6.029263] kfd kfd: skipped device 1002:687f, PCI rejects atomics

               

              Looking into this more, it seems that ROCm requires that the GPU be in a PCI Express 3.0 slot.  However, the card is in a PCI Express 2.0 slot.  My questions at this point:

              1. Is the ROCm ICD required to use OpenCL on Linux with the Vega?
              2. If so, then I would assume that this GPU / motherboard combination will not work with Linux.  However, would the same hardware combo work on Windows and allow OpenCL?
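Since this kfd message turns out to be the key symptom, it is worth checking for it directly. The following is a small sketch that scans kernel log text for the exact message quoted above and prints the affected PCI vendor:device IDs; the function name is hypothetical and the pattern only matches this wording of the message.

```shell
# Sketch: read kernel log text on stdin and print the unique vendor:device
# IDs that kfd skipped because the PCI path rejected atomics.
kfd_skipped_devices() {
    sed -n -E 's/.*kfd: skipped device ([0-9a-fA-F]{4}:[0-9a-fA-F]{4}), PCI rejects atomics.*/\1/p' |
        sort -u
}

# Usage (may need root, depending on kernel settings):
#   dmesg | kfd_skipped_devices
```

Empty output means the message was not logged; it does not by itself confirm that atomics are supported.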

               

                • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
                  wdormann

                  To answer my own question:

                  Yes, this GPU / Motherboard (with only PCI Express 2.0) combo works just fine with Windows.  The app I intend to use with OpenCL (hashcat) works perfectly.   However, for the sake of any other poor soul attempting to get the drivers/OpenCL working on Linux, it'd be nice if solutions could be posted here.

                    • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
                      stueng

                      Hi, I am in the same boat as you, trying to get a Vega 56 working in Ubuntu.

                      I followed the install guide to the letter and did not have to manually install the hardware enablement stack. After running the install script and installing ROCm, it booted into X straight away with no issues.

                      clinfo, however, reports 0 platforms, and the env command gives the exact same error that you posted.

                      I do have a PCIe 3.0 system that I can try this on, if that helps?

                    • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
                      wdormann

                      I tried testing out a different motherboard (Intel DQ77MK) that has PCI Express 3.0, and it still reports:

                      kfd kfd: skipped device 1002:687f, PCI rejects atomics

                       

                      Apparently just PCI Express 3.0 isn't enough to ensure PCI Atomics.  You seem to need to have the right combination of motherboard + CPU to let this work.  This is getting old very quickly...

                        • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
                          e97

                          Same issue using LGA 2011 C602 / X79 platform - PCI-E slots are PCI-E 3.0 x16 but kernel log shows:

                           

                          kfd kfd: skipped device 1002:687f, PCI rejects atomics

                           

                          What's the problem here? I'm unable to get my Radeon Vega 64 recognized as an OpenCL compute device (clinfo shows nothing, while rocm-smi -s shows the card and available clocks).

                           

                          I tried 17.40 and 17.50, amdgpu and amdgpu-pro drivers. I also tried --headless and --opencl=rocm and --opencl=legacy. None of them worked.

                           

                          I'm able to successfully use an RX Vega 64 and a 1080 Ti on compute workloads in Windows 10 on the same machine. Why can't I accomplish this on Ubuntu / Linux?

                        • Re: Trouble with getting amdgpu drivers working with Vega on Ubuntu 16.04.3
                          wdormann

                          I've finally obtained a CPU and motherboard combo that allows PCIe 3.0 atomics.  Frustratingly, I hit the exact same behavior that I originally posted, minus the KFD error.

                           

                          After a lot of troubleshooting, I've found the culprit:

                          For whatever reason, when Linux boots by default I get a screen that is black most of the time, and flickers on to show the contents briefly once every couple of seconds. As a workaround for this, I quickly added a "nomodeset" option into grub.  This solved the screen blinking problem.

                           

                          However, it puts me in the situation where clinfo doesn't detect the GPU!

                           

                          Given all of the troubleshooting steps, I ended up just installing ROCm on a clean Ubuntu installation, rather than the amdgpu-pro drivers (which come with ROCm).  As long as I don't have "nomodeset" in my grub configuration, OpenCL works fine.  As this is a headless build, all I need is OpenCL.  However, if I needed to see what's on the monitor, I think I'd still need to do more troubleshooting.
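Because "nomodeset" is easy to add and then forget, here is a minimal sketch that checks whether the running kernel was booted with it. The function name is hypothetical; the check only matches the bare "nomodeset" token on a space-separated command line.

```shell
# Sketch: return 0 if the given kernel command line contains the
# standalone "nomodeset" argument, 1 otherwise.
has_nomodeset() {
    case " $1 " in
        *" nomodeset "*) return 0 ;;
    esac
    return 1
}

# Check the command line of the running kernel, if it is readable.
if [ -r /proc/cmdline ] && has_nomodeset "$(cat /proc/cmdline)"; then
    echo "nomodeset is set: clinfo will likely not detect the GPU"
else
    echo "nomodeset is not set"
fi
```

On Ubuntu the option usually lives in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; after removing it there, run `sudo update-grub` and reboot.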