12 Replies Latest reply on Jul 26, 2017 1:20 PM by jedwards

    clinfo crashes with Vega Frontier Edition

    pdxtabs

      Hello,

       

      I just got an AMD Vega Frontier Edition, which is my first AMD OpenCL enabled card. As such, I sort of don't know what I am doing. I performed the following actions:

       

      1. Installed a fresh copy of Ubuntu 16.04.2 with all updates, rebooted.
      2. Installed "Radeon Vega Frontier Edition for Ubuntu 16.04.2 17.6" driver (amdgpu-pro-17.20-445420.tar.xz), rebooted.
      3. Installed APP-SDK 3.0.130 (AMD-APP-SDKInstaller-v3.0.130.136-GA-linux64.tar.bz2), rebooted.

       

      However, when I run clinfo (or sudo clinfo) it crashes:

      $ sudo clinfo

      terminate called after throwing an instance of 'cl::Error'

        what():  clGetPlatformIDs

      Aborted (core dumped)

       

      Did I miss a step?

       

      My system is as follows:

      • AMD Graphics Card
        • AMD Graphics Card: Vega Frontier Edition
      • Desktop or Laptop System
        • Desktop
      • Operating System
        • Ubuntu 16.04.2, with all updates
      • Driver version installed
        • "Radeon Vega Frontier Edition for Ubuntu 16.04.2 17.6" driver (amdgpu-pro-17.20-445420.tar.xz)
      • Display Devices
        • Dell U2413f, DVI (with adapter from Frontier Edition box), 1920x1200 @ 60Hz
      • Motherboard + Bios Revision
        • MSI B75MA-P45 BIOS 1.9 (latest)
      • CPU/APU
        • Intel i5-3470
      • Power Supply Unit  Make, Model & Wattage
        • EVGA SuperNOVA 1000 G2
      • RAM
        • 8GB

      I have attached a full dmesg, strace of clinfo, and the coredump of clinfo.

       

      Message was edited by: Tabor Kelly Adding strace.

        • Re: clinfo crashes with Vega Frontier Edition
          goodplay

          fsadough may be able to point you in the right direction.

            • Re: clinfo crashes with Vega Frontier Edition
              fsadough

              Can you please do a system check?

               

              System Check
              The easiest way to find out if you have AMDGPU-Pro already installed on your Ubuntu System is to query the Debian package manager.

              Using the following command at a terminal will provide you with the version of the AMDGPU-Pro stack on your system, or inform you that there are no packages found:

              dpkg -l amdgpu-pro

                • Re: clinfo crashes with Vega Frontier Edition
                  pdxtabs

                  Okay, I'm not sure how to interpret this:

                  $ dpkg -l amdgpu-pro

                  Desired=Unknown/Install/Remove/Purge/Hold

                  | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

                  |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

                  ||/ Name                  Version         Architecture    Description

                  +++-=====================-===============-===============-================================================

                  ii  amdgpu-pro            17.20-445420    amd64           Meta package to install amdgpu Pro components.
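For reading the dpkg -l table above: per the legend in its own header, the first letter of the leading code is the desired state and the second is the current status, so "ii" means the package is wanted and installed. A minimal sketch of a helper that summarizes such rows (the function name dpkg_row_status is hypothetical, not part of dpkg):

```shell
# Hypothetical helper: summarize `dpkg -l` rows read from stdin.
# Header and separator lines are skipped because their first field
# is not a two- or three-letter status code.
dpkg_row_status() {
  awk '$1 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z]?$/ && NF >= 3 {
    state = ($1 == "ii") ? "installed" : "not fully installed"
    printf "%s %s: %s\n", $2, $3, state
  }'
}

# Usage (assumes a Debian/Ubuntu system):
#   dpkg -l amdgpu-pro | dpkg_row_status
```

Any code other than "ii" (for example "rc", removed with config files left behind) is reported as not fully installed.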

              • Re: clinfo crashes with Vega Frontier Edition
                gstoner

                For the Frontier Edition version of OpenCL, you do not need to install the OpenCL SDK 3.0 to do development. You should not install it: it overwrites the correct path to the OpenCL components, which is what causes the crash.

                 

                Configuring the environment

                The LLVM_BIN environment variable needs to be set prior to running applications that require OpenCL.

                 

                 

                Set it permanently in bash, for all users:

                 

                 

                echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh

                Set it permanently in csh, for all users:

                 

                 

                echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh
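Files under /etc/profile.d are normally read by login shells, so the variable only appears in sessions started after the file was written. A minimal sketch for confirming that the current shell sees it (the function name check_llvm_bin is hypothetical):

```shell
# Hypothetical helper: report whether LLVM_BIN is set in this shell
# and whether it points at an existing directory.
check_llvm_bin() {
  if [ -z "${LLVM_BIN:-}" ]; then
    echo "LLVM_BIN is not set"
    return 1
  fi
  if [ ! -d "$LLVM_BIN" ]; then
    echo "LLVM_BIN points to a missing directory: $LLVM_BIN"
    return 1
  fi
  echo "LLVM_BIN=$LLVM_BIN"
}

# Usage: run in a fresh login shell after the profile.d file exists.
#   check_llvm_bin
```

One thing to keep in mind: sudo is commonly configured to reset the environment, so a command run through sudo may not see LLVM_BIN unless the environment is preserved (for example with sudo -E).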

                 

                Please see this blog post for how best to install the driver.

                  • Re: clinfo crashes with Vega Frontier Edition
                    pdxtabs

                    I wiped and reloaded my Ubuntu 16.04.2 installation and followed the instructions in the blog post:

                    tar -Jxvf amdgpu-pro-17.20-445420.tar.xz

                    cd amdgpu-pro-17.20-445420

                    ./amdgpu-pro-install -y

                    sudo apt install -y rocm-amdgpu-pro

                    echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh

                    echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh

                    sudo reboot

                    However, even though I have amdgpu-pro and rocm-amdgpu-pro installed and LLVM_BIN set correctly, clinfo still crashes:

                    $ dpkg -l amdgpu-pro

                    Desired=Unknown/Install/Remove/Purge/Hold

                    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

                    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

                    ||/ Name           Version      Architecture Description

                    +++-==============-============-============-=================================

                    ii  amdgpu-pro     17.20-445420 amd64        Meta package to install amdgpu Pr

                    $ dpkg -l rocm-amdgpu-pro

                    Desired=Unknown/Install/Remove/Purge/Hold

                    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

                    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

                    ||/ Name           Version      Architecture Description

                    +++-==============-============-============-=================================

                    ii  rocm-amdgpu-pr 17.20-445420 amd64        Meta package to install OpenCL/RO

                    $ /opt/amdgpu-pro/bin/clinfo

                    terminate called after throwing an instance of 'cl::Error'

                      what():  clGetPlatformIDs

                    Aborted (core dumped)

                    glxgears and glxinfo work fine. I have attached the output from glxinfo.

                     

                    If someone at AMD would like to debug on my hardware, you are welcome to borrow it.

                  • Re: clinfo crashes with Vega Frontier Edition
                    gstoner

                    Did you try typing sudo ./clinfo?

                     

                    Do not install the APP-SDK 3.0.130. It also installs OpenCL headers, which overwrite the correct headers.

                      • Re: clinfo crashes with Vega Frontier Edition
                        pdxtabs

                        I did not install the APP-SDK after a clean install of Ubuntu 16.04.2 (with all Ubuntu updates).

                        $ cd /opt/amdgpu-pro/bin/

                        $ sudo ./clinfo

                        [sudo] password for XXXXX:

                        terminate called after throwing an instance of 'cl::Error'

                          what():  clGetPlatformIDs

                        Aborted (core dumped)

                        I started this thread 9 days ago. Should I open a support case? Will I get better support? Do you want to borrow my hardware?

                          • Re: clinfo crashes with Vega Frontier Edition
                            gstoner

                            We are walking you through the steps to get to where we understand why you're getting the crash. These forums are really for community support, with a moderator. I manage the team that looks after ROCm and OpenCL, which is why I stepped in 4 days ago to help. I may need to go to the AMDGPU-Pro Linux team, since they repackage our software, if this is a packaging issue. We do know this is working on other systems.

                             

                            Step one: we had to unwind the AMD SDK from the stack. I have asked the Pro Graphics team to update their instructions to not install it.

                             

                            Now we have to look at harder issues:

                            - Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?

                            - Is the GPU in the PCI_E2 slot, which according to the MSI user manual gives a full x16 electrical slot?

                            - Next, did you install the Intel OpenCL SDK?

                             

                            Need the output of

                            ls /etc/OpenCL/vendors
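On Linux, the OpenCL ICD loader conventionally discovers vendor drivers through .icd files in this directory, each naming a vendor library. A clGetPlatformIDs failure can be consistent with the loader finding no usable entry there, though that is only one possible cause. A minimal sketch that prints each file together with the library it names (the function name show_icds is hypothetical):

```shell
# Hypothetical helper: list each .icd file in a vendors directory
# together with the library name on its first line.
show_icds() {
  icd_dir="${1:-/etc/OpenCL/vendors}"
  for icd_file in "$icd_dir"/*.icd; do
    # If the glob matched nothing, the literal pattern is left behind.
    if [ ! -e "$icd_file" ]; then
      echo "no .icd files in $icd_dir"
      return 1
    fi
    printf '%s -> %s\n' "$(basename "$icd_file")" "$(head -n 1 "$icd_file")"
  done
}

# Usage:
#   show_icds                      # default directory
#   show_icds /etc/OpenCL/vendors  # explicit directory
```

Leftover entries from another OpenCL installation would also show up in this listing, which is presumably why the directory contents were requested.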

                             

                            We need to see the output logs  of the following

                            sudo lspci -tvv

                            sudo lspci -xxxx

                            sudo lspci -vvv
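For the x16 slot question, the LnkCap and LnkSta lines in the verbose lspci output show what link speed and width each device supports and has actually negotiated. A minimal sketch that condenses those lines per device (the function name pcie_link_summary is hypothetical; it assumes the usual "Speed ..., Width ..." layout of those lines):

```shell
# Hypothetical helper: condense `lspci -vv` output read from stdin into
# one line per LnkCap/LnkSta entry, prefixed with the device address.
pcie_link_summary() {
  awk '
    /^[0-9a-f]/ { dev = $1 }   # unindented line starts a new device
    /Lnk(Cap|Sta):/ {
      kind = ($0 ~ /LnkCap:/) ? "LnkCap" : "LnkSta"
      speed = $0; sub(/.*Speed /, "", speed); sub(/,.*/, "", speed)
      width = $0; sub(/.*Width /, "", width); sub(/,.*/, "", width)
      printf "%s %s speed=%s width=%s\n", dev, kind, speed, width
    }'
}

# Usage (root is usually needed to read the capability fields):
#   sudo lspci -vv | pcie_link_summary
```

Comparing LnkSta against LnkCap for the GPU and for the bridges above it gives a rough idea of whether the slot is running at the expected width. It does not by itself say anything about atomics support.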

                             

                             

                            Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver; it addresses some issues we found in the power management firmware. My team rolls out this driver, so we can debug issues quickly.

                            ROCm Install

                            Quickstart OpenCL 

                             

                            In your dmesg log I saw this error, which I need to talk to the AMDGPU team about on Monday.

                            One thing I am seeing is that you need to talk to MSI about an ACPI issue.

                            [    1.057762] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

                            [    1.495876] amdgpu: [powerplay] Cannot find requested DCEFCLK!

                            [    1.764797] amdgpu: [powerplay] Cannot find requested DCEFCLK!

                             

                            I will look over your strace some more.

                              • Re: clinfo crashes with Vega Frontier Edition
                                pdxtabs

                                Now we have to look at harder issues:

                                - Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?

                                - Is the GPU in the PCI_E2 slot, which according to the MSI user manual gives a full x16 electrical slot?

                                - Next, did you install the Intel OpenCL SDK?

                                As stated in the first post, this is the processor: http://ark.intel.com/products/68316/Intel-Core-i5-3470-Processor-6M-Cache-up-to-3_60-GHz

                                I have the card plugged into the only 16 lane PCIe slot on the motherboard ("PCI_E1 PCIe x16 Expansion Slot" by my reading of the manual).

                                I did not install the Intel OpenCL SDK.

                                 

                                Need the output of

                                ls /etc/OpenCL/vendors

                                 

                                We need to see the output logs of the following

                                sudo lspci -tvv

                                sudo lspci -xxxx

                                sudo lspci -vvv

                                Please find attached. Of note, there is an error when executing sudo lspci -vvv "pcilib: sysfs_read_vpd: read failed: Input/output error" (I included it in the log).

                                Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver; it addresses some issues we found in the power management firmware. My team rolls out this driver, so we can debug issues quickly.

                                ROCm Install

                                Quickstart OpenCL

                                The open source ROCm has the following to say:

                                Supported CPU

                                The ROCm platform leverages modern CPUs with support for PCIe Gen 3, which also support PCIe Atomics (Fetch ADD, Compare and SWAP, Unconditional SWAP, AtomicOp Completion). To find out more, see https://github.com/RadeonOpenCompute/RadeonOpenCompute.github.io/blob/master/ROCmPCIeFeatures.md

                                When you install your GPUs, make sure you install them on real PCIe Gen3 x16 or x8 lanes directly on the CPU's root I/O controller, or on a PCIe switch directly attached to the CPU's root I/O controller. We have seen many issues with consumer motherboards that have physical x16 connectors where the connector is electrically connected as PCIe Gen2 x4; in that case it is typically hanging off the southbridge PCIe I/O controller. If your motherboard is configured this way, please do not use this connector for your GPU.

                                I have no idea if my motherboard meets these requirements. Is this an undocumented requirement for OpenCL on this card? None of the marketing material that I read before purchasing this card called this out as a requirement (https://pro.radeon.com/en-us/product/radeon-vega-frontier-edition/):

                                Requirements:

                                • Typical Board Power: 300W
                                • PSU Recommendation: >850W
                                • Required PCI Slots: 2

                                The owner's manual says:

                                SYSTEM REQUIREMENTS

                                ...

                                • PCI Express-based PC with at least one x16 lane graphics slot available on the motherboard.
                                • Min 750W System power supply with two 8-pin PCIe power connectors.

                                Thank you for your help. I have relocated the workstation in question to make it easier to supply any additional information that you may need on a Monday-Friday basis. If I had to buy a new motherboard and processor to use this card it wouldn't be the end of the world, but it would be disappointing as I picked it out as the most card that I could stuff into this workstation (only one PCI-e x16 slot) and I've already spent a lot of money to get it working (card, power supply, shipping, tax).

                                  • Re: clinfo crashes with Vega Frontier Edition
                                    jedwards

                                    First, it looks like all of the configuration files are correct (the vendor file, location of the libraries, etc.), but I found this in your dmesg output: [    1.699813] kfd kfd: skipped device 1002:6863, PCI rejects atomics

                                    .

                                    This indicates that the KFD driver tried to initialize the 6863 device but failed because PCIe atomics are not supported in your current configuration, most likely because of the PCIe slot the card is installed in. Without atomics support, the ROCm driver stack will not be able to recognize the card. However, I noticed this was the specification for the PCIe slots on your motherboard:

                                    .

                                    Expansion Slot(s) : 1 x processor - LGA1155 Socket

                                    4 x memory - DIMM 240-pin

                                    1 x PCI Express 3.0 x16

                                    1 x PCI Express 2.0 x1

                                    1 x PCI

                                    .

                                    Your slot is 3.0 capable, but I can't tell if it enables the atomic extension. I will need to analyze your lspci output to make sure.
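The dmesg check described in this reply can be repeated on any system. A minimal sketch that pulls out the vendor:device IDs of GPUs the KFD driver skipped for this reason (the function name kfd_rejected_devices is hypothetical; the pattern is taken from the message quoted in this thread and may differ between driver versions):

```shell
# Hypothetical helper: read kernel log text from stdin and print the
# vendor:device ID of every GPU that KFD skipped because the PCIe path
# rejected atomics. Prints nothing if no such message is present.
kfd_rejected_devices() {
  sed -n -E 's/.*kfd.*skipped device ([0-9a-fA-F]{4}:[0-9a-fA-F]{4}), PCI rejects atomics.*/\1/p'
}

# Usage:
#   dmesg | kfd_rejected_devices
```

An empty result only means this particular message was not logged. It does not prove that atomics are available.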