pdxtabs
Adept I

clinfo crashes with Vega Frontier Edition

Hello,

I just got an AMD Vega Frontier Edition, which is my first AMD OpenCL enabled card. As such, I sort of don't know what I am doing. I performed the following actions:

  1. Installed a fresh copy of Ubuntu 16.04.2 with all updates, rebooted.
  2. Installed "Radeon Vega Frontier Edition for Ubuntu 16.04.2 17.6" driver (amdgpu-pro-17.20-445420.tar.xz), rebooted.
  3. Installed APP-SDK 3.0.130 (AMD-APP-SDKInstaller-v3.0.130.136-GA-linux64.tar.bz2), rebooted.

However, when I run clinfo (or sudo clinfo) it crashes:

$ sudo clinfo

terminate called after throwing an instance of 'cl::Error'

  what():  clGetPlatformIDs

Aborted (core dumped)

Did I miss a step?

My system is as follows:

  • AMD Graphics Card
    • AMD Graphics Card: Vega Frontier Edition
  • Desktop or Laptop System
    • Desktop
  • Operating System
    • Ubuntu 16.04.2, with all updates
  • Driver version installed
    • "Radeon Vega Frontier Edition for Ubuntu 16.04.2 17.6" driver (amdgpu-pro-17.20-445420.tar.xz)
  • Display Devices
    • Dell U2413f, DVI (with adapter from Frontier Edition box), 1920x1200 @ 60Hz
  • Motherboard + Bios Revision
    • MSI B75MA-P45 BIOS 1.9 (latest)
  • CPU/APU
    • Intel i5-3470
  • Power Supply Unit  Make, Model & Wattage
    • EVGA SuperNOVA 1000 G2
  • RAM
    • 8GB

I have attached a full dmesg, strace of clinfo, and the coredump of clinfo.

Message was edited by: Tabor Kelly

Adding strace.

0 Likes
12 Replies

fsadough may be able to point you in the right direction.

0 Likes

Can you please do a system check?

System Check
The easiest way to find out if you have AMDGPU-Pro already installed on your Ubuntu System is to query the Debian package manager.

Using the following command at a terminal will provide you with the version of the AMDGPU-Pro stack on your system, or inform you that there are no packages found:

dpkg -l amdgpu-pro

0 Likes

Okay, I'm not sure how to interpret this:

$ dpkg -l amdgpu-pro

Desired=Unknown/Install/Remove/Purge/Hold

| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

||/ Name                  Version         Architecture    Description

+++-=====================-===============-===============-================================================

ii  amdgpu-pro            17.20-445420    amd64           Meta package to install amdgpu Pro components.

0 Likes
gstoner
Staff

For the Frontier Edition version of OpenCL, you do not need to install the OpenCL SDK 3.0 to do development. You should not install it: it overwrites the correct path to the OpenCL components, which is what is causing the crash.

Configuring the environment

The LLVM_BIN environment variable needs to be set prior to running applications that require OpenCL.

Set it permanently in bash, for all users:

echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh

Set it permanently in csh, for all users:

echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh
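After logging in again, a quick sanity check along these lines could confirm that the variable actually reached your shell. This is only a sketch: `check_llvm_bin` is a made-up helper name, not something shipped with the driver, and it only checks that the variable is set and points at an existing directory.

```shell
#!/bin/sh
# check_llvm_bin is a hypothetical helper, not part of amdgpu-pro.
# It reports whether LLVM_BIN is set and names an existing directory.
check_llvm_bin() {
    if [ -z "${LLVM_BIN:-}" ]; then
        echo "LLVM_BIN is not set"
        return 1
    fi
    if [ ! -d "$LLVM_BIN" ]; then
        echo "LLVM_BIN points to a missing directory: $LLVM_BIN"
        return 1
    fi
    echo "LLVM_BIN=$LLVM_BIN"
}
```

Calling `check_llvm_bin` in a fresh login shell should print `LLVM_BIN=/opt/amdgpu-pro/bin` if the profile script above was picked up.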

Please see this blog post for how best to install the driver.

0 Likes

Yes this is the post.   

0 Likes

I wiped and reloaded my Ubuntu 16.04.2 installation and followed the instructions in the blog post:

tar -Jxvf amdgpu-pro-17.20-445420.tar.xz

cd amdgpu-pro-17.20-445420

./amdgpu-pro-install -y

sudo apt install -y rocm-amdgpu-pro

echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh

echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh

sudo reboot

However, even though I have amdgpu-pro and rocm-amdgpu-pro installed and LLVM_BIN set correctly, clinfo still crashes:

$ dpkg -l amdgpu-pro

Desired=Unknown/Install/Remove/Purge/Hold

| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

||/ Name           Version      Architecture Description

+++-==============-============-============-=================================

ii  amdgpu-pro     17.20-445420 amd64        Meta package to install amdgpu Pr

$ dpkg -l rocm-amdgpu-pro

Desired=Unknown/Install/Remove/Purge/Hold

| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend

|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)

||/ Name           Version      Architecture Description

+++-==============-============-============-=================================

ii  rocm-amdgpu-pr 17.20-445420 amd64        Meta package to install OpenCL/RO

$ /opt/amdgpu-pro/bin/clinfo

terminate called after throwing an instance of 'cl::Error'

  what():  clGetPlatformIDs

Aborted (core dumped)

glxgears and glxinfo work fine. I have attached the output from glxinfo.

If someone at AMD would like to debug on my hardware, you are welcome to borrow it.

0 Likes
gstoner
Staff

Did you try typing sudo ./clinfo?

Do not install the APP-SDK 3.0.130. It also installs OpenCL headers, which overwrite the correct headers.

0 Likes

I did not install the APP-SDK after a clean install of Ubuntu 16.04.2 (with all Ubuntu updates).

$ cd /opt/amdgpu-pro/bin/

$ sudo ./clinfo

[sudo] password for XXXXX:

terminate called after throwing an instance of 'cl::Error'

  what():  clGetPlatformIDs

Aborted (core dumped)

I started this thread 9 days ago. Should I open a support case? Will I get better support? Do you want to borrow my hardware?

0 Likes

We are walking you through the steps so that we can understand why you're getting the crash. These forums are really for community support, with a moderator. I manage the team that looks after ROCm and OpenCL, which is why I stepped in 4 days ago to help. If this is a packaging issue, I may need to go to the AMDGPU-Pro Linux team, since they repackage our software. We do know this is working on other systems.

Step one was to unwind the AMD APP SDK from the stack. I have asked the Pro Graphics team to update their instructions so that they no longer say to install it.

Now we have to look at harder issues:

- Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?

- Is the GPU in the PCI_E2 slot, according to the MSI user manual, to get a full x16 electrical slot?

- Next, did you install the Intel OpenCL SDK?

Need the output of

ls /etc/OpenCL/vendors

We need to see the output logs  of the following

sudo lspci -tvv

sudo lspci -xxxx

sudo lspci -vvv
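For the vendors directory, something like the sketch below would show more than a plain ls. The OpenCL ICD loader reads the *.icd files in /etc/OpenCL/vendors, and each file names the vendor library to load, so printing both makes a stray or missing entry easy to spot. `list_icds` is a made-up helper name, and it assumes the library name is on the first non-empty line of each file.

```shell
#!/bin/sh
# list_icds is a hypothetical helper. It prints each .icd file in the
# given directory (default: /etc/OpenCL/vendors) together with the
# library named on its first non-empty line.
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    found=0
    for f in "$dir"/*.icd; do
        [ -e "$f" ] || continue      # glob did not match anything
        found=1
        lib=$(sed -n '/[^[:space:]]/{p;q;}' "$f")
        echo "$(basename "$f"): $lib"
    done
    if [ "$found" -eq 0 ]; then
        echo "no .icd files in $dir"
        return 1
    fi
}
```

Running `list_icds` with no argument checks the default directory. If it reports no .icd files, the loader has no platform to offer, which would be one way for clGetPlatformIDs to fail.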

Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver, which addresses some issues we found in the power management firmware. My team rolls out this driver, so we can debug issues quickly.

ROCm Install

Quickstart OpenCL 

In your dmesg log I saw these errors, which I need to talk to the AMDGPU team about on Monday.

One thing I am seeing is that you need to talk to MSI about an ACPI issue.

[    1.057762] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

[    1.495876] amdgpu: [powerplay] Cannot find requested DCEFCLK!

[    1.764797] amdgpu: [powerplay] Cannot find requested DCEFCLK!

I will look over more of your strace.
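To pull just the GPU-related lines out of a long dmesg log like the ones quoted above, a small filter along these lines may help. It is only a sketch: `filter_gpu_log` is a made-up name, and matching on the strings "amdgpu" and "kfd" is an assumption based on the prefixes seen in this thread's log lines.

```shell
#!/bin/sh
# filter_gpu_log is a hypothetical helper. It reads a kernel log on
# stdin and keeps only lines mentioning the amdgpu or kfd drivers.
filter_gpu_log() {
    grep -E 'amdgpu|kfd'
}
```

Typical use would be `dmesg | filter_gpu_log`, or `filter_gpu_log < dmesg.txt` for a saved log.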

0 Likes

Now we have to look at harder issues

- Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?

- Is the GPU in the PCI_E2 slot according to the MSI user manual to get a full x16 electrical slot

- Next, did you install the Intel OpenCL SDK?

As stated in the first post, this is the processor: http://ark.intel.com/products/68316/Intel-Core-i5-3470-Processor-6M-Cache-up-to-3_60-GHz

I have the card plugged into the only 16 lane PCIe slot on the motherboard ("PCI_E1 PCIe x16 Expansion Slot" by my reading of the manual).

I did not install the Intel OpenCL SDK.

Need the output of

ls /etc/OpenCL/vendors

We need to see the output logs of the following

sudo lspci -tvv

sudo lspci -xxxx

sudo lspci -vvv

Please find attached. Of note, there is an error when executing sudo lspci -vvv "pcilib: sysfs_read_vpd: read failed: Input/output error" (I included it in the log).
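For anyone collecting the same logs: errors like the pcilib one above go to stderr, so a plain `> file` redirect would lose them. A wrapper along these lines keeps stdout, stderr, and the exit status together in one file. `run_logged` is a made-up helper name, and this is just one way to do it.

```shell
#!/bin/sh
# run_logged is a hypothetical helper.
# Usage: run_logged OUTFILE COMMAND [ARGS...]
# Writes the command line, its combined stdout and stderr, and its
# exit status to OUTFILE.
run_logged() {
    out="$1"
    shift
    status=0
    {
        printf '$ %s\n' "$*"
        "$@" 2>&1 || status=$?
        echo "[exit status: $status]"
    } > "$out"
}
```

For example, `run_logged lspci-vvv.log sudo lspci -vvv` would capture the "read failed" message alongside the normal output.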

Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver, which addresses some issues we found in the power management firmware. My team rolls out this driver, so we can debug issues quickly.

ROCm Install

Quickstart OpenCL

The open source ROCm has the following to say:

Supported CPU

The ROCm platform leverages modern CPUs with support for PCIe Gen 3, which also support PCIe Atomics (Fetch ADD, Compare and SWAP, Unconditional SWAP, AtomicsOpCompletion). To find out more, see https://github.com/RadeonOpenCompute/RadeonOpenCompute.github.io/blob/master/ROCmPCIeFeatures.md

When you install your GPUs, make sure you install them on real PCIe Gen3 x16 or x8 lanes directly on the CPU's root I/O controller or a PCIe switch directly attached to the CPU's root I/O controller. We have seen many issues with consumer motherboards which support physical x16 connectors, but where the connector is electrically connected as PCIe Gen2 x4. If you see this, it is typically hanging off the southbridge PCIe I/O controller. If your motherboard is configured this way, please do not use this connector for your GPU.

I have no idea if my motherboard meets these requirements. Is this an undocumented requirement for OpenCL on this card? None of the marketing material that I read before purchasing this card called this out as a requirement (https://pro.radeon.com/en-us/product/radeon-vega-frontier-edition/):

Requirements:

  • Typical Board Power: 300W
  • PSU Recommendation: >850W
  • Required PCI Slots: 2

The owners manual says:

SYSTEM REQUIREMENTS

...

  • PCI Express-based PC with at least one x16 lane graphics slot available on the motherboard.
  • Min 750W System power supply with two 8-pin PCIe power connectors.

Thank you for your help. I have relocated the workstation in question to make it easier to supply any additional information that you may need on a Monday-Friday basis. If I had to buy a new motherboard and processor to use this card it wouldn't be the end of the world, but it would be disappointing as I picked it out as the most card that I could stuff into this workstation (only one PCI-e x16 slot) and I've already spent a lot of money to get it working (card, power supply, shipping, tax).

First, it looks like all of the configuration files are correct (the vendor file, location of the libraries, etc.), but I found this in your dmesg output: [    1.699813] kfd kfd: skipped device 1002:6863, PCI rejects atomics

This indicates that the KFD driver tried to initialize the 6863 device but failed because PCIe atomics are not supported in your current configuration. This is most likely due to the PCIe slot the card is installed in. Without atomics support, the ROCm driver stack will not be able to recognize the card. However, I noticed this was the specification for the PCIe slots on your motherboard:


Expansion Slot(s) : 1 x processor - LGA1155 Socket

4 x memory - DIMM 240-pin

1 x PCI Express 3.0 x16

1 x PCI Express 2.0 x1

1 x PCI

Your slot is 3.0 capable, but I can't tell if it enables the atomic extension. I will need to analyze your lspci output to make sure.
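If you want to check the same things on your side, the two checks could be sketched as below. Both function names are made up. The first looks for the "PCI rejects atomics" message from the dmesg line quoted earlier. The second pulls the device header and the LnkCap/LnkSta lines out of `lspci -vvv` output, plus an AtomicOpsCap line if your pciutils version prints one (that part is an assumption, and older versions may not show it).

```shell
#!/bin/sh
# Hypothetical helpers; both read a log on stdin.

# Succeeds if the kernel log contains the KFD "PCI rejects atomics" message.
kfd_rejects_atomics() {
    grep -q 'PCI rejects atomics'
}

# Keeps device header lines plus the PCIe link capability/status lines
# from `lspci -vvv` output. AtomicOpsCap is matched only if present.
pcie_link_summary() {
    grep -E '^([0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] |LnkCap:|LnkSta:|AtomicOpsCap:'
}
```

Usage would be something like `dmesg | kfd_rejects_atomics && echo "atomics rejected"` and `sudo lspci -vvv -s 03:00.0 | pcie_link_summary`, where 03:00.0 is the address the amdgpu lines in this thread's dmesg report for the card. A LnkSta line showing a lower speed or width than LnkCap would suggest the slot is not running at full Gen3 x16.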

0 Likes