Hello,
I just got an AMD Vega Frontier Edition, which is my first AMD OpenCL enabled card. As such, I sort of don't know what I am doing. I performed the following actions:
However, when I run clinfo (or sudo clinfo) it crashes:
$ sudo clinfo
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
Aborted (core dumped)
Did I miss a step?
My system is as follows:
I have attached a full dmesg, strace of clinfo, and the coredump of clinfo.
Message was edited by: Tabor Kelly
Adding strace.
fsadough may be able to point you in the right direction.
Can you please do a system check?
System Check
The easiest way to find out if you have AMDGPU-Pro already installed on your Ubuntu System is to query the Debian package manager.
Using the following command at a terminal will provide you with the version of the AMDGPU-Pro stack on your system, or inform you that there are no packages found:
dpkg -l amdgpu-pro
Okay, I'm not sure how to interpret this:
$ dpkg -l amdgpu-pro
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================-===============-===============-================================================
ii amdgpu-pro 17.20-445420 amd64 Meta package to install amdgpu Pro components.
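For anyone else unsure how to read this table: the two letters at the start of a package row are the desired state and the current status, so "ii" means the package is wanted and fully installed. Here is a minimal sketch that condenses `dpkg -l` output into one line per package. The helper name `dpkg_summary` is made up for illustration and is not part of dpkg.

```shell
#!/bin/sh
# Hypothetical helper: condense `dpkg -l` output (read on stdin) into
# "name version state" lines. Header rows are skipped because their first
# field is not a two-letter status code.
dpkg_summary() {
    awk '$1 ~ /^[a-zA-Z][a-zA-Z]$/ && NF >= 3 {
        if ($1 == "ii")
            state = "installed"      # desired=install, status=installed
        else
            state = "state=" $1      # e.g. "rc" = removed, config files remain
        print $2, $3, state
    }'
}

# Usage (not run here):
#   dpkg -l amdgpu-pro | dpkg_summary
```

For the output above this should print the package name, the version 17.20-445420, and "installed".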
For the Frontier Edition version of OpenCL, you do not need to install the OpenCL SDK 3.0 to do development. You should not install it: it overwrites the correct path to the OpenCL components, which is what causes the crash.
Configuring the environment
The LLVM_BIN environment variable needs to be set prior to running applications that require OpenCL.
Set it permanently in bash, for all users:
echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh
Set it permanently in csh, for all users:
echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh
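Files in /etc/profile.d are normally only read by login shells, so the variable may not be visible until you log in again. sudo also usually resets the environment, so a command run through sudo may not see it. A small sketch for checking the current shell follows; the function name `check_llvm_bin` is an illustrative assumption, not something the driver package provides.

```shell
#!/bin/sh
# Hypothetical helper: report whether LLVM_BIN is set in the current shell
# and whether it points at an existing directory.
check_llvm_bin() {
    if [ -z "${LLVM_BIN:-}" ]; then
        echo "LLVM_BIN is not set"
        return 1
    fi
    if [ ! -d "$LLVM_BIN" ]; then
        echo "LLVM_BIN points to a missing directory: $LLVM_BIN"
        return 1
    fi
    echo "LLVM_BIN=$LLVM_BIN"
}

# Usage (not run here):
#   check_llvm_bin
```

If the check fails in a fresh terminal, logging out and back in (or rebooting) is the first thing to try.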
Please see this blog post for how best to install the driver.
This blog post? http://gpuopen.com/vega-frontier-installing-the-driver/
Yes this is the post.
I wiped and reloaded my Ubuntu 16.04.2 installation and followed the instruction in the blog post:
tar -Jxvf amdgpu-pro-17.20-445420.tar.xz
cd amdgpu-pro-17.20-445420
./amdgpu-pro-install -y
sudo apt install -y rocm-amdgpu-pro
echo 'export LLVM_BIN=/opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.sh
echo 'setenv LLVM_BIN /opt/amdgpu-pro/bin' | sudo tee /etc/profile.d/amdgpu-pro.csh
sudo reboot
However, even though I have amdgpu-pro and rocm-amdgpu-pro installed and LLVM_BIN set correctly, clinfo still crashes:
$ dpkg -l amdgpu-pro
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-============-============-=================================
ii amdgpu-pro 17.20-445420 amd64 Meta package to install amdgpu Pr
$ dpkg -l rocm-amdgpu-pro
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-============-============-=================================
ii rocm-amdgpu-pr 17.20-445420 amd64 Meta package to install OpenCL/RO
$ /opt/amdgpu-pro/bin/clinfo
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
Aborted (core dumped)
glxgears and glxinfo work fine. I have attached the output from glxinfo.
If someone at AMD would like to debug on my hardware, you are welcome to borrow it.
Did you try typing sudo ./clinfo?
Do not install the APP-SDK 3.0.130. It also installs OpenCL headers, which overwrite the correct headers.
I did not install the APP-SDK after a clean install of Ubuntu 16.04.2 (with all Ubuntu updates).
$ cd /opt/amdgpu-pro/bin/
$ sudo ./clinfo
[sudo] password for XXXXX:
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
Aborted (core dumped)
I started this thread 9 days ago. Should I open a support case? Will I get better support? Do you want to borrow my hardware?
We are walking you through the steps so that we can understand why you're getting the crash. These forums are really for community support, with a moderator. I manage the team that looks after ROCm and OpenCL, which is why I stepped in 4 days ago to help. I may need to go to the AMDGPU-Pro Linux team, since they repackage our software, if this turns out to be a packaging issue. We do know this is working on other systems.
Step one: we had to unwind the AMD SDK from the stack. I have asked the Prographic team to update their instructions so that they no longer say to install it.
Now we have to look at harder issues:
- Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?
- Is the GPU in the PCI_E2 slot, as the MSI user manual specifies for a full x16 electrical slot?
- Next, did you install the Intel OpenCL SDK?
I need the output of:
ls /etc/OpenCL/vendors
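Each .icd file in that directory normally holds the name or path of one vendor's OpenCL library, so it helps to see the contents as well as the file names. A minimal sketch, where the helper name `list_icds` is an assumption for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print each ICD registration file and the library it
# names. Takes the vendors directory as an optional argument and defaults
# to /etc/OpenCL/vendors.
list_icds() {
    dir=${1:-/etc/OpenCL/vendors}
    for f in "$dir"/*.icd; do
        [ -e "$f" ] || continue      # no .icd files: the glob stays literal
        printf '%s -> %s\n' "${f##*/}" "$(head -n 1 "$f")"
    done
}

# Usage (not run here):
#   list_icds
```

An empty result, or an entry naming a library that does not exist on disk, would be one possible reason for clGetPlatformIDs to fail.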
We need to see the output logs of the following:
sudo lspci -tvv
sudo lspci -xxxx
sudo lspci -vvv
Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver, which addresses some issues we found in the power management firmware. My team rolls this driver out, so we can debug issues quickly.
In your dmesg log I saw this error, which I need to talk to the AMDGPU team about on Monday.
One thing I am seeing is that you need to talk to MSI about an ACPI issue.
[ 1.057762] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 1.495876] amdgpu: [powerplay] Cannot find requested DCEFCLK!
[ 1.764797] amdgpu: [powerplay] Cannot find requested DCEFCLK!
I will look over your strace some more.
Now we have to look at harder issues:
- Also, you did not give processor details. Is this a Core i5 v3 Haswell processor?
- Is the GPU in the PCI_E2 slot, as the MSI user manual specifies for a full x16 electrical slot?
- Next, did you install the Intel OpenCL SDK?
As stated in the first post, this is the processor: http://ark.intel.com/products/68316/Intel-Core-i5-3470-Processor-6M-Cache-up-to-3_60-GHz
I have the card plugged into the only 16 lane PCIe slot on the motherboard ("PCI_E1 PCIe x16 Expansion Slot" by my reading of the manual).
I did not install the Intel OpenCL SDK.
Need the output of
ls /etc/OpenCL/vendors
We need to see the output logs of the following:
sudo lspci -tvv
sudo lspci -xxxx
sudo lspci -vvv
Please find attached. Of note, there is an error when executing sudo lspci -vvv "pcilib: sysfs_read_vpd: read failed: Input/output error" (I included it in the log).
Another option is to install and test the open source ROCm driver, which is what we work on. Use the following instructions. Note that on Monday we are rolling out the ROCm 1.6.1 driver, which addresses some issues we found in the power management firmware. My team rolls this driver out, so we can debug issues quickly.
The open source ROCm has the following to say:
The ROCm platform leverages modern CPUs with PCIe Gen 3 support, which also support PCIe atomics (Fetch ADD, Compare and SWAP, Unconditional SWAP, AtomicOp Completion). To find out more, see https://github.com/RadeonOpenCompute/RadeonOpenCompute.github.io/blob/master/ROCmPCIeFeatures.md
When you install your GPUs, make sure you install them in real PCIe Gen3 x16 or x8 lanes directly on the CPU's root I/O controller, or on a PCIe switch directly attached to the CPU's root I/O controller. We have seen many issues with consumer motherboards that have physical x16 connectors where the connector is electrically connected as PCIe Gen2 x4. If you see this, the connector is typically hanging off the Southbridge PCIe I/O controller. If your motherboard is configured this way, please do not use this connector for your GPU.
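One way to see what a slot actually negotiates is to compare the LnkCap (capability) and LnkSta (current status) lines that lspci prints for the card. A rough sketch that pulls out just the speed and width follows. The function name `pcie_link_summary` is an illustrative assumption, and the device address in the usage line is the one shown for amdgpu in the dmesg excerpt earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the
# speed and width from each LnkCap / LnkSta line.
pcie_link_summary() {
    awk '$1 == "LnkCap:" || $1 == "LnkSta:" {
        speed = "?"; width = "?"
        for (i = 2; i < NF; i++) {
            if ($i == "Speed") speed = $(i + 1)
            if ($i == "Width") width = $(i + 1)
        }
        gsub(/,/, "", speed); gsub(/,/, "", width)   # drop trailing commas
        label = $1; sub(/:$/, "", label)
        print label, speed, width
    }'
}

# Usage (not run here):
#   sudo lspci -vv -s 03:00.0 | pcie_link_summary
```

A Gen3 x16 link should report 8GT/s and x16. A status line showing 5GT/s or x4 would suggest the kind of slot the ROCm note warns about, although the card may also lower its link speed when idle.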
I have no idea if my motherboard meets these requirements. Is this an undocumented requirement for OpenCL on this card? None of the marketing material that I read before purchasing this card called this out as a requirement (https://pro.radeon.com/en-us/product/radeon-vega-frontier-edition/):
Requirements:
- Typical Board Power: 300W
- PSU Recommendation: >850W
- Required PCI Slots: 2
The owner's manual says:
SYSTEM REQUIREMENTS
...
- PCI Express-based PC with at least one x16 lane graphics slot available on the motherboard.
- Min 750W System power supply with two 8-pin PCIe power connectors.
Thank you for your help. I have relocated the workstation in question to make it easier to supply any additional information that you may need on a Monday-Friday basis. If I had to buy a new motherboard and processor to use this card it wouldn't be the end of the world, but it would be disappointing as I picked it out as the most card that I could stuff into this workstation (only one PCI-e x16 slot) and I've already spent a lot of money to get it working (card, power supply, shipping, tax).
First, it looks like all of the configuration files are correct (the vendor file, location of the libraries, etc.), but I found this in your dmesg output:
[ 1.699813] kfd kfd: skipped device 1002:6863, PCI rejects atomics
This indicates that the KFD driver tried to initialize the 6863 device but failed because PCIe atomics are not supported in your current configuration. The most likely cause is the PCIe slot the card is installed in. Without atomics support, the ROCm driver stack cannot recognize the card. However, I noticed this specification for the PCIe slots on your motherboard:
Expansion Slot(s) : 1 x processor - LGA1155 Socket
4 x memory - DIMM 240-pin
1 x PCI Express 3.0 x16
1 x PCI Express 2.0 x1
1 x PCI
Your slot is 3.0 capable, but I can't tell if it enables the atomic extension. I will need to analyze your lspci output to make sure.
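For anyone who wants to run the same two checks on their own logs, here is a rough sketch. Both function names are made up for illustration. The second check assumes a pciutils version recent enough to decode AtomicOpsCap / AtomicOpsCtl lines; older lspci builds may print nothing for it, which would not prove that atomics are unsupported.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# vendor:device ID of each GPU that kfd skipped with the
# "PCI rejects atomics" message.
kfd_rejected_devices() {
    sed -n 's/.*kfd kfd: skipped device \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\), PCI rejects atomics.*/\1/p'
}

# Hypothetical helper: read `lspci -vvv` output on stdin and print any
# AtomicOps capability or control lines, with leading whitespace removed.
# Assumption: the installed lspci decodes these fields at all.
atomic_ops_lines() {
    grep -E 'AtomicOps(Cap|Ctl):' | sed 's/^[[:space:]]*//'
}

# Usage (not run here):
#   dmesg | kfd_rejected_devices
#   sudo lspci -vvv | atomic_ops_lines
```

On the dmesg line quoted above, the first helper should print 1002:6863.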