OpenCL

idtxl_goettingen
Journeyman III

VMC page fault on Radeon Pro WX9100

We are trying to run scientific OpenCL codes under Ubuntu 16.04.5 (kernel 4.15) on Vega 10 based cards (Radeon Pro WX9100, Vega 64). The card is installed in an HP Z620 workstation with a Xeon processor from the Sandy Bridge generation (no PCIe atomics). Therefore, we use the Radeon Pro releases (amdgpu-pro bundles) and the PAL OpenCL driver, as recommended on the driver download pages. Our codes are the systemtests_ of the IDTxl toolbox (GitHub - pwollstadt/IDTxl). These codes ran fine on R9 290X cards using the old fglrx drivers, and they also run on our NVIDIA cards.

 

On the Vega 10 based cards the system hangs (ssh access still works) after a random amount of time: typically a few hours, sometimes a day, sometimes only a few minutes. A remote shutdown -r is not always successful.

 

Dmesg output after the crash is always something like this:

[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vmid:2 pasid:0)
[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27
[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134
....repeating....
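For anyone comparing fault records across runs, here is a minimal Python sketch that turns the three-line records above into dictionaries. The regular expressions are modelled only on the excerpt shown here, so other kernel versions may format these lines differently. The field name `from_id` is a placeholder of mine for the number after "from", whose meaning is not stated in the log.

```python
import re

# Patterns modelled on the dmesg excerpt above (an assumption, not a spec).
FAULT_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): \[(?P<hub>\w+)\] VMC page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)"
)
PAGE_RE = re.compile(r"at page (?P<page>0x[0-9a-fA-F]+) from (?P<from_id>\d+)")
STATUS_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(?P<status>0x[0-9a-fA-F]+)")


def parse_vmc_faults(log_text):
    """Return one dict per VMC page fault record found in log_text."""
    faults = []
    current = None
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            # A header line starts a new record.
            current = {
                key: (value if key in ("pci", "hub") else int(value))
                for key, value in m.groupdict().items()
            }
            faults.append(current)
            continue
        if current is None:
            continue  # ignore detail lines that precede any header
        m = PAGE_RE.search(line)
        if m:
            current["page"] = int(m.group("page"), 16)
            current["from_id"] = int(m.group("from_id"))
            continue
        m = STATUS_RE.search(line)
        if m:
            current["status"] = int(m.group("status"), 16)
    return faults


SAMPLE = """\
[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vmid:2 pasid:0)
[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27
[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134
"""

for fault in parse_vmc_faults(SAMPLE):
    print(fault["pci"], fault["ring"], fault["vmid"],
          hex(fault["page"]), hex(fault["status"]))
```

Feeding it the saved output of `dmesg` makes it easy to check whether the faulting page or status value changes between crashes.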

 

Because the driver installation instructions said to use the 18.20 bundle with kernel 4.15, we tried that version first, then versions 18.30/18.40/18.50, but the problem remains the same. We also tried another card (Vega 64) to rule out a hardware defect in the card.

 

Googling around leads to many posts about similar messages for Ryzen APUs and some bug reports that are still open, but no solution that seems to work for us.

 

Any ideas on how to troubleshoot this issue? Recommendations on a more stable OS+driver combo? Could I use the ROCm OpenCL stack with our hardware (it is not clear to me to what degree the requirement of PCIe atomics has been lifted)?

MW

7 Replies
dipak
Big Boss

Hi Michael,

Thank you for reporting it.

From the above description, it looks to me like a driver installation/compatibility issue. I see you have already reported the same in our "Driver and Software" support forum here: VMC page faults running linux OpenCl code on Radeon Pro WX9100.

I've reported it to the relevant team. Once I get their reply, I'll share it with you. You can also expect a reply directly from the team.

Thanks.

0 Likes

Any updates, @dipak? I am also interested in knowing the fix.

0 Likes

Can you please share the current setup information where you observed the same error?

Please note, the latest AMDGPU-Pro driver is available here: amdgpu-unified-linux-21-50-2 and supported OS versions are below:

  • Ubuntu 20.04.4 HWE
  • Ubuntu 18.04.5(6) HWE
  • RHEL/CentOS 7.9
  • RHEL/CentOS 8.5
  • RHEL/CentOS 8.4
  • SLED/SLES 15 SP 3

Please try this driver if you are not using it.
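As a rough illustration of checking a machine against the list above, the following Python sketch encodes the listed versions in a lookup table. The table layout and the function name `is_listed` are my own, "18.04.5(6)" is read as covering both 18.04.5 and 18.04.6, and the HWE kernel requirement is not checked at all.

```python
# Versions copied from the list above; the dict layout is an assumption.
LISTED_VERSIONS = {
    "ubuntu": {"20.04.4", "18.04.5", "18.04.6"},
    "rhel": {"7.9", "8.5", "8.4"},
    "centos": {"7.9", "8.5", "8.4"},
    "sled": {"15 SP 3"},
    "sles": {"15 SP 3"},
}


def is_listed(distro, version):
    """Return True if (distro, version) appears in the list above.

    This does not verify that an HWE kernel is installed.
    """
    return version in LISTED_VERSIONS.get(distro.lower(), set())


print(is_listed("Ubuntu", "18.04.6"))
print(is_listed("Ubuntu", "18.04.1"))
```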

Thanks.

 

0 Likes

I am using Ubuntu 18.04 with kernel 4.15.0-99-generic and Radeon Software for Linux 18.50. This is an OpenCL application.

kernel: amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32779, for process ListCameras pid 4002 thread ListCameras pid 4005)
kernel: amdgpu 0000:03:00.0:   in page starting at address 0x00000004049fe000 from 27
kernel: amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0060113D
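This newer log format names the faulting process, which helps when several OpenCL programs share the GPU. A small Python sketch for tallying faults per process follows. The pattern is modelled only on the single header line above and assumes process and thread names contain no spaces.

```python
import re
from collections import Counter

# Modelled on the header line above; names with spaces would not match.
PROC_RE = re.compile(
    r"VMC page fault \(.*?for process (?P<proc>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)\)"
)


def faults_per_process(log_text):
    """Count VMC page fault header lines per (process name, pid)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = PROC_RE.search(line)
        if m:
            counts[(m.group("proc"), int(m.group("pid")))] += 1
    return counts


SAMPLE = (
    "kernel: amdgpu 0000:03:00.0: [gfxhub] VMC page fault "
    "(src_id:0 ring:158 vmid:6 pasid:32779, for process ListCameras "
    "pid 4002 thread ListCameras pid 4005)\n"
)

print(faults_per_process(SAMPLE))
```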

 

Is 18.04.1 not supported by AMDGPU drivers anymore? Kernel 4.15 is currently supported by Ubuntu.

0 Likes

> I am using Ubuntu 18.04 with kernel 4.15.0-99-generic and radeon software for linux 18.50.

It seems you are using a very old driver, so I suggest trying the latest driver to see whether a fix is already available there.

> Is 18.04.1 not supported by AMDGPU drivers anymore? Kernel 4.15 is currently supported by Ubuntu.

It is recommended to install the driver on a compatible OS as mentioned in the release note. As per the AMDGPU-Pro 21-50-2 release note, the compatible Ubuntu versions are 20.04.4 HWE and 18.04.5(6) HWE. 

Thanks.

0 Likes

Thanks for your suggestion. I will try using the 18.04.6 kernel with amdgpu 21.50-2.

> It is recommended to install the driver on a compatible OS as mentioned in the release note

I would like to know more about the support lifecycle of Radeon software for Linux.

1. When AMD announces a new Radeon software for Linux version, are the older driver versions considered unsupported?

2. When a new version of Radeon software for Linux drops support for an older kernel (say, linux 4.15 on ubuntu 18.04.1), does that mean older linux distributions are permanently unsupported thereafter?

 

0 Likes

> 1. When AMD announces a new Radeon software for Linux version, are the older driver versions considered unsupported?

You can use an older driver as long as it works for you. However, if there was an issue in an older version, and it has been fixed in the current driver,  then you need to update your driver to get the fix. 

 

> When a new version of Radeon software for Linux drops support for an older kernel (say, linux 4.15 on ubuntu 18.04.1), does that mean older linux distributions are permanently unsupported thereafter?

Yes. Once a newer driver drops support for an older kernel, future driver releases will usually not be compatible with that kernel anymore. AMD drivers are designed to work best on up-to-date operating systems. When a new driver is released, its release notes list the compatible operating systems on which the driver is expected to work.