We are trying to run scientific OpenCL codes under Ubuntu 16.04.5 (kernel 4.15) on Vega 10 based cards (Radeon Pro WX 9100, Vega 64). We use the Radeon Pro releases (amdgpu-pro bundles), as recommended on the driver download pages. Our codes are the system tests of the IDTxl toolbox (pwollstadt/IDTxl on GitHub).
These codes ran fine on R9 290X cards using the old fglrx drivers, and they also run on our fleet of NVIDIA cards.
On the Vega 10 based cards the system hangs after a random amount of time, typically a few hours, sometimes a day. ssh access still works while the system is hung.
Dmesg output is always something like this:
[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vmid:2 pasid:0)
[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27
[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134
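In case it helps anyone compare their logs with ours, here is a minimal Python sketch that pulls the fields out of fault messages like the ones above. The regular expressions are our own guess, based only on the three lines quoted here, and the function name `parse_faults` is made up for illustration. Other driver versions may format these messages differently.

```python
import re

# Patterns inferred only from the three dmesg lines quoted above (assumption);
# other amdgpu versions may print these messages differently.
FAULT_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): \[(?P<hub>\w+)\] VMC page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)\)"
)
PAGE_RE = re.compile(r"at page (?P<page>0x[0-9a-fA-F]+) from (?P<client>\d+)")
STATUS_RE = re.compile(
    r"VM_L2_PROTECTION_FAULT_STATUS:(?P<status>0x[0-9a-fA-F]+)"
)


def parse_faults(lines):
    """Group page-fault lines into one dict per fault (hypothetical helper)."""
    faults = []
    current = None
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            # A "VMC page fault" line starts a new fault record.
            current = match.groupdict()
            faults.append(current)
            continue
        if current is None:
            continue  # ignore detail lines seen before any fault header
        for pattern in (PAGE_RE, STATUS_RE):
            match = pattern.search(line)
            if match:
                current.update(match.groupdict())
                break
    return faults


SAMPLE = [
    "[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault "
    "(src_id:0 ring:154 vmid:2 pasid:0)",
    "[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27",
    "[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134",
]

if __name__ == "__main__":
    for fault in parse_faults(SAMPLE):
        print(fault)
```

To run it against a live log, the same function can be fed the saved output of `dmesg` line by line. Counting how often each `vmid` or `page` value recurs across hangs might show whether the faults always hit the same address range.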
Because the driver installation instructions say to use the 18.20 bundle with kernel 4.15, we tried that version first, then versions 18.30, 18.40 and 18.50, but the problem remains the same. We have also tried another card (Vega 64) to rule out broken hardware.
Googling around leads to many posts about similar messages on Ryzen APUs and to some bug reports that are still open, but to no solution that works for us.
Any ideas on how to troubleshoot this issue?
(Pretty disappointing to have a Pro card for 1700 EUR, a recommended OS (Ubuntu 16.04), an official driver for the Pro card, and still no working system...)