We are trying to run scientific OpenCL codes under Ubuntu 16.04.5 (kernel 4.15) on Vega 10 based cards (Radeon Pro WX 9100, Vega 64). The card is installed in an HP Z620 workstation with a Xeon processor from the Sandy Bridge generation (no PCIe atomics). Therefore, we use the Radeon Pro releases (amdgpu-pro bundles) and the PAL OpenCL driver, as recommended on the driver download pages. Our codes are the systemtests_ of the IDTxl toolbox (GitHub: pwollstadt/IDTxl). These codes ran fine on R9 290X cards using the old fglrx drivers, and they also run on our NVIDIA cards.
On the Vega 10 based cards the system hangs (ssh access still works) after a random amount of time: typically a few hours, sometimes a day, sometimes only a few minutes. A remote shutdown -r succeeds in some cases, but not always.
The dmesg output after the crash always looks something like this:
[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vmid:2 pasid:0)
[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27
[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134
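In case it helps with comparing faults across runs, here is a minimal Python sketch of how the three lines above could be collected into one record per fault. The regular expressions are written against the exact lines quoted here only; the function name parse_faults and the field names are our own, and other kernel or driver versions may format these messages differently.

```python
import re

# Patterns are derived only from the three dmesg lines quoted in this post.
# Assumption: other kernel/driver versions may print these messages differently.
FAULT_RE = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): \[(?P<hub>\w+)\] VMC page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)\)"
)
PAGE_RE = re.compile(r"at page (?P<page>0x[0-9a-fA-F]+) from (?P<client>\d+)")
STATUS_RE = re.compile(
    r"VM_L2_PROTECTION_FAULT_STATUS:(?P<status>0x[0-9a-fA-F]+)"
)


def parse_faults(text):
    """Return one dict per 'VMC page fault' header found in dmesg text.

    The 'at page' and 'FAULT_STATUS' lines that follow a header are merged
    into the record of the most recent header.
    """
    faults = []
    for line in text.splitlines():
        header = FAULT_RE.search(line)
        if header:
            faults.append(header.groupdict())
            continue
        if not faults:
            continue  # detail line without a preceding header, skip it
        detail = PAGE_RE.search(line) or STATUS_RE.search(line)
        if detail:
            faults[-1].update(detail.groupdict())
    return faults


# The sample is the dmesg excerpt from this post, unchanged.
SAMPLE = """\
[ +0.000004] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:154 vmid:2 pasid:0)
[ +0.000029] amdgpu 0000:07:00.0: at page 0x00000003eda00000 from 27
[ +0.000017] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00201134
"""

if __name__ == "__main__":
    for fault in parse_faults(SAMPLE):
        print(fault)
```

Feeding the saved dmesg output of several crashes through this would show whether the vmid, ring and status values are always the same or differ between hangs.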
Since the driver installation instructions say to use the 18.20 bundle with kernel 4.15, we tried that version first, then versions 18.30/18.40/18.50, but the problem remains the same. We have also tried another card (Vega 64) to rule out a hardware defect in the card.
Googling around leads to many posts about similar messages on Ryzen APUs and to some bug reports that are still open, but to no solution that seems to work for us.
Any ideas on how to troubleshoot this issue? Recommendations on a more stable OS + driver combo? Could we use the ROCm OpenCL stack with our hardware (it is not clear to us to what degree the requirement for PCIe atomics has been lifted)?