
jcoiner

Ubuntu + ROCm 1.6 + Bristol Ridge success

This post describes the issues I ran into, and the workaround for them, while getting ROCm 1.6 working on a Bristol Ridge system. tl;dr: lots of things would hang until I disabled MSI for the amdgpu kernel module; after that, everything worked fine.

The system:

- Asrock A320M Pro4 motherboard, bios 3.00, with IOMMUv2 enabled in the BIOS. This board allows the OS to recognize and configure the IOMMUv2.

- Bristol Ridge A8-9600

- Ubuntu 16.04 updated with "hwe" kernel and Xorg

- ROCm 1.6 including the ROCK 4.11 kernel

The symptoms:

- Virtual console switching is quite slow: about 10 s to switch to another console with ALT plus a function key

- Xorg hangs almost immediately or is glacially slow. The desktop never finishes drawing.

- Warnings on the console about vblank and pflip timeouts

- Warnings in dmesg like "[CRTC:40] vblank wait timed out"

- Warnings in dmesg like "do_IRQ: 0.147 No irq handler for vector".

- The sample "vector_copy" HSA program distributed with ROCm sometimes hangs after "Dispatching the kernel succeeded", and sometimes runs ok.
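
To check whether a system shows the same two dmesg symptoms, a small script can count matching log lines. This is only a sketch: the regular expressions are built from the two messages quoted above, and the function names (`classify_dmesg`, `scan_dmesg`) are made up for illustration. Other kernel versions may word these messages differently.

```python
import re
import subprocess

# Patterns derived from the two dmesg messages quoted in this post.
# The CRTC id and the IRQ vector numbers vary, so they are matched loosely.
PATTERNS = {
    "vblank_timeout": re.compile(r"\[CRTC:\d+\] vblank wait timed out"),
    "unhandled_irq": re.compile(r"do_IRQ: \d+\.\d+ No irq handler for vector"),
}


def classify_dmesg(lines):
    """Count how many log lines match each symptom pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_dmesg():
    """Run dmesg and classify its output (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return classify_dmesg(result.stdout.splitlines())
```

Calling `scan_dmesg()` on the affected machine should report non-zero counts if the same messages are present. `classify_dmesg()` can also be fed lines from a saved log file.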

The fix was to disable MSI. The amdgpu module has an "msi" parameter which, when set to 0, disables MSI interrupt routing. This fixed all of the above symptoms and allowed Xorg and HSA apps to run reliably.
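
For anyone wanting to apply the same setting, here is a rough sketch of one way to do it. It uses the standard modprobe.d `options` syntax to build the line, and it reads the current value back from sysfs. The sysfs path is an assumption: it only exists if amdgpu is loaded and exports the parameter as readable. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the parameter; only present when the amdgpu
# module is loaded and exposes "msi" as a readable parameter.
MSI_PARAM = Path("/sys/module/amdgpu/parameters/msi")


def modprobe_option(msi: int = 0) -> str:
    """Return a modprobe.d line that sets the amdgpu 'msi' parameter."""
    return f"options amdgpu msi={msi}\n"


def current_msi_setting(path: Path = MSI_PARAM):
    """Return the current parameter value as a string, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

The string from `modprobe_option()` would go into a file under /etc/modprobe.d/ (for example a file named amdgpu-msi.conf, a name chosen here for illustration). If the module is loaded from the initramfs, that probably needs to be regenerated before rebooting. After the reboot, `current_msi_setting()` should return "0" if the option took effect.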

For more detail, see this (now closed) bug against ROCm: Bristol Ridge.- Asrock A320M PRO4 system: vector_copy runs OK from console, but hangs in an ssh term...

Here's what I know:

- The HSA runtime polls for completion for a short time before sleeping. If the GPU finished its work during the poll, vector_copy would not hang. If the GPU did not finish during the poll, the host program would call hsaKmtWaitOnEvent() which tells the kernel module to awaken the thread after a GPU interrupt. That interrupt would never arrive at the amdgpu driver, causing the program to hang.

- The dmesg entry "do_IRQ: 0.147 No irq handler for vector" is bad news. It means some device is trying to interrupt the CPU, but interrupt setup has gone wrong somehow, so the OS cannot route the interrupt to the corresponding device driver.
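
The poll-then-sleep behaviour in the first point explains why the hang was intermittent. Here is a toy model of it, not the HSA runtime's actual code: `wait_for_completion` is a made-up name, a `threading.Event` stands in for the GPU interrupt, and the timings are arbitrary.

```python
import threading
import time


def wait_for_completion(is_done, interrupt_event,
                        poll_seconds=0.01, sleep_timeout=None):
    """Spin on is_done() briefly, then block until interrupt_event is set.

    Returns "polled" if the work finished during the spin phase,
    "interrupt" if the event woke the waiter, or "timeout" if the
    sleep phase ended without the event being set.
    """
    deadline = time.monotonic() + poll_seconds
    while time.monotonic() < deadline:
        if is_done():
            return "polled"
    # Sleep phase: stands in for hsaKmtWaitOnEvent() in the real runtime.
    # With sleep_timeout=None this blocks forever if the event never fires,
    # which models the hang caused by the lost interrupt.
    if interrupt_event.wait(timeout=sleep_timeout):
        return "interrupt"
    return "timeout"
```

If the work finishes inside the poll window, the interrupt is never needed, which matches the runs where vector_copy completed. If it does not, the waiter depends entirely on the interrupt arriving. A finite `sleep_timeout` in this sketch just makes the lost-interrupt case observable instead of hanging.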

Open question: why don't MSI interrupts work well for this GPU device in this system? My guess is that the fault isn't with the amdgpu module itself, as it does almost nothing with the 'msi' bit other than relay it to the generic PCI code in the Linux kernel. I haven't tried to debug this further.

Hope this helps someone.
