It has been a long week so I’m going to try my best to describe my problem and the steps I’ve taken to try and resolve it, but I’m sure I’ll miss a lot. I’ll reply with any additional information needed.
So I recently picked up a Radeon Instinct MI60 because I’m fed up with NVIDIA and I needed an AMD card that supports ROCm.
I was able to get the card working on bare metal, and it was absolutely incredible. I was very excited by how fast it was for my use cases. However, I need to deploy it in a datacenter inside a virtual machine, and this is where all the trouble begins.
(I’m using Proxmox for virtualization on an AMD EPYC 7601 system with a ROMED8-2T. Above 4G decoding is enabled and so is IOMMU)
If I pass an NVIDIA GPU into a virtual machine in the same PCIe slot, it works with absolutely zero issues. I just need to run “sudo apt install nvidia-driver-520” and reboot, and it works flawlessly.
I passed in the MI60 and ran “lspci”, and it shows up in the VM just as it would on the host, which seemed like a good sign. Installing the driver works just fine.
I reboot and run clinfo and rocminfo, and neither shows the GPU.
“sudo dkms status” shows the kernel module installed properly.
But running rocm-smi shows
Running “sudo lspci -vnn | grep amd” shows
“Kernel driver in use: amdgpu
Kernel modules: amdgpu”
Seems like a contradiction but I kept digging. Running “sudo dmesg | grep -i amdgpu” shows
Googling “amdgpu: Fatal error during GPU init” takes me down the route of setting up vfio settings on the host. I tried several guides for doing that, and none of them gave any different results.
The IOMMU group is set up with only the GPU in its own group with nothing else, and that group is being passed in to the virtual machine.
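For anyone who wants to compare against their own host, here is a minimal sketch of how the IOMMU grouping can be listed. It assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups; the function name is only illustrative, and this is not necessarily the exact command I used.

```shell
#!/bin/sh
# List every IOMMU group and the PCI devices inside it (run on the host).
# Assumption: groups are exposed under /sys/kernel/iommu_groups, which is the
# usual sysfs location; an alternative root can be passed as the first argument.
list_iommu_groups() {
    iommu_root="${1:-/sys/kernel/iommu_groups}"
    for iommu_group in "$iommu_root"/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -d "$iommu_group/devices" ] || continue
        for iommu_dev in "$iommu_group"/devices/*; do
            [ -e "$iommu_dev" ] || continue
            # Print "group <number>: <PCI address>" for each device.
            printf 'group %s: %s\n' "${iommu_group##*/}" "${iommu_dev##*/}"
        done
    done
}

list_iommu_groups "$@"
```

If the GPU's PCI address is the only entry printed for its group number, the group is isolated. Empty output would suggest the IOMMU is not actually active on the host.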
Operating system on the VM is Ubuntu 22.04, but I also tested with Ubuntu 20.04 and didn’t have any better luck.
I tried contacting AMD support, but they told me:
```
You have reached out to AMD customer care where we provide technical and warranty support for AMD products and technologies.
Since your query is related to AMD ROCm, please post your query on AMD ROCm community.
```
So here I am. Let me know if there’s anything more I can provide. I’m hoping someone is able to provide some help. It’s sad to have such a great GPU sitting unused and I desperately don’t want to go back to NVIDIA.