ROCm Discussions

LotisHunters
Journeyman III

Radeon Instinct MI60 in virtual machine

It has been a long week so I’m going to try my best to describe my problem and the steps I’ve taken to try and resolve it, but I’m sure I’ll miss a lot. I’ll reply with any additional information needed.

So I recently picked up a Radeon Instinct MI60 because I’m fed up with NVIDIA and I needed an AMD card that supports ROCm.

I was able to get the card working on bare metal, and it was absolutely incredible. I was very excited about how fast it was for my use cases. However, I need to deploy it in a datacenter inside a virtual machine, and this is where all the trouble begins.

(I’m using Proxmox for virtualization on an AMD EPYC 7601 system with a ROMED8-2T. Above 4G decoding is enabled and so is IOMMU)

If I pass an NVIDIA GPU into a virtual machine from the same PCIe slot, it works with absolutely zero issues. I just need to run “sudo apt install nvidia-driver-520” and reboot, and it works flawlessly.

I passed in the MI60 and ran “lspci”, and it shows up in the VM just as it would on the host, which seemed like a good sign. Installing the driver works just fine.

I reboot and run clinfo and rocminfo, and neither shows the GPU.

“sudo dkms status” shows the kernel module loaded properly.

But running rocm-smi shows

81343893-F9CB-45B4-816A-E87ED634EE43.png

Running “sudo lspci -vnn | grep amd” shows

436F37FF-B591-43F0-A4CC-613F18DCE7A3.png

Kernel driver in use: amdgpu

Kernel modules: amdgpu
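
In case it helps anyone repeat this check without reading the screenshot, here is a minimal Python sketch that pulls the “Kernel driver in use” value for each device out of lspci's verbose output. The function names are my own for illustration, and the parsing assumes the usual layout where each device header starts at column 0 and its detail lines are indented.

```python
import re
import subprocess


def kernel_drivers(lspci_text):
    """Map each device header line to its 'Kernel driver in use' value, or None."""
    drivers = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device entry.
            current = line.strip()
            drivers[current] = None
        elif current is not None:
            match = re.match(r"\s*Kernel driver in use:\s*(\S+)", line)
            if match:
                drivers[current] = match.group(1)
    return drivers


def read_lspci():
    """Run 'lspci -vnn' and return its text output (may need sudo for full detail)."""
    result = subprocess.run(
        ["lspci", "-vnn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `kernel_drivers(read_lspci())` inside the VM and looking for the AMD entry should show the same “amdgpu” binding as above. A device whose value is `None` has no driver bound at all.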

Seems like a contradiction but I kept digging. Running “sudo dmesg | grep -i amdgpu” shows

5169F7BF-8A5E-4D6C-8476-934BD7B8BC0A.png
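
Since the log is only attached as an image, here is a rough sketch of how the relevant lines could be pulled out of saved dmesg text for pasting into a reply. The function name and the keyword list are my own assumptions, not anything from the driver documentation.

```python
def amdgpu_errors(dmesg_text, keywords=("fatal", "error", "failed")):
    """Return dmesg lines that mention amdgpu and contain one of the keywords."""
    hits = []
    for line in dmesg_text.splitlines():
        lower = line.lower()
        if "amdgpu" in lower and any(word in lower for word in keywords):
            hits.append(line)
    return hits
```

For example, after saving the log with “sudo dmesg > dmesg.txt”, passing the contents of that file to `amdgpu_errors` would keep a line like “amdgpu: Fatal error during GPU init” and drop routine amdgpu messages that contain none of the keywords.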

Googling “amdgpu: Fatal error during GPU init” takes me down the route of configuring vfio on the host. I tried several guides for doing that, and none of them gave any different results.

The GPU is alone in its IOMMU group, with nothing else in it, and that group is being passed into the virtual machine.
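
For reference, this is roughly how the grouping can be checked on the host. It is a minimal sketch that reads the standard sysfs layout under /sys/kernel/iommu_groups. The function names are my own, and the PCI address in the usage note is a placeholder, not my actual device.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of PCI addresses} read from sysfs."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled or not exposed by the kernel: nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[group_dir.name] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def group_of(pci_address, groups):
    """Return (group number, devices) for the group containing pci_address, or None."""
    for name, devices in groups.items():
        if pci_address in devices:
            return name, devices
    return None


def is_isolated(pci_address, groups):
    """True if pci_address is the only device in its IOMMU group."""
    found = group_of(pci_address, groups)
    return found is not None and found[1] == [pci_address]
```

On the Proxmox host, something like `is_isolated("0000:01:00.0", iommu_groups())` (with the GPU's real address substituted) should return `True` if the card really is alone in its group.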

Operating system on the VM is Ubuntu 22.04, but I also tested with Ubuntu 20.04 and didn’t have any better luck.

I tried contacting AMD support, but they told me:

```
You have reached out to AMD customer care where we provide technical and warranty support for AMD products and technologies.

Since your query is related to AMD ROCm, please post your query on AMD ROCm community.
```

So here I am. Let me know if there’s anything more I can provide. I’m hoping someone is able to provide some help. It’s sad to have such a great GPU sitting unused and I desperately don’t want to go back to NVIDIA.

2 Replies

I notice there isn't much activity in the AMD ROCm forum.

This link to the AMD ROCm "Prerequisite Actions" page might be helpful, unless you have already read it: https://docs.amd.com/en-US/bundle/ROCm-Installation-Guide-v5.2/page/Prerequisite_Actions.html

You might also want to open an issue on the ROCm GitHub repository here: https://github.com/RadeonOpenCompute/ROCm/issues

From the GitHub page there was this link for installing the AMD driver with a script: https://amdgpu-install.readthedocs.io/en/latest/install-script.html

I posted the above link because maybe there was a problem with the installation process.

Here is the latest AMD GPU driver for the MI60 from AMD Download page: https://www.amd.com/en/support/server-accelerators/amd-instinct/amd-instinct-mi-series/instinct-mi60

NOTE: As you can tell, I have no experience with ROCm or MI60 accelerator cards.

0 Likes

I appreciate the reply!

I have tried all of those steps listed but you’re right that opening a GitHub thread/issue might be the best next step. I’ll go ahead and do that later this weekend.