
GPU pass-through error in single GPU configuration of RADEON RX580

Question asked by masato@yoshi.dnsalias.com on Mar 30, 2020

With GPU passthrough in a single-GPU configuration, the first passthrough after boot succeeds, but the second passthrough fails if the amdgpu driver is reloaded in between. Passthrough does succeed, however, when the PCI-E link speed is set to Gen2.

motherboard : BIOSTAR X570GT8
The same problem also occurs on an H370 motherboard.
CPU : Ryzen 3600
Memory : 32GB
GPU : Radeon RX580
OS : Ubuntu 19.10
Kernel : 5.4.21

 

NG Case1
-----------------------------------------------------------
1. Load amdgpu kernel module
modprobe amdgpu

2. Disable vtconsole
echo 0 > /sys/class/vtconsole/vtcon1/bind

3. Unload amdgpu kernel module
modprobe -r amdgpu

4. Load amdgpu kernel module
modprobe amdgpu

The following message appears:
amdgpu 0000:0c:00.0: GPU pci config reset
[drm] GPU posting now ...

5. Disable vtconsole
echo 0 > /sys/class/vtconsole/vtcon1/bind

6. Unload amdgpu kernel module
modprobe -r amdgpu

7. Start VM
virsh start Win10

The following message appears:
vfio-pci 0000:0c:00.1: vfio_bar_restore: reset recovery - restoring BARs
AMD-Vi: Completion-Wait loop timed out
iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0c:00.0 address=0x80b59f680]

-----------------------------------------------------------
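The steps above write to vtcon1 directly, but which vtconsole belongs to the framebuffer can differ between machines. A minimal sketch for looking it up instead, assuming the usual sysfs layout where each /sys/class/vtconsole/vtcon*/name file describes the console (for example "(M) frame buffer device"). The function name find_fb_vtconsoles and its optional root argument are hypothetical and only exist for illustration and testing.

```shell
#!/bin/sh
# Print the "bind" file of every vtconsole whose name mentions the
# frame buffer. The optional first argument overrides the sysfs root
# (hypothetical parameter, defaults to /sys) so the lookup can be
# tried against a fake directory tree.
find_fb_vtconsoles() {
    root="${1:-/sys}"
    for vt in "$root"/class/vtconsole/vtcon*; do
        # Skip entries without a readable name file (also covers an unmatched glob).
        [ -r "$vt/name" ] || continue
        if grep -q "frame buffer" "$vt/name"; then
            echo "$vt/bind"
        fi
    done
}
```

Usage (as root): for b in $(find_fb_vtconsoles); do echo 0 > "$b"; done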

 

NG Case2
-----------------------------------------------------------
1. Load amdgpu kernel module
modprobe amdgpu

2. Disable vtconsole
echo 0 > /sys/class/vtconsole/vtcon1/bind

3. Unload amdgpu kernel module
modprobe -r amdgpu

4. Start VM
virsh start Win10

5. Stop VM

6. Disable vtconsole
echo 0 > /sys/class/vtconsole/vtcon1/bind

7. Unload amdgpu kernel module
modprobe -r amdgpu

8. Start VM
virsh start Win10

The following message appears:
vfio-pci 0000:0c:00.1: vfio_bar_restore: reset recovery - restoring BARs
AMD-Vi: Completion-Wait loop timed out
iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0c:00.0 address=0x80b59f680]


If steps 6 and 7 are skipped after stopping the VM in step 5, the VM starts in step 8 without any error message.
-----------------------------------------------------------
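When walking through these steps it can help to confirm which kernel driver (amdgpu, vfio-pci, or none) currently owns the GPU before each VM start. A minimal sketch, assuming the standard sysfs "driver" symlink under /sys/bus/pci/devices/. The function name pci_bound_driver and the optional root argument are hypothetical; the address 0000:0c:00.0 is taken from the log messages above.

```shell
#!/bin/sh
# Print the name of the driver bound to a PCI device, or "none".
# $1: PCI address such as 0000:0c:00.0
# $2: sysfs root (hypothetical override for testing, defaults to /sys)
pci_bound_driver() {
    dev="$1"
    root="${2:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

Usage: pci_bound_driver 0000:0c:00.0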

 

NG Case3
-----------------------------------------------------------
Patch amdgpu so that it does not reset the ASIC.
---
kernel/v5.4.21/linux-5.4.21/drivers/gpu/drm/amd/amdgpu/vi.c
static int vi_asic_reset(struct amdgpu_device *adev)
{
	int r;

	amdgpu_atombios_scratch_regs_engine_hung(adev, true);

	/* Experiment: skip the PCI config reset and report success instead. */
	// r = vi_gpu_pci_config_reset(adev);
	r = 0;

	amdgpu_atombios_scratch_regs_engine_hung(adev, false);

	return r;
}
---

With this change, the VM in NG Case1 starts normally.
1. Stop VM

2. Disable vtconsole
echo 0 > /sys/class/vtconsole/vtcon1/bind

The GPU pci config reset is suppressed, but for some reason the following message still appears:
[drm] GPU posting now...

4. Unload amdgpu kernel module
modprobe -r amdgpu

5. Start VM
virsh start Win10

The following message appears:
vfio-pci 0000:0c:00.1: vfio_bar_restore: reset recovery - restoring BARs
AMD-Vi: Completion-Wait loop timed out
iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0c:00.0 address=0x80b59f680]

-----------------------------------------------------------
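All three NG cases end with the same IOTLB_INV_TIMEOUT event, so a quick filter over the kernel log can tell whether a given run hit the failure. A minimal sketch that reads log text on standard input and prints the affected device addresses; the function name iotlb_timeout_devices is hypothetical, and the pattern is based only on the message format quoted above.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print each distinct device address
# that appears in an IOTLB_INV_TIMEOUT event. Prints nothing when the
# log contains no such event.
iotlb_timeout_devices() {
    sed -n 's/.*IOTLB_INV_TIMEOUT device=\([0-9a-fA-F:.]*\).*/\1/p' | sort -u
}
```

Usage: dmesg | iotlb_timeout_devices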

OK Case1
-----------------------------------------------------------
If amdgpu is not reloaded, the VM can be restarted without problems.
However, input from the console is then not possible.
-----------------------------------------------------------


OK Case2
-----------------------------------------------------------
If the PCI-E link speed is fixed to Gen1 or Gen2 in the UEFI BIOS (Gen3 and Auto do not work),
the VM can be restarted without problems even if amdgpu is reloaded.
However, fixing the PCI-E link speed to Gen2 also limits devices other than the GPU to Gen2.
-----------------------------------------------------------
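To check which link speed the GPU actually negotiated after changing the UEFI setting, the current_link_speed attribute in sysfs can be read. A minimal sketch that maps the reported transfer rate to a PCI-E generation (2.5 GT/s is Gen1, 5 GT/s is Gen2, 8 GT/s is Gen3, 16 GT/s is Gen4). The function name pcie_link_gen and the optional root argument are hypothetical, and the reported value may change with the card's power state, so treat a single reading with some caution.

```shell
#!/bin/sh
# Print the PCI-E generation of a device's current link.
# $1: PCI address such as 0000:0c:00.0
# $2: sysfs root (hypothetical override for testing, defaults to /sys)
# Returns 1 if the device does not expose current_link_speed.
pcie_link_gen() {
    dev="$1"
    root="${2:-/sys}"
    f="$root/bus/pci/devices/$dev/current_link_speed"
    [ -r "$f" ] || return 1
    speed=$(cat "$f")
    # The exact string differs between kernel versions
    # (e.g. "8 GT/s" or "8.0 GT/s PCIe"), so match on the prefix only.
    case "$speed" in
        2.5*) echo "Gen1 ($speed)" ;;
        5*)   echo "Gen2 ($speed)" ;;
        8*)   echo "Gen3 ($speed)" ;;
        16*)  echo "Gen4 ($speed)" ;;
        *)    echo "unknown ($speed)" ;;
    esac
}
```

Usage: pcie_link_gen 0000:0c:00.0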
