Don't know if you've figured this out yet, but what we've been told is that certain versions of the Radeon Pro drivers contain code that can cause this issue when starting VMs on a host. We've had better luck with the 18.Q3.1 driver, but there is a big caveat in how you deploy it.
Because of this issue, our deployment process is now:
1. Remove MxGPUs from all VMs on the host.
2. Enter maintenance mode on the host.
3. Reboot the host.
4. How much time/effort this step takes depends on how you deploy your VMs (Full Clones, Linked Clones, etc.):
- Clean uninstall the AMD drivers from the VM(s) using the AMDCleanUninstallUtility, then shut down the VM
- Add a GPU back to the VM (make sure it is on the rebooted host only!)
- Install the 18.Q3.1 driver
- Prep the VM for re-deployment (on the rebooted host(s) only!). If driver versions mix (e.g. a VM that ran on that host with 18.Q2 or earlier), the host can PSOD or crash.
- Repeat for all VMs/hosts until everything is deployed with 18.Q3.1
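For what it's worth, the host-side part of the steps above can be scripted from the ESXi shell. This is only a rough sketch under my assumptions about the environment: it assumes the MxGPU passthrough devices were already removed from the VMs in the vSphere client, and the driver uninstall/install still has to happen inside the Windows guest.

```shell
#!/bin/sh
# Sketch of the host-side steps, run from the ESXi shell.
# Assumes the MxGPU (PCI passthrough) devices have already been
# removed from every VM on this host.

# Power off any VMs still running on this host.
# vim-cmd vmsvc/getallvms prints a header row, then one row per VM
# with the numeric VM ID in the first column.
for vmid in $(vim-cmd vmsvc/getallvms | awk 'NR>1 {print $1}'); do
    if vim-cmd vmsvc/power.getstate "$vmid" | grep -q "Powered on"; then
        vim-cmd vmsvc/power.off "$vmid"
    fi
done

# Enter maintenance mode, then reboot the host.
esxcli system maintenanceMode set --enable true
reboot
```

This only covers powering things down and rebooting; adding the GPU back and installing 18.Q3.1 in the guest is still a manual (or PowerCLI) step.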
Hopefully this helps, and I hope AMD gets better at sharing important details like this going forward. It was not a fun moment when a guest driver update caused outages as the VMs booted up.
I haven't figured this out yet; my hosts still randomly reboot, and I've found they reboot more often the closer I get to maxing out a GPU card with assignments. I will definitely follow the above to see if it helps me out. My plan will be to:
Delete all my linked clones and remove the GPU from the main image.
Put all my hosts into maintenance mode and then reboot them.
Uninstall the GPU driver the way you mentioned and shut down the VM.
Add the GPU back to the main VM image, start it up, then install the 18.Q3.1 driver and reboot.
Snapshot and redeploy main image to all hosts.