Not sure whether you have figured this out yet, but here is what we have been told: certain versions of the Radeon Pro drivers contain code that can trigger this issue when VMs are started on a host. We have had better experiences with the 18.Q3.1 driver, but there is a big caveat in how you deploy it.
Because of this issue, our deployment process is now:
1. Remove MxGPUs from all VMs on a host.
2. Enter maintenance mode on the host.
3. Reboot the host.
4. How much time and effort this step takes depends on how you deploy your VMs (Full Clones, Linked Clones, etc.):
- Clean-uninstall the AMD drivers from the VM(s) using the AMDCleanUninstallUtility, then shut down the VM
- Add a GPU back to this VM (ensure it is on the rebooted host only!)
- Install the 18.Q3.1 driver
- Prep the VM for re-deployment (on the rebooted host(s) only!). If it ends up mixed with earlier versions, e.g. a VM started on that host with 18.Q2 or earlier, it can cause the host to PSOD or crash.
- Repeat for all VMs and hosts until everything is fully deployed with 18.Q3.1
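
If it helps anyone, the "don't mix versions on a host" rule above can be sanity-checked before powering VMs back on. This is only a rough sketch, not anything from AMD or VMware: the inventory format (a list of dicts with `vm`, `host`, `driver`, `has_gpu`) and both function names are made up for illustration, and you would have to fill the inventory from your own tooling. It assumes driver versions look like `18.Q2` or `18.Q3.1`.

```python
import re
from collections import defaultdict

TARGET_DRIVER = "18.Q3.1"


def parse_driver_version(text):
    """Turn '18.Q3.1' or '18.Q2' into a comparable tuple such as (18, 3, 1)."""
    match = re.fullmatch(r"(\d+)\.Q(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised driver version: {text!r}")
    year, quarter, patch = match.groups()
    return (int(year), int(quarter), int(patch or 0))


def hosts_with_mixed_drivers(inventory, target=TARGET_DRIVER):
    """Return hosts where GPU VMs older than `target` share the host
    with GPU VMs at `target` or newer."""
    target_key = parse_driver_version(target)
    seen = defaultdict(set)
    for vm in inventory:
        # VMs with their MxGPU removed (step 1) are not counted.
        if not vm["has_gpu"]:
            continue
        is_old = parse_driver_version(vm["driver"]) < target_key
        seen[vm["host"]].add("old" if is_old else "new")
    return sorted(host for host, kinds in seen.items() if kinds == {"old", "new"})


# Hypothetical inventory, hand-written for the example.
inventory = [
    {"vm": "vdi-01", "host": "esx01", "driver": "18.Q2", "has_gpu": True},
    {"vm": "vdi-02", "host": "esx01", "driver": "18.Q3.1", "has_gpu": True},
    {"vm": "vdi-03", "host": "esx02", "driver": "18.Q3.1", "has_gpu": True},
    {"vm": "vdi-04", "host": "esx03", "driver": "18.Q2", "has_gpu": False},
    {"vm": "vdi-05", "host": "esx03", "driver": "18.Q3.1", "has_gpu": True},
]

print(hosts_with_mixed_drivers(inventory))
```

Any host this prints still has an older-driver VM with a GPU attached next to an 18.Q3.1 VM, so I would hold off on starting VMs there until the older ones have been through step 4.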
Hopefully this helps, and I hope AMD gets better at sharing important details like this going forward. It was not a fun moment when a guest driver update caused outages as the VMs booted up.