I have a total of 8 S7150x2 cards across 4 hosts, and I'm experiencing some host reboots that I'm trying to track down. I found the following log entries and was curious whether anybody has more information on them, or whether they could be the cause:
2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: idle_vf:952: [amdgpuv]: IDLE_GPU failed on VF3, status:0xff
2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: switch_vfs_step_by_step:1130: [amdgpuv]: IDLE_GPU failed on VF3
PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).
2018-07-09T17:59:08.404Z cpu24:66013)@BlueScreen: PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).
2018-06-22T04:47:44.769Z cpu47:66015)Failed to verify signatures of the following vib(s): [amdgpuv-cim]. All tardisks validated
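That last signature-verification message usually means the VIB was not signed at the host's configured acceptance level. If you want to confirm which amdgpuv VIB is installed and how the host is configured, something like the following (run in an ESXi shell; no host-specific names assumed beyond the `amdgpuv-cim` VIB named in the log) should show it. This can't be asserted automatically since it needs a live ESXi host, so treat it as a sketch:

```shell
# Show the host's software acceptance level
# (e.g. PartnerSupported vs. CommunitySupported)
esxcli software acceptance get

# Show details for the AMD MxGPU CIM provider VIB named in the log,
# including its own acceptance level and version
esxcli software vib get -n amdgpuv-cim

# Quick sanity check: list every installed VIB whose name contains "amd"
esxcli software vib list | grep -i amd
```

If the VIB's acceptance level is lower than the host's, that explains the "Failed to verify signatures" line, though that warning by itself is separate from the PSOD.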
Don't know if you have figured this out yet, but what we have been told is that certain versions of the Radeon Pro drivers contain code that can cause this issue when starting VMs on a host. We have had better experiences with the 18.Q3.1 driver, but there's a big caveat in how to deploy it.
Our deployment process is now this due to this issue:
1. Remove MxGPUs from all VMs on a host
2. Enter maintenance mode on the host.
3. Reboot the host.
4. Redeploy each VM. How much time/effort this takes depends on how you deploy your VMs: Full Clones, Linked Clones, etc. For each VM:
   - Cleanly uninstall the AMD drivers from the VM using the AMDCleanUninstallUtility, then shut down the VM.
   - Add a GPU back to the VM (ensure it is on the rebooted host only!).
   - Install the 18.Q3.1 driver.
   - Prep the VM for redeployment (on the rebooted host(s) only!). If versions mix, e.g. a VM with 18.Q2 or earlier starts on that host, the host can PSOD or crash.
5. Repeat for all VMs/hosts until everything is fully deployed on 18.Q3.1.
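For the host-side portion of steps 2-3, a minimal ESXi-shell sketch might look like the following (the MxGPU removal in step 1 and the per-VM driver work happen through vSphere and inside the guest, so they aren't shown here; this assumes you run it on each host in turn and can't be verified without a live host):

```shell
# Step 2: enter maintenance mode. All VMs with MxGPUs must already be
# powered off or migrated away (step 1), or this will not complete.
esxcli system maintenanceMode set --enable true

# Step 3: reboot the host. The --reason string is required and is logged.
esxcli system shutdown reboot --reason "AMD MxGPU 18.Q3.1 driver rollout"

# After the host comes back up, exit maintenance mode before
# redeploying VMs onto it (step 4).
esxcli system maintenanceMode set --enable false
```

Keeping the rebooted hosts isolated until their VMs are all on 18.Q3.1 is the important part; the commands themselves are standard esxcli.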
Hopefully this helps, and I hope AMD gets better at sharing important details like this going forward. It was not a fun moment when a guest driver update caused outages as the VMs booted up.
I haven't figured this out yet; my hosts still randomly reboot, and I've found they reboot more often the closer I get to maxing out a GPU card with assignments. I will definitely follow the steps above to see if it helps. My plan is to:
1. Delete all my linked clones and remove the GPU from the main image.
2. Put all my hosts into maintenance mode and then reboot.
3. Uninstall the GPU driver the way you mentioned and shut down the VM.
4. Add the GPU back to the main VM image, start it up, install the 18.Q3.1 driver, and reboot.