I have a total of 8 S7150x2 cards across 4 hosts, and I'm experiencing some host reboots that I'm trying to track down. I found the following log entries and was curious whether anybody has more information on them and whether they could be related to the issue:
2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: idle_vf:952: [amdgpuv]: IDLE_GPU failed on VF3, status:0xff
2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: switch_vfs_step_by_step:1130: [amdgpuv]: IDLE_GPU failed on VF3
PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).
2018-07-09T17:59:08.404Z cpu24:66013)@BlueScreen: PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).
2018-06-22T04:47:44.769Z cpu47:66015)Failed to verify signatures of the following vib(s): [amdgpuv-cim]. All tardisks validated
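For anyone trying to correlate these messages across several hosts, here is a minimal Python sketch that counts IDLE_GPU failures per VF and lists PCPU lockup events. The regular expressions are derived only from the lines quoted above, so the real log format may vary between driver and ESXi versions. The names `scan_vmkernel_log` and `SAMPLE` are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Patterns are assumptions based solely on the log lines quoted in this thread.
IDLE_FAIL = re.compile(
    r"amdgpuv_log: idle_vf:\d+: \[amdgpuv\]: IDLE_GPU failed on VF(\d+), status:(0x[0-9a-fA-F]+)"
)
LOCKUP = re.compile(r"@BlueScreen: PCPU (\d+) locked up")
TIMESTAMP = re.compile(r"^(\S+Z) ")


def scan_vmkernel_log(lines):
    """Return (IDLE_GPU failure count per VF, list of (timestamp, pcpu) lockups)."""
    idle_failures = Counter()
    lockups = []
    for line in lines:
        ts_match = TIMESTAMP.match(line)
        timestamp = ts_match.group(1) if ts_match else None
        idle = IDLE_FAIL.search(line)
        if idle:
            idle_failures[int(idle.group(1))] += 1
            continue
        lockup = LOCKUP.search(line)
        if lockup:
            lockups.append((timestamp, int(lockup.group(1))))
    return idle_failures, lockups


# Sample input copied from the log excerpt above.
SAMPLE = [
    "2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: idle_vf:952: [amdgpuv]: IDLE_GPU failed on VF3, status:0xff",
    "2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: switch_vfs_step_by_step:1130: [amdgpuv]: IDLE_GPU failed on VF3",
    "2018-07-09T17:59:08.404Z cpu24:66013)@BlueScreen: PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).",
]

if __name__ == "__main__":
    failures, lockups = scan_vmkernel_log(SAMPLE)
    print("IDLE_GPU failures per VF:", dict(failures))
    print("PCPU lockups:", lockups)
```

To run it against a real log, pass an open file handle instead of `SAMPLE`. If an IDLE_GPU failure shows up a few seconds before each lockup, as in the excerpt above, that would suggest the two are connected, though it does not prove it.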
Don't know if you have figured this out yet, but we have been informed that certain versions of the Radeon Pro drivers contain code that may cause this issue when VMs are started on a host. We have had better experiences with the 18.Q3.1 driver, but there is a big caveat in how to deploy it.
Because of this issue, our deployment process is now:
1. Remove MxGPUs from all VMs on the host.
2. Enter maintenance mode on the host.
3. Reboot the host.
4. How much time and effort this step takes depends on how you deploy your VMs (Full Clones, Linked Clones, etc.).
Hopefully this helps, and I hope AMD gets better at sharing important details like this going forward. It was not a fun moment when a guest driver update caused outages as the VMs booted up.