2 Replies Latest reply on Oct 11, 2018 11:39 AM by rklingaman@carroll.edu

    AMD GPU S7150 x2 Host Instability Issues

    rklingaman@carroll.edu

      I have a total of 8 S7150 x2 cards across 4 hosts, and I'm experiencing some host reboots that I'm trying to track down. I found the following logs and am curious whether anybody has more information on them and whether they could point to the issue:

       

      2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: idle_vf:952: [amdgpuv]: IDLE_GPU failed on VF3, status:0xff

      2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: switch_vfs_step_by_step:1130: [amdgpuv]: IDLE_GPU failed on VF3

      PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).

      2018-07-09T17:59:08.404Z cpu24:66013)@BlueScreen: PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).

      2018-06-22T04:47:44.769Z cpu47:66015)Failed to verify signatures of the following vib(s): [amdgpuv-cim]. All tardisks validated
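
      One way to see how often these messages line up with a crash is to tally them from a saved copy of the log. The following is only a minimal sketch: the regular expressions are inferred from the few lines quoted above, not from any documented log format, and the names `summarize` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this post (an assumption,
# not a documented format).
AMDGPUV_RE = re.compile(
    r"^(?P<ts>\S+) cpu\d+:\d+\)amdgpuv_log: (?P<func>\w+):\d+: "
    r"\[amdgpuv\]: (?P<msg>.*)$"
)
PSOD_RE = re.compile(r"^(?P<ts>\S+) cpu\d+:\d+\)@BlueScreen: (?P<msg>.*)$")
VF_RE = re.compile(r"IDLE_GPU failed on VF(\d+)")

# The lines quoted above, used as sample input.
SAMPLE = [
    "2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: idle_vf:952: [amdgpuv]: IDLE_GPU failed on VF3, status:0xff",
    "2018-07-09T17:58:58.924Z cpu8:75342)amdgpuv_log: switch_vfs_step_by_step:1130: [amdgpuv]: IDLE_GPU failed on VF3",
    "PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).",
    "2018-07-09T17:59:08.404Z cpu24:66013)@BlueScreen: PCPU 28 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 28).",
]


def summarize(lines):
    """Count IDLE_GPU failures per VF index and collect @BlueScreen entries."""
    vf_failures = Counter()
    psods = []
    for raw in lines:
        line = raw.strip()
        m = AMDGPUV_RE.match(line)
        if m:
            vf = VF_RE.search(m.group("msg"))
            if vf:
                vf_failures[int(vf.group(1))] += 1
            continue
        m = PSOD_RE.match(line)
        if m:
            psods.append((m.group("ts"), m.group("msg")))
    return vf_failures, psods


if __name__ == "__main__":
    failures, psods = summarize(SAMPLE)
    for vf_index, count in sorted(failures.items()):
        print(f"VF{vf_index}: {count} IDLE_GPU failure(s)")
    for ts, msg in psods:
        print(f"PSOD at {ts}: {msg}")
```

      Feeding it the full log from each host instead of `SAMPLE` would show whether the IDLE_GPU failures always precede a PSOD or also appear on their own.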

        • Re: AMD GPU S7150 x2 Host Instability Issues
          tmerrill@mariosinacola.com

          I don't know whether you have figured this out yet, but we have been told that certain versions of the Radeon Pro drivers contain code that may cause this issue when VMs start on a host. We have had better experiences with the 18.Q3.1 driver, but there is a big caveat in how to deploy it.

          Because of this issue, our deployment process is now:

          1. Remove MxGPUs from all VMs on a host

          2. Enter maintenance mode on the host.

          3. Reboot the host.

          4. How much time and effort this step takes depends on how you deploy your VMs (Full Clones, Linked Clones, etc.):

          • Clean-uninstall the AMD drivers from the VM(s) using the AMDCleanUninstallUtility, then shut down the VM
          • Add a GPU back to this VM (ensure it is on the rebooted host only!)
          • Install the 18.Q3.1 driver
          • Prep the VM for re-deployment (on the rebooted host(s) only!). If it mixes with earlier versions, e.g. a VM started on that host with 18.Q2 or earlier, it can cause the host to PSOD or crash.
          • Repeat for all VMs / Hosts until fully deployed with 18.Q3.1
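
          Since the risk comes from older guest drivers starting on an already updated host, it can help to check an inventory before powering VMs back on. This is only a rough sketch: the `inventory` dictionary is a hypothetical, hand-maintained mapping of host to VM to guest driver version (it is not pulled from vSphere or any AMD tool), and the version parsing assumes release names shaped like `18.Q3.1` or `18.Q2`.

```python
TARGET = "18.Q3.1"


def parse_release(version):
    """Turn '18.Q3.1' into (18, 3, 1); a missing patch level counts as 0."""
    parts = version.split(".")
    year = int(parts[0])
    quarter = int(parts[1].lstrip("Qq"))
    patch = int(parts[2]) if len(parts) > 2 else 0
    return (year, quarter, patch)


def unsafe_hosts(inventory, target=TARGET):
    """Return {host: [vm, ...]} for VMs whose guest driver is older than target."""
    floor = parse_release(target)
    flagged = {}
    for host, vms in inventory.items():
        old = sorted(vm for vm, ver in vms.items() if parse_release(ver) < floor)
        if old:
            flagged[host] = old
    return flagged


if __name__ == "__main__":
    # Hypothetical inventory; host and VM names are made up.
    inventory = {
        "esx01": {"vm-a": "18.Q3.1", "vm-b": "18.Q2"},
        "esx02": {"vm-c": "18.Q3.1"},
    }
    for host, vms in unsafe_hosts(inventory).items():
        print(f"{host}: still on an older driver: {', '.join(vms)}")
```

          Any host that shows up in the result still has a VM that needs the clean uninstall and 18.Q3.1 install before it is started there.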

           

          Hopefully this helps, and I hope AMD gets better at sharing important details like this going forward. It was not a fun moment when the guest driver update caused outages as the VMs booted up.

            • Re: AMD GPU S7150 x2 Host Instability Issues
              rklingaman@carroll.edu

              I haven't figured this out yet. My hosts still reboot randomly, and I have found that they reboot more often the closer I get to maxing out a GPU card with assignments. I will definitely follow the steps above to see if they help. My plan is to:

               

              Delete all my linked clones and remove the GPU from the main image.

              Put all my hosts into maintenance mode and then reboot them.

              Uninstall the GPU driver the way you mentioned and shut down the VM.

              Add the GPU back to the main VM image, start it up, install the 18.Q3.1 driver, and reboot.

              Snapshot and redeploy main image to all hosts.