We have two new Dell R730 servers with the same hardware and software configuration.
I suspect memory corruption issues in the amdgpuv-1.0.5OEM driver. Our hosts behave strangely whenever the MxGPU driver is installed:
1. (Very often) An ESXi internal process (mostly hostd) crashes unexpectedly (stops responding).
2. (Rare) The server unexpectedly reboots (PF 14). Here is an example:
It makes no difference whether we are running MxGPU-enabled VMs or not.
About two weeks ago I started an experiment: I left the amdgpu driver installed on one host and uninstalled it from the other. The first server continues to crash (about one hostd crash per day, plus some random errors in services), but the second server has now been running without any issue for about 15 days. Both servers have a similar workload (about 17 VMs each, DRS-enabled), and the datastores, paths and other configuration settings are the same. The only difference is the AMD driver.
Config of our servers:
BIOS version: 2.6.0 (i.e. without the latest Intel fixes; I already tried the new 2.7.0 and it made no difference)
Please help. Crashes are really annoying!
We have the same issue (only #1): VMware stops responding (only a reset of the whole system works).
Our config:
Please try this driver set:
https://support.amd.com/en-us/download/workstation?os=VMware%20vSphere%20ESXi%206.5#pro-driver
We are going to try the 18.Q1 version, but we are already running 1.05 of the driver for ESXi 6.5.
We tried 18.Q1, but while the driver is being installed the whole system hangs again (only a hard reset helps).
P.S. I noticed some posts regarding a firmware update. Is there an update? Part no.: 100-505722
We just tested 18.Q1 on a clean OS install (VM environment) and it worked just fine. You might want to try a clean install.
Could you please clarify which hypervisor version you use and which host driver version is installed?
We don't think "our" crashes (hangs) come from the Windows driver (17Q4 / 18Q1). They seem to have more to do with the amdgpuv-1.0.5 host driver, because the host hangs (not every time) instantly when starting a VM, even before Windows is loaded.
We installed 18Q1 successfully, but the crashes remain: a full ESXi hang (about twice a week) when we try to boot a system to which an MxGPU has been assigned (a random VDI machine).
Today I got confirmation from VMware that my PSOD 14 is a bug and that it will be fixed in 6.5.0 Update 2. They say the bug is rare and that the AMD driver is catalysing this behaviour.
The bug was introduced in ESXi 6.5 U1 build-5969303; the previous build was ESXi 6.5.0d build-5310538.
Could you try installing this older ESXi build and see whether it makes any difference?
(I will try to install this build later on Sunday, but your system crashes much more often than mine, so we will get results faster.)
We don't have any PSOD 14, but we can try it.
We have the same issue. VMware support provided us with this workaround while waiting for 6.5 Update 2 to be released.
Connect with PuTTY (SSH) to an ESXi host with an S7150X card installed
Navigate to /etc/vmware/hostd
vi config.xml
**Note: Before making the changes below, please take a backup of the config.xml file.**
- Navigate to the section below -
<plugins>
<statssvc> (this section should already exist)
- Add the following line within the <statssvc> section and before </statssvc> -
<collectGpuStats> false </collectGpuStats>
After adding this line, save the config.xml file and restart the hostd service
/etc/init.d/hostd restart
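For anyone who would rather script the edit than do it by hand in vi, here is a minimal sketch of the same steps. The function name disable_gpu_stats is made up for illustration and is not something VMware provided. It assumes that the closing </statssvc> tag sits on its own line in config.xml and that awk is available in the host's shell; please check both on your own host first.

```shell
# Hypothetical helper (not from VMware): add the collectGpuStats line
# to a hostd config file, keeping a backup of the original.
# Assumes </statssvc> is on a line of its own.
disable_gpu_stats() {
    cfg="$1"
    # Nothing to do if the setting is already present.
    if grep -q '<collectGpuStats>' "$cfg"; then
        echo "collectGpuStats already present in $cfg"
        return 0
    fi
    # Refuse to edit a file that lacks the expected section.
    if ! grep -q '</statssvc>' "$cfg"; then
        echo "no </statssvc> closing tag found in $cfg" >&2
        return 1
    fi
    # Take the backup first, as the note above asks.
    cp "$cfg" "$cfg.bak" || return 1
    # Print the new element just before the first </statssvc> line,
    # then print every original line unchanged.
    awk '/<\/statssvc>/ && !done { print "<collectGpuStats> false </collectGpuStats>"; done=1 } { print }' \
        "$cfg.bak" > "$cfg"
}
```

On the host it would be called as disable_gpu_stats /etc/vmware/hostd/config.xml, followed by the same /etc/init.d/hostd restart as above. If something goes wrong, the untouched original is in config.xml.bak.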
This is very interesting! Today I had a chat session with a VMware support representative about this problem. They asked me whether I had installed the old 6.5.0d build (sorry, I hadn't), but they didn't mention such a simple workaround.
I have already changed the config file and will report back on progress.
Anyway, thank you for sharing it with us!
We added <collectGpuStats> false </collectGpuStats>, but we still get a full hang (about 10% of the time; the other 90% it goes OK), and only when we start a machine with an MxGPU.