We have two new Dell R730 servers with identical hardware and software configurations.
I suspect a memory corruption issue in the amdgpuv-1.0.5OEM driver. We see strange behaviour on our hosts whenever the MxGPU driver is installed:
1. (Very often) An internal ESXi process (mostly hostd) crashes unexpectedly (stops responding).
2. (Rarely) The server reboots unexpectedly (PF 14). Here is an example:
It makes no difference whether we are running MxGPU-enabled VMs or not.
About two weeks ago I started an experiment: I left the amdgpu driver installed on one host and uninstalled it from the other. The first server continues to crash (about one hostd crash per day, plus some random errors in other services), while the second server has now run without any issues for about 15 days. Both servers carry a similar workload (about 17 VMs each, DRS-enabled), and their datastores, paths and other configuration settings are the same. The only difference is the AMD driver.
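As a rough sanity check on that comparison, here is a minimal sketch of how unlikely 15 quiet days would be by pure chance. It assumes that crashes arrive independently at a constant rate of about one per day (a simple Poisson model, which is my assumption and not something I have verified); the variable and function names are only illustrative.

```python
import math

# Assumed inputs, taken from the observations above.
crashes_per_day = 1.0  # roughly one hostd crash per day on the host with the driver
quiet_days = 15        # days the host without the driver has run without issues


def prob_no_crashes(rate_per_day: float, days: float) -> float:
    """Poisson probability of seeing zero events in `days` at `rate_per_day`."""
    return math.exp(-rate_per_day * days)


p_no_crash = prob_no_crashes(crashes_per_day, quiet_days)
print(f"P(no crash in {quiet_days} days by chance) = {p_no_crash:.2e}")
```

Under that assumption the probability comes out well under one in a million, so the clean run on the second host is very unlikely to be luck. It does not prove that the driver is the cause, only that the two hosts really do behave differently.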
Config of our servers:
- ESXi 6.5.0 Update 1 Patch 38 (build-7526125) (latest available)
- Dell PowerEdge R730
- Two Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz
- BRCM 10GbE 2P 57810S-t Adapter
- BRCM GbE 4P 5720-t rNDC (integrated)
- PERC H330 Mini (Embedded)
- Two 1100W PSU
- AMD S7150x2 adapter
- Horizon View 7.3.2
- Windows 10 1709 VMs
BIOS version: 2.6.0 (i.e. without Intel's latest fixes; I have already tried the newer 2.7.0 and it made no difference)
Please help. These crashes are really annoying!