andrey_ra
Journeyman III

S7150 driver suspected of memory corruption (amdgpuv-1.0.5OEM, ESXi 6.5)

We have two new Dell R730 servers with identical hardware and software configuration.

I suspect memory corruption in the amdgpuv-1.0.5OEM driver. Our hosts behave strangely whenever the MxGPU driver is installed:

1. (Very often) An ESXi internal process (mostly hostd) crashes unexpectedly (stops responding).

2. (Rarely) The server unexpectedly reboots (PF 14). Here is an example:

ESXi1_2018-01-25T021004_0300.png

It makes no difference whether we are running MxGPU-enabled VMs or not.

About two weeks ago I started an experiment: I left the amdgpu driver installed on one host and uninstalled it from the other. The first server continues to crash (about one hostd crash per day, plus some random errors in services), but the second server has now run without any issue for about 15 days. Both servers have a similar workload (about 17 VMs each, DRS enabled), and the datastores, paths, and other configuration settings are the same. The only difference is the AMD driver.

Config of our servers:

  • ESXi 6.5.0 Update 1 Patch 38 (build-7526125) (latest available)
  • Dell PowerEdge R730
  • Two Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz
  • BRCM 10GbE 2P 57810S-t Adapter
  • BRCM GbE 4P 5720-t rNDC (integrated)
  • PERC H330 Mini (Embedded)
  • Two 1100W PSU
  • AMD S7150x2 adapter
  • Horizon View 7.3.2
  • Windows 10 1709 VMs

BIOS version: 2.6.0 (i.e. without Intel's latest fixes; I already tried the new 2.7.0, no difference)


Please help. Crashes are really annoying!

1 Solution

We have the same issue. VMware support provided us with this workaround while we wait for 6.5 Update 2 to be released.

Connect with PuTTY (SSH) to an ESXi host that has an S7150X card installed

Navigate to /etc/vmware/hostd

vi config.xml

Note: Before making the changes below, take a backup of the config.xml file.

Navigate to the section below:

<plugins>

      <statssvc> (this section should already exist)

Add the following line within the <statssvc> section, before </statssvc>:

<collectGpuStats> false </collectGpuStats>

After adding this line, save the config.xml file and restart the hostd service:

/etc/init.d/hostd restart
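
The backup and verification steps around that edit could be scripted roughly as below. This is only a sketch: the config path and restart command are the ones quoted above, while the helper names `backup_config` and `gpu_stats_disabled` and the `.bak` suffix are made up for illustration. The check only confirms that the line is present in the file; it does not verify that it sits inside the <statssvc> section.

```shell
#!/bin/sh
# Sketch only. The path comes from the workaround above; the function
# names and the .bak suffix are hypothetical choices, not VMware tooling.
CONFIG="${CONFIG:-/etc/vmware/hostd/config.xml}"

# Copy the config file next to itself before editing it.
backup_config() {
    cp "$1" "$1.bak"
}

# Succeed only if the file contains <collectGpuStats> set to false.
# Does not check that the element is nested inside <statssvc>.
gpu_stats_disabled() {
    grep -q '<collectGpuStats> *false *</collectGpuStats>' "$1"
}

# Typical order on the host (run manually):
#   backup_config "$CONFIG"
#   vi "$CONFIG"        # add the line inside <statssvc>
#   gpu_stats_disabled "$CONFIG" && /etc/init.d/hostd restart
```

Restarting hostd only after the check passes avoids a restart when the edit was not actually saved.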


13 Replies
psict
Journeyman III

We have the same issue (only #1): VMware stops responding, and only a reset of the whole system helps.

Our config:

  • Supermicro SuperServer 1028GR-TRT (BIOS 2.0c)
  • S7150X2 (partnr: 100-505722)
  • VMware ESXi, 6.5.0, 6765664
  • Horizon View 7.3.2
  • Windows 10 1703

Super

fsadough
Moderator

We are going to try the 18.Q1 version, but we are already running version 1.0.5 of the driver for ESXi 6.5.

psict
Journeyman III

We tried 18.Q1, but while the driver is being installed the whole system hangs again (only a hard reset helps).

P.S. I noticed some posts regarding a firmware update. Is there an update for part number 100-505722?


We just tested 18.Q1 on a clean OS install (VM environment) and it worked just fine. You might want to try a clean install.


Could you please clarify which hypervisor version you use and which host driver version is installed?


I don't think "our" crashes (hangs) come from the Windows driver (17Q4 / 18Q1). They have more to do with the amdgpuv-1.0.5 host driver, because the host hangs (not every time) instantly when starting a VM, even before Windows has loaded.

psict
Journeyman III

We installed 18Q1 successfully, but the crashes remain: a full ESXi hang (about twice a week) when we try to boot a system that has an MxGPU assigned (a random VDI machine).


Today I got confirmation from VMware that my PSOD 14 is a bug and that it will be fixed in 6.5.0 Update 2. They say that this bug is rare and that the AMD driver is catalysing this behaviour.

The bug was introduced in ESXi 6.5 U1 (build-5969303); the previous build was ESXi 6.5.0d (build-5310538).

Could you try installing this old ESXi build and see whether there is any difference?

(I will try to install this build later on Sunday, but your system crashes much faster than mine, so we will get results sooner.)
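
To tell quickly which side of that boundary a host is on, a check along these lines might help. It is a rough sketch: the build number is the one quoted above, the helper names are hypothetical, and the sample string format ("VMware ESXi 6.5.0 build-7526125") is an assumption about what `vmware -v` prints on the host. The comparison is only meaningful within the 6.5 line.

```shell
#!/bin/sh
# Build number quoted in the post above (ESXi 6.5 U1).
BUG_INTRODUCED_BUILD=5969303

# Extract the numeric build from a version string such as
# "VMware ESXi 6.5.0 build-7526125" (format assumed, not verified here).
build_number() {
    printf '%s\n' "$1" | sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Succeed if the version string names a build at or after the one
# where the bug is said to have been introduced.
is_affected_build() {
    b=$(build_number "$1")
    [ -n "$b" ] && [ "$b" -ge "$BUG_INTRODUCED_BUILD" ]
}

# On the host (run manually):
#   is_affected_build "$(vmware -v)" && echo "build is 6.5 U1 or later"
```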


We don't have any PSOD 14, but we can try it.


This is very interesting! Today I had a message session with a VMware support representative about this problem. They asked me whether I had installed the old 6.5.0d build (sorry, I hadn't), but they didn't mention such a simple workaround.

I have already changed the config file and will report back on progress.

Anyway, thank you for sharing it with us!


We added <collectGpuStats> false </collectGpuStats>, but we still get a full hang about 10% of the time (90% of the time it goes OK), and only when we start a machine with an MxGPU.
