We are currently running an AMD FirePro S7150 x2 on vSphere 6.5. We configured 4 virtual graphics card "slots" on each of its two GPU cores, resulting in 8 "slots" in total.
We have the following setup:
Server: HPE XL250a Gen9
OS: VMware ESXi 6.5.0 7388607
Host GPU Driver: Radeon™ Pro Software for VMware vSphere® (ESXi™) 6.5, v1.05
virtual GPU Driver: Radeon™ Pro Software Enterprise Edition for Windows® 10 (64-bit), v18.Q1.1
After a while we started seeing weird behaviour from the graphics cards:
Suddenly some of the virtual slots stopped working. Every virtual machine that was started with one of the affected virtual graphics slots configured crashed as soon as the drivers were loaded. The only way to fix the virtual graphics slot was to restart the whole ESXi server. On one occasion when this issue occurred, the server itself stopped working and showed a purple screen of death with the message "PF Exception 14 in world 66174:amdgpuv_work ..."
Another weird issue was that one of the cores suddenly stopped working entirely. We were able to assign its virtual graphics slots to virtual machines and could boot them successfully, but the card was not shown as active in the Windows Device Manager and reported the following error message: "Windows has stopped this device because it has reported problems. (Code 43)"
Is anyone else facing issues like this?
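For anyone comparing crash screens, here is a minimal Python sketch for checking whether a PSOD headline names an amdgpuv world, as in the message quoted above. The pattern is derived only from that one message, so the general shape of the line is an assumption, and the function names are made up for illustration.

```python
import re

# Shape "PF Exception <n> in world <id>:<name>" is inferred from the single
# PSOD message quoted in this thread (assumption: other ESXi builds may
# word the headline differently).
PSOD_RE = re.compile(r"PF Exception (\d+) in world (\d+):(\S+)")


def parse_psod_line(line):
    """Return (exception_number, world_id, world_name), or None if no match."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def is_amdgpuv_crash(line):
    """True if the crashing world's name starts with 'amdgpuv'."""
    parsed = parse_psod_line(line)
    return parsed is not None and parsed[2].startswith("amdgpuv")
```

Feeding it the headline text copied from a PSOD screenshot or log is enough to tell whether two hosts are crashing in the same world.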
I am seeing nearly identical issues/symptoms, but I am using XenServer on a Dell R730.
I have a ticket with AMD support, but they are taking a long time between responses and have not provided any useful info in almost a week and a half.
I may scrap the whole S7150x2 plan and switch to Nvidia. I don't want to pay their fee for the driver, but at least they have decent support.
It is amazing that after almost 2 years MxGPU still appears not to be ready for production use.
Maybe fsadough will help us; looking at these forums, he seems to be the guru on these things.
Do you have a ticket number? Are you an end-user or a company?
Have you tried the solution in this thread: https://community.amd.com/message/2847680#comment-2847680?q=1.0.5
Actually, we had to set <collectGpuStats> false </collectGpuStats>, because every 10 days the ESXi host timed out when responding to vCenter heartbeats, which gave us odd HA failover events. Unfortunately, it did not resolve the PSOD problems.
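For anyone who wants to script that setting, here is a minimal Python sketch that forces a collectGpuStats element to false in a piece of XML. This thread does not say which file or parent element the setting belongs to, so the sketch takes the XML as a string and, if the element is missing, appends it directly under the root; that placement and the function name are assumptions, not documented behaviour.

```python
import xml.etree.ElementTree as ET


def disable_gpu_stats(xml_text):
    """Return xml_text with <collectGpuStats> set to false.

    Assumption: if the element does not exist yet, it is added as a direct
    child of the root element. The real config may expect it elsewhere.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//collectGpuStats")
    if node is None:
        node = ET.SubElement(root, "collectGpuStats")
    node.text = "false"
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree discards XML comments when parsing, so keep a backup of the original file and compare the result before putting it on a host.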
If you are a company, please PM me your contact information.
Same issue here. OK, it has been a long time since the last interaction on this issue, but even 1 year later it has obviously not been solved?! I was able to find a way to make this problem reproducible: set 4 partitions per GPU, use Windows 10 with the latest driver in vSphere 6.5 (latest version) on an ESXi 6.5 host (latest version), and install the benchmark application "Performance Test" by PassMark Pty (I used version 8.0).
Run the 3D DirectX 11 test, and you have a high chance of getting a PSOD followed by a server restart. I accessed the VM via the Horizon View software.
This issue was not reproducible when the GPUs were partitioned into 5 or 8 partitions per GPU?!
So there seems to be an issue when 4 partitions per GPU are set, but this is of course just an assumption based on observation. I would appreciate it if other users could test with 5 instead of 4 partitions and monitor the behaviour. Please give feedback about your experience.
#2019/05/07 Note (a few days later): Partitioning into more than 4 per GPU did not solve this issue for us; it just takes a little more time until the server crashes with a PSOD. The driver seems to be absolutely unstable. I re-read the manual again and again but cannot find anything I could have forgotten....?!
Please AMD, could you give us a feedback about this issue!!!
#2019/12/07 Note: After repartitioning the GPUs a few times, going back to 4 partitions per GPU, and re-setting the pcihole entries automatically by using sh mxgpuinstall.sh -c, the system seems, at first glance, to be stable?! Multiple benchmarks are now running like a charm without the sudden server restarts experienced 1 week before, and working with complex 3D models in AutoCAD now works without interruptions. The strange thing: I didn't change anything on the system side or within the VM, except the mentioned repartitioning of the GPUs?!
I will look into it with more VMs and give feedback here if I get closer to the root cause of the problem...
Hi Dackermann
We seem to be experiencing the same issues, with our servers randomly crashing. We have been in contact with Dell, and they have found in the logs that the issue is with the AMD graphics cards. I have raised a ticket with AMD, but they are very slow to respond, as other people have mentioned.
Could you please provide an update on the fix you have applied, and maybe tell me where I would need to change the pcihole settings? We have 7 partitions per card.
Thank you in advance for any help you can provide.
Hi Hawes29,
the issue is still not gone 😞, so the root cause has still not been found. And yes, there is definitely a relation to the driver framework of the S7150x2 AMD cards. We could observe that 3D-enabled applications like Firefox mostly cause a server crash. We are currently monitoring the behavior with changed settings (3D acceleration disabled for most applications that do not need 3D capabilities, like IE, Firefox, Chrome, Acrobat, and so on).
And yes, AMD did not help to follow up on this issue. They told me to contact HPE Services. The result: HPE confirmed that the crash is caused by the AMD graphics cards, i.e. the VMware AMD driver.
I do not understand why AMD seems so uninterested in solving this issue. Getting it working could be a game changer for the future of modern desktop delivery scenarios.
Hi Dackermann
Thank you for your reply.
We use Dell R740 Servers, and Dell support has provided the logs that pinpoint the problem with the AMD cards. I have provided AMD with these logs, and they have now escalated the issue, and they are working with their developers. I will keep you posted.
hawes29, did you ever get to the bottom of this? We are seeing these problems and cannot find any answers.
Are you still having issues with this card or were they solved? We're experiencing a lot of Purple Screens with these cards and Citrix XenApp.