Some GPUs crash (GPU lockup, GPU stale) in a multi-GPU setup with three FirePro W4100 cards on Linux
Hi, we have the Supermicro X10SRA-F motherboard with the 500 W PWS-505P-1H PSU. There are approximately 150 servers with this configuration. Each server has three graphics cards (AMD FirePro W4100). Each card has four Mini DisplayPort to DVI adapters: DVI-D Dual Link 24+1 (female) to Mini DisplayPort (male).
We also use some PureLink FX D1050.
The OS is currently RHEL 7.6 with kernel 3.10.0-957.5.1, but we are open to updating to get a newer kernel.
The system is mostly used to show a Firefox browser on 12 displays in train stations. It uses Xorg.
The problems we encounter are erratic "GPU stale" events and GPU crashes, which leave some of the displays black. When that occurs, the other W4100 cards keep working.
We have been investigating for weeks and need your expert input and opinion.
1) I discussed another case with Supermicro, and they had some BIOS and PCI slot recommendations for multi-GPU setups. Could you kindly give me the latest recommendations and explanations? In particular, they said we have to enable "4G decoding" for multi-GPU.
I also wondered whether the PCI slots matter (the x16, x8, x8 slots).
2) I read the power supply requirements for one W4100: it says minimum 500 W. But we are using three of them. Do you think our Supermicro PSU is too weak? Is there a place in the BIOS where I could diagnose that from logs? Maybe in the OS?
3) We haven't updated the Supermicro BIOS because there are no release notes. We requested them from the retailer, who asked Supermicro Sales. So far nothing. We are on the old version 2.0C; there is a newer 2.1.
4) We are using the radeon driver xorg-x11-drv-ati version 18.1.0-1. We know there is a newer version (19.0.1-2) and we are planning to update to it.
5) Do you think it could be a problem if, for example, one card is in a PCI x16 slot and the others are in x8 slots?
6) Here is an example of the problem in the logs:
@timestamp: 2019-10-14T09:19:08.862Z
message: radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000b8b6d9f last fence id 0x000000000b8b70c5 on ring 3)
host: XXX
severity: warning
facility: kern
syslog-tag: kernel:
source: kernel
@timestamp_received: 2019-10-14T09:19:09.177Z
logsene_orig_type: events

and another one:

@timestamp: 2019-10-13T16:07:02.342Z
message: radeon 0000:03:00.0: GPU softreset: 0x0000004D
host: br-XXX
severity: info
facility: kern
syslog-tag: kernel:
source: kernel
@timestamp_received: 2019-10-13T16:07:02.671Z
logsene_orig_type: events
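In case it helps with triage across the 150 servers, here is a minimal Python sketch for pulling the PCI address, ring and fence ids out of radeon messages like the two above. The function name and the dict layout are my own invention, the regexes only cover the two message shapes quoted in this post, and treating the difference between the two fence ids as "work submitted but not yet completed" is my reading of the message, not something I have confirmed in the driver source.

```python
import re

# Matches a PCI address such as 0000:02:00.0 (domain:bus:device.function).
_PCI = r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"

# These two patterns cover only the message shapes quoted in this post.
LOCKUP_RE = re.compile(
    r"radeon " + _PCI + r": GPU lockup "
    r"\(current fence id (?P<current>0x[0-9a-f]+) "
    r"last fence id (?P<last>0x[0-9a-f]+) on ring (?P<ring>\d+)\)",
    re.IGNORECASE,
)
SOFTRESET_RE = re.compile(
    r"radeon " + _PCI + r": GPU softreset: (?P<mask>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def parse_radeon_event(line):
    """Return a dict describing a radeon lockup/softreset line, or None."""
    m = LOCKUP_RE.search(line)
    if m:
        current = int(m.group("current"), 16)
        last = int(m.group("last"), 16)
        return {
            "kind": "lockup",
            "pci": m.group("pci"),
            "ring": int(m.group("ring")),
            "current_fence": current,
            "last_fence": last,
            # Assumption: last minus current approximates outstanding work.
            "fence_gap": last - current,
        }
    m = SOFTRESET_RE.search(line)
    if m:
        return {
            "kind": "softreset",
            "pci": m.group("pci"),
            "reset_mask": int(m.group("mask"), 16),
        }
    return None
```

Feeding every kernel log line through `parse_radeon_event` and counting results per `pci` value would show whether the lockups cluster on one slot or are spread over all three cards.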
If you have anything that can help, we would be glad to hear it.
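P.S. Regarding questions 1 and 5, one way we could report the negotiated PCIe link width per card is sketched below. It assumes the running kernel exposes `current_link_width` and `max_link_width` under `/sys/bus/pci/devices/<address>/`; I am not sure the RHEL 3.10 kernel does, in which case the `LnkSta` line of `lspci -vv` would be the alternative. The function name and the `base` parameter are hypothetical and exist only so the code can be pointed at a test directory.

```python
from pathlib import Path


def read_link_widths(pci_addresses, base="/sys/bus/pci/devices"):
    """Map each PCI address to (current_width, max_width).

    A value is None when the sysfs attribute is missing or not numeric,
    which is expected on kernels that do not expose these files.
    """
    result = {}
    for addr in pci_addresses:
        dev = Path(base) / addr
        widths = []
        for name in ("current_link_width", "max_link_width"):
            try:
                widths.append(int((dev / name).read_text().strip()))
            except (OSError, ValueError):
                widths.append(None)
        result[addr] = tuple(widths)
    return result
```

Calling `read_link_widths(["0000:02:00.0", "0000:03:00.0"])` would cover the two addresses that appear in the log lines; the address of the third card would have to be added.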