We have a Supermicro X10SRA-F motherboard and a 500 W PWS-505P-1H PSU.
There are approximately 150 of these servers with that configuration. Each uses 3 GPUs (AMD FirePro W4100), so that is 3 graphics cards per server. Each graphics card has 4 Mini DisplayPort to DVI adapters: DVI-D Dual Link 24+1 (F) to Mini DisplayPort (M).
We also use some PureLink FX D1050 units.
The OS is currently Linux RHEL 7.6, but we are open to updating to get a newer kernel. At the moment it is 3.10.0-957.5.1.
The system is mostly used to show a Firefox browser on 12 displays in train stations. It uses Xorg.
The problems we encounter are erratic "GPU stall" events and GPU crashes, which leave some of these displays black. When that occurs, the other W4100 cards still work.
We have been investigating for weeks and need your advanced input and opinion.
1) I discussed another case with Supermicro, and their contact had some BIOS and PCI slot recommendations for multi-GPU setups. Can you kindly give me the latest recommendations and explanations?
In particular, they said we have to enable "4G decoding" for multi-GPU.
I also wondered whether the PCI slots matter (the x16 and the two x8).
2) I read the power supply requirements for one W4100, and they say 500 W minimum. But we are using 3 of them. Do you think our Supermicro PSU is too weak?
Is there a place in the BIOS where I could diagnose that from any logs? Or maybe in the OS?
3) We haven't updated the Supermicro BIOS because there are no release notes. We requested them from the retailer, who asked Supermicro Sales. So far nothing. We have an old version, 2.0C; there is a newer 2.1.
4) We are using the radeon driver xorg-x11-drv-ati, version 18.1.0-1. We know there is a newer version (19.0.1-2), and we are planning to update to it.
5) Do you think it could be a problem if, for example, one card is in a PCI x16 slot and the others are in PCI x8 slots?
6) Here is an example of problems in the logs:
@timestamp: 2019-10-14T09:19:08.862Z
message: radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000b8b6d9f last fence id 0x000000000b8b70c5 on ring 3)
host: XXX
severity: warning
facility: kern
syslog-tag: kernel:
source: kernel
@timestamp_received: 2019-10-14T09:19:09.177Z
logsene_orig_type: events
and another one:
@timestamp: 2019-10-13T16:07:02.342Z
message: radeon 0000:03:00.0: GPU softreset: 0x0000004D
host: br-XXX
severity: info
facility: kern
syslog-tag: kernel:
source: kernel
@timestamp_received: 2019-10-13T16:07:02.671Z
logsene_orig_type: events
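To see whether the failures cluster on one slot or are spread across all three cards, the kernel messages could be tallied per PCI address. Below is a minimal Python sketch, assuming the log lines have been exported as plain text; the function name `tally_gpu_events` is made up for illustration, and the pattern only covers the two message types quoted above.

```python
import re
from collections import Counter

# Matches radeon kernel messages such as
#   "radeon 0000:02:00.0: GPU lockup (...)"
#   "radeon 0000:03:00.0: GPU softreset: 0x0000004D"
RADEON_EVENT = re.compile(
    r"radeon (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"GPU (?P<event>lockup|softreset)"
)


def tally_gpu_events(lines):
    """Count (PCI address, event type) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RADEON_EVENT.search(line)
        if match:
            counts[(match.group("pci"), match.group("event"))] += 1
    return counts


if __name__ == "__main__":
    # The two messages quoted in this post, used as sample input.
    sample = [
        "radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000b8b6d9f "
        "last fence id 0x000000000b8b70c5 on ring 3)",
        "radeon 0000:03:00.0: GPU softreset: 0x0000004D",
    ]
    for (pci, event), count in sorted(tally_gpu_events(sample).items()):
        print(pci, event, count)
```

If one PCI address dominates the counts across many servers, that would point toward a slot or card issue; an even spread would point more toward the driver or the power supply.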
If you have anything which can help, we are willing to listen.
Thanks in advance,
I'm wondering if you got any kind of resolution? Unfortunately, I don't really have better news for you, except to say that I share your pain.
I just upgraded to Fedora 31 with xorg-x11-drv-ati-19.0.1-3.fc31.x86_64, and I have an old card:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620 LE [Radeon HD 3450]
Even with the 19.0.1-3 driver, I experience periodic lockups, which I did not experience with Fedora 20. I see plenty of other folks with similar symptoms. It is sad that AMD doesn't maintain its drivers for older cards. At this point, I think I would rather try a different brand that does maintain its legacy products. At least, unlike you, I have only one card to worry about.
In any case, these are pretty much the same errors other people see:
Feb 18 14:56:39 succubus kernel: radeon 0000:01:00.0: ring 0 stalled for more than 10446msec
Feb 18 14:56:39 succubus kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000621bef last fence id 0x0000000000621c01 on ring 0)
Feb 18 14:56:39 succubus kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Feb 18 14:56:39 succubus kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
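For anyone comparing notes, the stall messages can be pulled out of a syslog file to see which ring stalled and for how long. This is only a rough Python sketch based on the message format shown above; `find_stalls` is a hypothetical helper name, and other kernel versions may word the message differently.

```python
import re

# Matches messages such as
#   "radeon 0000:01:00.0: ring 0 stalled for more than 10446msec"
STALL = re.compile(
    r"radeon (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"ring (?P<ring>\d+) stalled for more than (?P<msec>\d+)msec"
)


def find_stalls(lines):
    """Return a (pci_address, ring, stall_msec) tuple for each stall message."""
    stalls = []
    for line in lines:
        match = STALL.search(line)
        if match:
            stalls.append(
                (match.group("pci"), int(match.group("ring")), int(match.group("msec")))
            )
    return stalls


if __name__ == "__main__":
    sample = [
        "Feb 18 14:56:39 succubus kernel: radeon 0000:01:00.0: "
        "ring 0 stalled for more than 10446msec",
        "Feb 18 14:56:39 succubus kernel: radeon 0000:01:00.0: "
        "failed to get a new IB (-35)",
    ]
    for pci, ring, msec in find_stalls(sample):
        print(pci, "ring", ring, "stalled for", msec, "msec")
```

Lines that do not match the stall pattern, such as the "failed to get a new IB" message, are simply skipped.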
Once you get to this point, you are lucky if you can still save your work. In any case, I just wanted to share and see if you have news. For 150 servers, I hope AMD will at least do something for you.
Best of luck!