Hello,
I have a machine that we've been running for about a year now with dual W6800 graphics cards using RHEL 9.x (currently 9.1). Everything has been working fine. We recently upgraded to the W7800 cards and have not been able to get them to work for OpenCL at all. We have tried the latest drivers (which do not list support for the W7800) as well as the 23.Q3 drivers (found via Google), which do list support for the W7800 (and OpenCL). Here are the issues:
1) Numerous install/uninstall attempts all resulted in the computer simply not detecting the W7800s via OpenCL. Errors are printed during the bootup process.
2) After installing with the OpenCL "use case" (via amdgpu-install) the computer no longer boots at all. It appears that shortly after the start of the bootup process an overtemperature error is printed to the screen and the computer turns off entirely. It is hard to tell whether this is the exact error because the machine turns off so fast that I don't have time to read it carefully. After removing the W7800s, the computer boots fine. The cards are not hot at all, so that is clearly not the issue. I am assuming that the driver is not reading the correct registers on the W7800 and is thus detecting excessively high temperatures and pulling the fire alarm.
I replaced the W7800s with W6800s again and everything is fine, the cards are detected properly. At the moment the W7800s are essentially paperweights for us.
Any advice would be appreciated. Also, what is the schedule for proper support/drivers in Linux for W7800 and OpenCL (i.e., without having to use a driver that is effectively an isolated branch of the main line)?
Please try this driver:
Radeon™ Software for Linux® 23.20 Release Notes | AMD
Hello,
I had partial success but not complete success so I will describe what I did here:
0) Initial state is that the computer is working with two W6800 cards in it.
1) Remove the current amdgpu instance and install 23.20 (linked above). Rebooted and verified that the W6800s are working with clinfo.
2) Remove one W6800 and replace it with one W7800 (now the computer has one of each card).
3) Reboot and see one W6800 and one W7800 with clinfo.
4) Turn off computer and install second W7800, check with clinfo and zero devices are found.
5) Check dmesg and see "amdgpu: Fatal error during GPU init" (among a few other non-error messages which may or may not be important; if additional info is required, I can provide a photo).
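For anyone repeating steps 3 to 5 across several card combinations, the two checks can be scripted. The following is only a rough sketch: the function names are made up for illustration, the clinfo parsing assumes the output contains "Number of devices" lines (the exact layout varies between clinfo builds), and the keyword list used to filter dmesg is a guess rather than a complete list of amdgpu failure messages.

```python
import re
import subprocess


def count_opencl_devices(clinfo_text: str) -> int:
    """Sum the 'Number of devices' entries over all platforms in clinfo output.

    Assumption: each platform reports a line such as
    'Number of devices    2' (optionally with a colon after 'devices').
    """
    counts = re.findall(r"Number of devices:?\s+(\d+)", clinfo_text)
    return sum(int(n) for n in counts)


def amdgpu_error_lines(dmesg_text: str) -> list[str]:
    """Return dmesg lines that mention amdgpu together with an error keyword."""
    keywords = ("fatal", "error", "fail")  # hypothetical filter, not exhaustive
    hits = []
    for line in dmesg_text.splitlines():
        lowered = line.lower()
        if "amdgpu" in lowered and any(k in lowered for k in keywords):
            hits.append(line)
    return hits


def capture(cmd: list[str]) -> str:
    """Run a command and return its stdout as text (empty if it prints nothing)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
```

On the machine itself this could be used as `count_opencl_devices(capture(["clinfo"]))` and `amdgpu_error_lines(capture(["dmesg"]))`; reading dmesg may require root, depending on how the system is configured.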
So, it worked in the 1 x W6800, 1 x W7800 case. I have not tested with only one W7800 but can do so if that will help debug the issue.
Any help would be appreciated.
Thanks
I have the output from lspci -vvv prepared but I cannot post it here because it is too long. How can I send it to you?
use wetransfer
Hello???
GIGABYTE MZ72-HB0 (2 x 64-core Epycs) I can tell you which slots if that matters after I pull one of the boards out. I will do that later today.
If you look at the manual shown here: https://download.gigabyte.com/FileList/Manual/server_manual_MZ72-HB0_e_v10.pdf
On page 6 there is a diagram of the board layout. We are using slot indicated by "40" and the slot indicated by "36" there.
Can you try slot 36 & 37?
The boards will not fit in slot 36 and 37 because the slots are too close together. The W7800 is a two-slot pitch card.
Here is the lspci with one W7800 per your request. "clinfo" is also included, shown at the top.
Thank you. I filed a ticket for our Software team:
EXSWEUIT-1400
Hello,
Is there anything I can do with that ticket number to tell the status?
Also, how can I tell when new driver versions are released? I don't see how anyone is supposed to know about 23.20 since the website still sends the user to the "23.Q3" driver. There does not appear to be any sort of page that explains all of the new driver releases....
No, it is only for my own reference to double check later on. We monitor the progress every week. I will update you.
Is this something that will be addressed in a future driver update (hopefully near-future)?
I have systems deployed already with the dual socket so it would be a major expense to replace them to support the upgraded GPUs....
I don't have another system to test with at the moment, but may in the near future.
The issue is being investigated. I was not able to repro the issue using a single-socket system.
I would be happy to host an engineer or two at our facility if you want to send someone out.
I sent you an email.