Graphics Cards

cjb801
Adept I

RHEL 9.1 + W7800 + OpenCL issues

Hello,

I have a machine that we've been running for about a year now with dual W6800 graphics cards on RHEL 9.x (currently 9.1).  Everything has been working fine.  We recently upgraded to W7800 cards and have not been able to get them to work for OpenCL at all. We have tried the latest drivers (which do not list support for the W7800) as well as the 23.Q3 drivers (found via Google), which do list support for the W7800 (and OpenCL). These are the issues we have run into:

1) Across numerous install/uninstall attempts, the computer simply does not detect the W7800s via OpenCL.  Errors are printed during the bootup process.

2) After installing with the OpenCL "use case" (via amdgpu-install) the computer no longer boots at all.  Shortly after the start of the bootup process, what appears to be an overtemperature error is printed to the screen and the computer turns off entirely.  It is hard to tell if this is the exact error because the machine turns off so fast that I don't have time to read it carefully.  After removing the W7800s, the computer boots fine.  The cards are not hot at all, so that is clearly not the issue.  I am assuming that the driver is not reading the correct registers on the W7800, so it detects excessively high temperatures and pulls the fire alarm.

I replaced the W7800s with W6800s again and everything is fine, the cards are detected properly.  At the moment the W7800s are essentially paperweights for us.

Any advice would be appreciated.  Also, what is the schedule for proper support/drivers in Linux for W7800 and OpenCL (i.e., without having to use a driver that is effectively an isolated branch of the main line)?

0 Likes
19 Replies
fsadough
Moderator

0 Likes

Hello,

I had partial success but not complete success so I will describe what I did here:

0) Initial state is that the computer is working with two W6800 cards in it.

1) Remove the current amdgpu installation and install 23.20 (linked above).  Reboot and verify that the W6800s are working with clinfo.

2) Remove one W6800 and replace it with one W7800 (now the computer has one of each card).

3) Reboot and see one W6800 and one W7800 with clinfo.

4) Turn off the computer and install the second W7800. Check with clinfo: zero devices are found.

5) Check dmesg and see "amdgpu: Fatal error during GPU init" (among a few other non-error messages which may or may not be important; if additional info is required, I can provide a photo).

So, it worked in the 1 x W6800, 1 x W7800 case.  I have not tested with only one W7800 but can do it if that will help debug the issue.
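The checks in steps 3 to 5 can be scripted so the same information is captured after every card swap. The following is a minimal sketch in Python, not an AMD tool. The function names are hypothetical, it assumes `clinfo -l` prints one `Device #N` line per detected device, and reading `dmesg` may require root on some systems.

```python
import re
import subprocess


def count_opencl_devices(clinfo_list_output: str) -> int:
    """Count 'Device #N' lines in the text printed by `clinfo -l`."""
    return len(re.findall(r"Device #\d+", clinfo_list_output))


def amdgpu_error_lines(dmesg_output: str) -> list[str]:
    """Return amdgpu kernel log lines that look like errors or failures."""
    pattern = re.compile(r"error|fail|fatal", re.IGNORECASE)
    return [
        line for line in dmesg_output.splitlines()
        if "amdgpu" in line and pattern.search(line)
    ]


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout (empty string if it is missing)."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    except FileNotFoundError:
        return ""


if __name__ == "__main__":
    print("OpenCL devices:", count_opencl_devices(run(["clinfo", "-l"])))
    for line in amdgpu_error_lines(run(["dmesg"])):
        print(line)
```

Running this once per hardware configuration (two W6800s, one of each, two W7800s) gives a short, comparable record that fits in a forum post.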

Any help would be appreciated.

Thanks

  1. I need the exact make and model of your motherboard and CPU.
  2. Try installing one W7800 and run sudo lspci -vvv
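Full `lspci -vvv` output is long, so a trimmed summary can be easier to share. The sketch below is one hedged way to do that in Python: it keeps only the PCIe link capability and link status lines for devices whose header contains `[AMD/ATI]`. The function name is hypothetical, and the choice of `LnkCap`/`LnkSta` as the fields of interest is an assumption, since the thread does not say which part of the output is needed.

```python
import re


def amd_link_status(lspci_output: str) -> dict[str, list[str]]:
    """Map each AMD/ATI device header line to its LnkCap/LnkSta lines."""
    result: dict[str, list[str]] = {}
    current = None  # header of the device block being read, if it is AMD/ATI
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device block.
            current = line if "[AMD/ATI]" in line else None
            if current is not None:
                result[current] = []
        elif current is not None and re.match(r"\s*Lnk(Cap|Sta):", line):
            result[current].append(line.strip())
    return result
```

To use it, save the output first (for example with `sudo lspci -vvv > lspci.txt`), read the file into a string, and pass that string to `amd_link_status`. The complete file is still the safer thing to send if anything else in it turns out to matter.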
0 Likes

I have the output from lspci -vvv prepared but I cannot post it here because it is too long.  How can I send it to you?

0 Likes

Use WeTransfer.

0 Likes

Hello???

0 Likes
cjb801
Adept I

GIGABYTE MZ72-HB0 (2 x 64-core EPYCs).  I can tell you which slots, if that matters, after I pull one of the boards out.  I will do that later today.

0 Likes
cjb801
Adept I

If you look at the manual shown here: https://download.gigabyte.com/FileList/Manual/server_manual_MZ72-HB0_e_v10.pdf

On page 6 there is a diagram of the board layout.  We are using the slot indicated by "40" and the slot indicated by "36" there.

 

0 Likes

Can you try slot 36 & 37?

0 Likes
cjb801
Adept I

The boards will not fit in slot 36 and 37 because the slots are too close together. The W7800 is a two-slot pitch card.

0 Likes
cjb801
Adept I

Here is the lspci with one W7800 per your request.  "clinfo" is also included, shown at the top.

https://file.io/Z8uhubbyO9Tu

0 Likes

Thank you. I filed a ticket for our Software team:
EXSWEUIT-1400

0 Likes

Hello,

Is there anything I can do with that ticket number to tell the status?

Also, how can I tell when new driver versions are released?  I don't see how anyone is supposed to know about 23.20 since the website still sends the user to the "23.Q3" driver. There does not appear to be any page that lists all of the new driver releases.

 

0 Likes

No, it is only for my own reference to double check later on. We monitor the progress every week. I will update you.

0 Likes

  • Do you have any other system you can try the GPUs on?
  • The issue seems to be related to a dual socket system.
0 Likes

Is this something that will be addressed in a future driver update (hopefully near-future)?

I have systems deployed already with the dual socket so it would be a major expense to replace them to support the upgraded GPUs....

I don't have another system to test with at the moment, but may in the near future.

0 Likes

The issue is being investigated. I was not able to repro the issue using a single-socket system. 

0 Likes

I would be happy to host an engineer or two at our facility if you want to send someone out.   Otherwise, if you have any testing or debugging you want me to do (even test builds) then I can try them out and send results back.  I assume you can find my email address through my account.  Thank you

0 Likes

I sent you an email.

0 Likes